Good afternoon, everybody. I'm very pleased to be here — this is my first time in Spain, and it's a great opportunity for me to introduce Apache Kylin and some of its use cases. A little bit about myself: I'm the co-founder and CEO of Kyligence, the startup behind the open source project Apache Kylin. I'm also a co-creator of the project and now the PMC chair of Apache Kylin. Let me say a little about the startup. The name Kyligence means "Kylin plus intelligence." We created the open source project, and with the startup we wanted to add more intelligence on top — using AI and machine learning to augment analytics. We formed the company three years ago around Apache Kylin, raised our initial funding from Redpoint and Cisco, closed our Series A last year, and just closed our Series B led by Eight Roads, which is backed by Fidelity International. We were also very honored to be named one of CRN's top 10 big data startups just a few months ago. But today I will mostly cover Apache Kylin, the open source project itself. I will introduce a little of the background, then the technical highlights, and later on some very impressive use cases — companies using Apache Kylin to get insight from petabyte-scale data. In the last five minutes, I'll tell you what our startup is doing.

So first, the Apache Kylin project. We started developing it about five years ago while we were still working at eBay. eBay was one of the biggest Teradata users, with more than 20 petabytes of data in the EDW — but at that time we already had more than 200 petabytes in Hadoop. It made no sense to move all that data from Hadoop into the EDW just to feed the BI applications.
So at that moment, about five years ago, we started looking for a solution — something built on top of Hadoop to solve that challenge for very large, massive data sets, because the analysts wanted sub-second, interactive latency. That was the beginning of the project. After one year of development inside eBay, we open sourced it in 2014, and later we joined the Apache Software Foundation as an incubator project. After one year, in 2015, we graduated to a top-level project. Since graduation we have more than 1,000 adoptions all over the world — I will introduce some great use cases later — and we are very pleased to have received several recognitions from the industry. There are a lot of big users worldwide, which I'll come back to.

So let me talk about the problem we wanted to solve. When you move from a traditional EDW to big data — the data lake — something is missing. There are too many different technologies: Hive, Impala, Spark SQL, Drill, and so on, and you have to learn each of them. And from another angle, the presentation layer — the BI tools — only speak a SQL interface, yet they have to deal with all of that. That was the gap, and that was the challenge we had inside eBay, so we decided to change something. In the end, we brought the OLAP concept back into this ecosystem. OLAP is actually an old term — more than 30 years old — but it is very good: you build cubes on top of a traditional data warehouse, say Teradata or DB2. When you move to the Hadoop world, at that moment there was nothing like it, so we decided to build it. What benefits did we introduce? The first one is the semantic layer. You have a lot of technical things inside the data lake, right?
You define the columns, the fields, the tables. But the business users just want dimensions, measures, and some filters and conditions, right? That gap is what we call the semantic layer: the mapping from the technical to the business. The second benefit is what we call speed-up — SQL acceleration for big data. Whether it's Hive or even Spark SQL, it is still very slow when you have a very large data set plus very high concurrency; your cluster gets crushed when you have thousands or even 10,000 concurrent queries. So we introduced high performance together with high concurrency. Another thing is that we still use ANSI SQL, which means your analysts do not need to learn anything new about the technology. They just keep using their favorite daily tools — Excel, Power BI, Tableau — to access the data and get very fast answers. That is the key.

So how do we do it? Let me give you some technical highlights. The fundamental concept behind Apache Kylin is the OLAP cube. We pre-build aggregated results — the cube — and then you can do roll-up, drill-down, slice and dice; that's why it's called an OLAP cube. Again, this idea is maybe 30 years old. Think about it: your analysts always want insight from the data along different dimensions — which day, where, which kind of product — and measures — how much, how many you sold. That is what we call OLAP. And what we do is this: you have a lot of tables sitting in Hive, or even in an RDBMS, organized as a star schema, a snowflake schema, or some other model.
It makes sense to pull the data, do the manipulation, and store the result somewhere, so that the next time the same query comes in, it can get the result back from there without touching the original Hive tables — without kicking off another MapReduce job. That is what we call trading space for time: it requires some additional space to store the calculated results, but the benefit is a big speed-up for the queries, especially the queries that run every day, or even every few minutes, for everybody.

So that is the architecture in a nutshell. Originally we consumed data from Hive; now Kylin can consume from almost any SQL-on-Hadoop source, and today we even have a plug-in to consume data from an RDBMS. You can also consume from Kafka as a stream — a very useful case for building near-real-time analytics — and today you can consume from cloud sources as well. Inside Kylin, you just need to define the data model: tell the system which columns are dimensions, which are measures, and how to calculate them. Once the system has all the metadata, it automatically talks to your Hadoop cluster — using MapReduce, or today Spark, as the engine — to pull the data from the source, do the calculation, and store the result in HBase as key-value data. That is what we call cube processing. Once the data is there, any BI tool can consume it through our interfaces: we have an ODBC driver, a JDBC driver, and a REST API. You just submit ANSI SQL to the system, and it fetches the result directly from HBase, without touching Hive, without any MapReduce job. That is the concept, and I'd like to talk in more detail about it. So here is the SQL, right? Just a SELECT joining two tables, with a GROUP BY, an ORDER BY, and so on.
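A query of the shape I just described, submitted through Kylin's REST API, might look like the following sketch. This is an illustration, not production code: the host, project, table, and column names are placeholders I made up, and the default `ADMIN`/`KYLIN` credentials are only for demo installs.

```python
import base64
import json
from urllib import request

def build_kylin_query(host, project, sql, user="ADMIN", password="KYLIN"):
    """Build an HTTP request for Kylin's REST query endpoint.

    Kylin exposes POST /kylin/api/query, which accepts ANSI SQL and
    answers it from the pre-built cube instead of scanning Hive.
    """
    body = json.dumps({
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": 1000,
        "acceptPartial": False,
    }).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return request.Request(
        url=f"http://{host}:7070/kylin/api/query",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )

# The same aggregate query an analyst would run against Hive --
# Kylin answers it from the cube's precomputed results.
req = build_kylin_query(
    "kylin-server.example.com", "sales_project",
    "SELECT part_dt, SUM(price) AS revenue "
    "FROM kylin_sales GROUP BY part_dt ORDER BY part_dt",
)
# request.urlopen(req) would return the rows as JSON; omitted here.
```

The same SQL works unchanged over the ODBC and JDBC drivers, which is why tools like Tableau or Power BI can sit on top without knowing a cube is underneath.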
Without any pre-calculation or index, the plan is: table scan first, then the join, then the filter, then the aggregation, and finally the result goes back to the client. But if you have one billion rows, or even a hundred billion rows, that will be very, very slow. And even if you distribute your data across a very large system like Hadoop, when you have thousands or even hundreds of thousands of concurrent queries in the same second, the cluster is busy just shuffling data. So how do we solve that? Pre-calculation: we read the data, do the joins and the aggregation first, and store the result, so that the next time a query just fetches it. That is what we call pre-calculation. And we support not only the star schema, but also very complicated data models like snowflake and others.

How do we store it? We use HBase as the storage. HBase is a key-value store: the key is actually your dimension combination — if you have, say, 10 dimensions, we compute an encoding for each combination — and the value is your measures, the aggregated metrics. If you have a count, or even a distinct count, we save that value there. We did a lot of hard work to solve several very hard challenges. The first challenge is data explosion: if we pre-calculated every combination of dimensions, it would be a disaster. So we do what we call partial cubing, with a lot of optimization to reduce it. From our best practice, most production deployments cost only about 20 or 30 percent of the source data size in extra storage. So think about it: if you have 100 terabytes of data in Hive, querying it directly is very slow.
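To make the "key = dimension combination, value = aggregated measure" idea concrete, here is a toy sketch in Python — not Kylin's actual code, and the table and column names are invented. A full cube over n dimensions has 2^n cuboids; partial cubing prunes that set, and each pre-aggregated row becomes one key-value pair.

```python
from itertools import combinations

dimensions = ("day", "country", "product")
rows = [  # toy fact table: (day, country, product, price)
    ("2019-01-01", "ES", "phone", 100),
    ("2019-01-01", "ES", "laptop", 900),
    ("2019-01-01", "FR", "phone", 120),
    ("2019-01-02", "ES", "phone", 110),
]

def cuboids(dims):
    """Every subset of the dimensions -- a full cube has 2^n cuboids."""
    for r in range(len(dims) + 1):
        for combo in combinations(dims, r):
            yield combo

def build_cube(rows, dims, keep=None):
    """Pre-aggregate SUM(price) for each cuboid. The result is a
    key-value map, like Kylin's rows in HBase: key = cuboid plus the
    encoded dimension values, value = the aggregated measure."""
    cube = {}
    for cuboid in cuboids(dims):
        if keep is not None and cuboid not in keep:
            continue  # "partial cube": skip pruned combinations
        idx = [dims.index(d) for d in cuboid]
        for row in rows:
            key = (cuboid, tuple(row[i] for i in idx))
            cube[key] = cube.get(key, 0) + row[3]
    return cube

cube = build_cube(rows, dimensions)
# A "GROUP BY day, country" query is now a lookup, not a scan:
print(cube[(("day", "country"), ("2019-01-01", "ES"))])  # 1000
```

The cube costs extra storage for these precomputed rows — that is exactly the space-for-time trade, and partial cubing (the `keep` argument here) is what keeps the extra storage far below the 2^n worst case.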
But if I ask you for just 20 terabytes of additional storage, and in exchange I can guarantee that 90 percent of the queries return in one second — that's a good trade, right? That is how we do it. Once a query comes in, it just fetches the result; at worst it is an HBase table scan. And if you deploy HBase dedicated to Kylin, you can optimize it a lot, because it is a read-only workload: you can add, say, more memory for the cache and tune other things.

And here are the results, compared with SQL-on-Hadoop engines. The first diagram is from the Star Schema Benchmark, which is derived from TPC-H. You can see that most Kylin queries return in about one second, while the others take far longer. In the second chart, the X axis is the data scale: no matter how much data there is — from 1 million to 1 billion rows — Kylin's performance stays consistent at the seconds level, while any SQL-on-Hadoop engine slows down a lot as the volume grows. We also integrate very well with the BI tools. Today you can use open source BI tools — Saiku, Zeppelin, even Superset — to interact with Kylin services. And if you go with a commercial one, most commercial BI tools are already certified and integrated. I can tell you that just last week Microsoft released the latest version of Power BI with our enterprise connector packaged inside, so you can use it very easily — though that one is for the commercial edition. Okay, so that is the very basics, and I think you still have a lot of questions: can it really do that? How big are the cases it can serve? Let me give you some examples.
We have thousands of users all over the world, fitting various scenarios: from behavior analysis and log analysis to financial asset management and advertising — there are DMPs and DSPs using this technology — and even real-time analytics, gaming, and more. We publish a lot of the use cases; if you are interested, go to our website and see the powered-by page. Let me give you some detail.

The first one I want to talk about is Toutiao. This is a very interesting case, because Toutiao is the number one news application in China — and, I believe, close to it worldwide. You can imagine that practically everyone in China uses this application to consume news, everything. So they have a lot of user-behavior data, and one scenario is video: when you watch a video, they want the impression data, right? They built this impression-and-insight system on our technology. Guess how big the table is: one cube contains more than three trillion rows of data, and 90 percent of the queries still return within one second. That is really amazing — very, very large, and that's just one cube, one table. And they have hundreds of cubes serving thousands of analysts nationwide. This data is from last year; I believe the numbers have grown since. The benefit here — you can see the last point, and this is the second one — is the saving of cluster resources. Think about it: if you have 1,000 analysts submitting queries, and every query generates MapReduce jobs, imagine how busy your Hadoop cluster would be. You cannot manage that. With Kylin we calculate once and can serve the result any time.
Because the result is already there. And from our practice, 80 or even 90 percent of the data can be served in batch mode, or with near-real-time data, so that saves cluster resources very much.

Okay, the second case is Meituan — still in China, but this one is really huge. It is the biggest O2O (online-to-offline) service now. I have to say, without this kind of application we could hardly get by in China today. When we go to the office, we use it for ride-hailing, something like Uber. At lunchtime, we order food and they deliver it within half an hour. If we want to see a movie, we buy the ticket online through it. So you can imagine it is a really, really huge one. Today their application on top of Apache Kylin already holds more than a petabyte of data — and that is just the cube data, so you can guess how big the source data is. And look at the latency: the 90th-percentile latency is less than 1.2 seconds. That is a really great result. They did a lot of optimization behind it, using a lot of technology, even SSDs to speed things up on the hardware side. And the concurrency is very high: they serve about 3.8 million SQL queries per day, most of them in the daytime. So this is another case.

The next one I'd like to introduce is Yahoo! Japan. Yahoo! Japan is the most visited website in Japan, and this part of it is an e-commerce site — a lot of merchants sell through the website, so it is like Amazon or eBay in Japan. They have all that data, and they wanted to open it up to the merchants, right?
Because every merchant wants to know their GMV, how many customers they have, and how much they sold each day. Previously they used Hadoop directly, and it was really, really slow and could not scale to a large group of users. So later they came to us and built an application. They published a blog post about it just two weeks ago — sorry, it is still in Japanese, but we have asked them to translate it, and I think the English version will be published on their engineering blog very soon. They go into a lot of detail. You can see in the diagram the query latency: most queries take less than 800 milliseconds. That is really fast.

But the most interesting best practice from this case is something else: what we call compute-storage separation. For various reasons, the data sits in a data center in California — everything is there — but the consumers, the users, are in Japan. If they had built the application in the California data center, the distance would make the performance really bad; from our experience at eBay, one report could take more than 10 seconds to come back. So what did they do? Very interesting: their huge Hadoop cluster is only responsible for the computation — reading the data from Hive, building the cube, and storing it temporarily. Then the cube is copied to another cluster deployed in Tokyo, and the application is built on top of that Japanese cluster. With that setup, everybody in Japan is happy with the performance and the latency, because they get their report results in about one second. That is really, really great.
And beyond that, you can imagine using this capability to calculate the data inside your own data center and ship only the result — say, the aggregated result — to a cloud or to another region, and build something there. This is doable and we already have a lot of practice with it; with our cloud version, it is very easy to deploy all over the world. So this is very interesting.

Another case is Xiaomi. Xiaomi is a smartphone vendor — they produce a lot of smartphones and other smart devices, and people told me they are very popular here in Spain. They have more than 80 business lines relying on our technology today. They built something they call a data factory — a very interesting name, right? — to ingest the data, run the calculations, and serve different internal users. They have different services: the Kylin part focuses on the batch side, and they have another branch, as you can see, using Kudu for the streaming side, so they can serve different kinds of queries. They will also publish a use case later; it is very interesting. By the way, most of the smartphone vendors are our users already. I have not confirmed all of it publicly, but I can tell you that Huawei, Xiaomi, OPPO, Vivo — most of the top smartphone vendors — are using our technology to get insight from their data.

The next use case is Strikingly. Maybe some of you already know them: they are a website-hosting application for everyone. Anyone can go to their site and, with a few clicks and no technical skills, build their own website — even some big names, like Taylor Swift, have used this kind of technology to build their own sites.
So once you have your own website, what do you want to know? Everybody wants this: maybe at night I want to check how many people visited my website today, right? Any comments, anything else. Everybody needs that. So they built this on top of AWS. Previously they were actually using Redshift, and they switched from Redshift to Kylin. Because ours is a pre-calculation technology, it calculates once and stores the result on S3, and the next time a query comes in, it just fetches the result without any computing. That is a very good fit. They have millions of users worldwide, and the application supports them very well.

Next, I will talk about where this OLAP technology still has limitations and obstacles that we want to overcome. We have discussed this a lot in the community, and we are trying to bring the approach to the next generation. Some of these ideas are already happening in the community — some are under development, some are still under discussion. The first is what we call the new storage. Open source Kylin uses HBase as its storage, and it is not the best storage for this purpose. For example, HBase has no secondary index, so it does not handle very complicated queries well. Think about it: if you have high-cardinality dimensions and you want a fast result filtering on them — say, filtering by phone number, by page ID, or some other high-cardinality column — it cannot perform very well. Today it only performs well for one high-cardinality dimension.
So we discussed it a lot, and we are trying to remove the HBase dependency and move to a real columnar storage — Parquet is the target. This is already on the way; you can check our mailing list to see what will happen. And I can tell you that our enterprise version has already replaced HBase with this, and our company will contribute it back to the community. Our enterprise version and cloud version already run without any HBase.

The second thing is real-time support. Today we can consume data from Kafka, which serves near-real-time analytics — we have use cases with, say, one-minute latency before you can query the data. But the community wants more, to reduce the time to market, so there is another discussion about supporting true real-time ingestion. We are talking about it now, and it should happen very soon. Another thing is containerization. Everybody talks about Kubernetes, right? Everybody is moving to Kubernetes. Deployment on Hadoop is still very complicated, because of the dependency on the Hadoop libraries. We just published a blog post: we can now submit the open source Kylin Spark jobs to a Kubernetes cluster, and it already works. We are trying to make it even easier to use, and this is already happening. Most of this we are targeting to package into a release — maybe Apache Kylin 3.0 — and I think it should happen in the next half year.

So that is Apache Kylin. Now let me talk a little about the company itself, and what kind of value we add on top of the open source project. I think everybody knows this diagram very well: we put a lot of effort into data manipulation — cleaning up the data, transforming it, aggregating it, storing it somewhere.
A lot of effort goes there. For example, one of our biggest customers in China has a team of 500 people whose job is just to get the data, clean it up, store it somewhere, and make reports for the business. Just for that, right? That does not make sense, and it takes a long, long time to implement any project. But think about it: the world changes so fast. You cannot just sit there and let things go; you have to find something new to handle it. Our philosophy is that people — human beings — are very good at decision making: give them some information and they will say yes or no. That is easy for us. But we are not good at routine, repetitive work — you would go crazy doing the same job the same way, day in, day out. Machines, however, are very good at routine work. So the question is how we can train the machine to do that dirty work. Thanks to recent technology innovations — machine learning, deep learning — we can actually leverage this. So our idea is to use machine learning to do it.

For example, when you visit different websites and show some interest, the next time you go to Amazon or eBay, it recommends something to you: "Here is something you're interested in — do you want to buy it?" On Amazon you can buy with one click, and the money is gone, right? So why does most of the technical work on the data warehouse side, the data analytics side, still need so much human effort? It doesn't make sense. Think about it: you drive a Tesla from home to the office on autopilot, very satisfied with the high technology, very excited. But when you get to your desk and open your laptop, what happens? Oh my God, right?
A lot of SQL, right? Pulling data by hand, generating things the same way you did years ago — and you see a lot of other people doing exactly the same. It is really ridiculous. That is what we think we can change. So what we do is take the SQL history from your application, if you already have one, and learn from the data schema. We understand the data profile, and our patented machine learning engine recommends data models to you. For example, you feed, say, 1,000 SQL queries into our engine, and the engine tells you: here are five, maybe six, data models that can speed up those queries — do you want to go ahead? You just click, and everything is done. We put a lot of effort into this, and it is already in our latest release.

With that, we can serve the BI scenarios — reporting, dashboards — and also real-time analytics, benefiting from the high performance and high concurrency. Many of our customers use us as data-as-a-service. For example, we just co-presented a case with China Construction Bank, one of the top two banks in China. They built a mobile application that lets every employee get metrics at their fingertips: as they type, the results pop up, and the back-end service relies on us, because we can serve high concurrency with high performance — they have more than 400,000 users in China. And the entire platform can run on-premises, in the cloud, or in containers. So this is what we call machine-learning-augmented analytics, as I just explained, and it is already available in our latest release. It can save you a lot of effort and get you home earlier.
The next thing is the cloud. I will not go into too much detail, but this capability runs on the cloud very easily — already on Azure, on AWS, on Google Cloud, and other vendors. Just one thing I want to tell you: when you move to the cloud, what do you want to avoid? Wasted resources. You can spin up 1,000 nodes but need them for only one hour of calculation — that means wasting a lot of money. So we introduced a feature we call auto-scaling: when your data peak is coming, our system monitors the usage and the resources and automatically scales out to get the job done, and once the demand goes down, it scales the cluster back in, all automatically.

Let me give you two more use cases. This one is commercial, and I think a lot of people will be interested. China UnionPay is something like Visa; their transaction count is already above Visa's — not the amount, but the number of transactions is already huge. They built an analytics system about 10 years ago using IBM Cognos, serving all of their analytics applications. But they ended up with thousands of IBM Cognos cubes, and the management and maintenance became a real disaster — and Cognos cannot run on a distributed system, so it could not keep up. The two teams worked very hard together, and it is already done. I can give you just one number: a single Kylin cube replaced more than 800 IBM Cognos cubes, with better performance. Previously they needed, say, 10 minutes to get a report; now they get it in seconds. And the time to market used to be about four days; now the analysts can get new data in half a day. It is very, very fast for them. And the other one is retail.
This is the Japanese clothing vendor Uniqlo. In China they sell through many different channels: they have local stores, their own channels, and also a lot of e-commerce channels — Alibaba, JD.com, WeChat, and more. So their managers, the business owners, definitely need to know the performance of the different channels and of the different products. That is what we built for them, and it took less than a month to get it done. It runs on Azure, using HDInsight, Kyligence, and Power BI.

Okay. Lastly, I'd like to show that we have great partnerships all over the world. We are certified with all the Hadoop distributors, certified with all the BI vendors, and we run on all the cloud providers. And what is really exciting is that we have a local partner here in Spain, and we are talking with others now. Okay, so that is my presentation. I'm not sure if we still have time — one minute? Maybe I can take one question. Any questions?

[Audience member:] Thank you, that was great. If a team manager asked you what added value Kylin has compared with Druid, what would your answer be? I'm more focused on streaming applications.

[Speaker:] Yeah, I know. Druid is a really great project, and it is very good at real-time OLAP; people leverage it for alerting, advertising automation, and things like that. Really good. But it has some limitations, and that is where Kylin is very good, because in my view Druid is designed for machines.
So if you want an alerting system, if you want automation that reacts within milliseconds and takes some action, Druid is the best choice for you. Kylin is designed for analysts, for people — that is why we introduced the multi-dimensional model. In Druid there is only one table, so you have to denormalize your data into a single table, and the SQL support is not very good: you cannot join different tables. Kylin is different: you can use it to join many tables — even hundreds, even very big ones. That is the difference. I would say each of us is good at a different segment. We have a more detailed discussion about this on our mailing list; if you are interested, you can check it out. Thank you.

Okay. Thank you very much. Thank you.