Hey, good morning, everyone. It's really a great pleasure for me to be here to present again. Last year I was here talking about our open source project, Apache Kylin, the use cases, and a lot of great ideas from there. This year I would like to talk about another topic, which we call Simplify Data Analytics Over the Cloud. I will introduce what our innovation is, what our idea is, and what kind of innovation can help you simplify your analytics and make your life easier.

My name is Luke Han. Like I said, I'm a co-founder and the CEO at Kyligence, and also the creator and former PMC chair of Apache Kylin, an ASF member, and a Microsoft Regional Director. That last one is a community role, not a Microsoft employee, something like a Microsoft MVP.

A little bit about the background of our startup. We call it Kyligence, meaning Kylin plus intelligence. Well, Intel took the "Intel," so we could only take the "-ligence." So we combined the two: we want to leverage our open source project, Apache Kylin, and bring intelligence capabilities on top of it. That's the company. Our mission is what we call the AI Augmented Data Warehouse, because we believe this industry, the data industry, is still living in the old technology days. It still requires a lot of human effort. You are still running scripts like you did maybe ten years ago, using technology from maybe twenty years ago. This is what we are trying to change. We have closed four rounds of funding from top VCs, like Redpoint, Cisco, and Fidelity International, among others. And we are running exactly the same model as Databricks with Spark, and as Elastic: we have a global open source community, and we offer a commercial product to run the business for our customers. So that's the background of our company.

In the past three years, we've won a lot of great customers. Sorry about that, most of them are Chinese, but most of them are actually global Fortune 500 companies. Very big. And their use cases are amazing: huge data volumes, where the legacy technology can no longer serve today's requirements. So they switched from the old technology to us, or built something new on us. We are mainly focused on four industries. Financial services, for sure: banks, insurance companies, securities firms, and payment systems. Manufacturing, like General Motors Shanghai, Volkswagen China, and Porsche. I like that car. Smartphones: Vivo, Huawei, OPPO, and Xiaomi, for sure. And we also have a lot of retail customers, from McDonald's to Starbucks to KFC, and another coffee brand from the UK very soon. We also have customers in the United States; we are talking to insurance companies, banks, and other great financial firms we will announce very soon. So this is the background. In the past three years, we've earned the trust of those customers very well.

So let me talk a little bit about the beginning of our story. The beginning of the story is Apache Kylin, the open source project. In this year's data and AI landscape, you can see Apache Kylin is placed in the framework part of the entire landscape, the same as Hadoop, Spark, YARN, Mesos, even Kubernetes. It is an infrastructure, a framework for the data itself. So this is the open source operation. And yeah, this landscape has actually been published for many years.
So you can check it every year. And this community is great. I will pick two things. The first is the GitHub stars: over the past five years you can see a lot of people like our open source project. But the JIRA issues are actually the real world, right? That means we created a lot of bugs, right? But it also means a lot of customers and users are using this open source project: they need bug fixes and also new features. So you can see the open source community is doing very well. And I can give you another number: adoption, thousands of adoptions globally, from very big names, even like Apple, Amazon, Microsoft, Yahoo Japan, Cisco, Walmart, OLX in Europe, and also a lot of giants in China, from the very huge ones like Baidu, Alibaba, Tencent, and also WeChat and QQ Music, a lot of them. Those users have very huge data challenges. We're always talking about hundreds of billions of rows of data. With such data volumes, how can you bring fast access to that data? How can you enable your analysts to get insights at the seconds level? This is the problem the open source Kylin resolves.

And yeah, here is a summary of where we position the open source project. We see ourselves helping users manage their golden data over the data lake. Everybody knows, right, garbage in, garbage out. You store hundreds of terabytes, even a petabyte of data in your Hadoop cluster. But where is the most valuable data, how do you manage it, how do you govern it? This is what comes from the open source project, Apache Kylin. We developed it five years ago at eBay, and we open sourced it. We introduced the semantic layer over the data lake, so you can build a semantic layer over the Hadoop side. But that's not enough. This is an abstraction layer, but not enough. The biggest challenge is performance. So we use MOLAP technology, which means pre-calculation. We take the data from the source, say Hive, or say Spark SQL. And using the semantic layer, because we already have the multi-dimensional data model there, we pre-calculate it, and we can even build an index over it. Then we store it to, say, HDFS or somewhere else. And the next time the same kind of query comes, it does not touch Hive and does not fire any MapReduce job; it goes directly to the pre-computed result and comes back at the seconds level. That is the core concept. And think about it: you do not need to manage all the data in the data lake. You only need to focus on maybe 20% of the data, because that 20% of the data gives you 80% of the value, right? That is the rule. And also, yeah, today streaming is supported as well. We can consume data not only in a batch model; we can also consume it in a streaming model, say from Kafka. And our latest version, Apache Kylin 3.0, is being released now. We have already released two beta versions and will release a generally available version this year. It will introduce real real-time: the time to insight will be reduced to the second level, even the millisecond level, something close to zero.
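To make the pre-calculation idea concrete, here is a minimal sketch in plain PySpark, not Kylin's actual engine: the table, column, and path names are hypothetical, and the real cubing engine builds and indexes many dimension combinations rather than just one. The point is that the expensive scan happens once, offline, and later queries read the small pre-computed result.

```python
# Minimal sketch of MOLAP-style pre-calculation with plain PySpark.
# All table, column, and path names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("precalc-sketch").getOrCreate()

# 1. Offline build step: scan the huge fact table ONCE and persist
#    an aggregate ("cuboid") over the dimensions analysts actually use.
sales = spark.table("hive_db.sales_fact")          # hypothetical source table
cuboid = (sales
          .groupBy("dt", "region", "product")      # chosen dimensions
          .agg(F.sum("amount").alias("total_amount"),
               F.count("*").alias("order_cnt")))
cuboid.write.mode("overwrite").parquet(
    "/warehouse/cuboids/sales_by_dt_region_product")

# 2. Online serving step: the aggregate query now reads the small
#    pre-calculated result instead of scanning billions of raw rows,
#    which is how second-level latency becomes possible.
agg = spark.read.parquet("/warehouse/cuboids/sales_by_dt_region_product")
agg.createOrReplaceTempView("sales_cuboid")
spark.sql("""
    SELECT region, SUM(total_amount) AS revenue
    FROM sales_cuboid
    WHERE dt >= '2019-01-01'
    GROUP BY region
""").show()
```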
Okay, so that is the concept behind it. Before I talk about the cloud, let me cover one use case. This is China UnionPay. We helped this customer replace their IBM Cognos application and its more than 1,200 cubes. It's not really about the cubes, it's about the ETL jobs: to generate thousands of cubes, you need thousands of jobs producing them, right? That overhead is very, very huge. Now they only need one ETL job, they generate just a couple of cubes, and they exceed the limitations of IBM Cognos. Previously they could analyze no more than about 12 dimensions; now they can bring in 100 dimensions. And now they can bring much more data into one cube, and others can get that data in seconds. This changed a lot. This is game-changing technology for them. They are very happy, because they no longer need to manage a very complicated application. Previously they had nine machines for IBM Cognos plus many machines for the DB2 servers. Now one cluster is good enough. Okay, so that's the use case.

Okay. And everybody knows data is the new oil, right? Everybody is talking about that. So another trend is that data is moving to the cloud, for sure. Everybody is moving data to AWS, to Azure, to Google, to somewhere. The problem is the same; we call it the chaos happening again. Think about it: where is your data? Well, it is stored in different databases in the cloud, right? MySQL, Postgres, Aurora, S3, whatever. Or maybe in a stream. And your boss comes to you and says one cloud vendor is too risky, so we need a multi-cloud strategy. That means you need redundancy. Oh my God. Then your data will be placed everywhere, right? One, maybe two cloud vendors is still not good enough; how about three? And how about the copies of data? How can you combine all this data? Your business users, your analysts, will be in pain, because they cannot get a single source of truth for that data. And they have to learn different technologies. They have to learn Java, they have to learn Scala, they even have to learn Python. And we give them a very fancy name, we call them data scientists, right?

So here is what's actually missing. First is what we call the intelligent semantic layer. The semantic layer is very important for data, right? It is very old data warehousing theory; many people forgot it, but everybody has been coming back to it in the past two to three years. Second, it's easy to get high concurrency or high performance alone, but how can you combine high performance and high concurrency together in the cloud? What is the concurrency of your Redshift cluster? Of your Snowflake cluster? Think about that. Third, no more Hadoop on the cloud. Hadoop is very, very great, okay? It's a promising technology that can handle hundreds of petabytes of data. It's very good on premises, but it's not good in the cloud: the maintenance, the management, and the overhead are very, very heavy, right? And I think you guys know the price of cloud Hadoop. If you want to store, say, one terabyte in a cloud Hadoop cluster, how much does it cost? Think about that. And the last one is automation. If you bring everything to the cloud, you want automation. So these are the missing parts.

And what kind of benefits can we get from the cloud? The first is what we call cloud native. Cloud native means the storage and the computing are separated.
The storage will be very cheap, and the computing can be provisioned on the fly to reduce the cost. The second benefit comes from elastic scaling, right? You do not need to keep all your nodes running overnight when actually nobody is accessing the data. And another is a lower TCO. When you bring your application and your data to the cloud, you always start from the TCO, right? But whether it's real, how much it really is, you need to calculate.

So we introduced our Kyligence Cloud. The idea is that we try to simplify those things. We know different datasets sit in different storage on every cloud. Maybe you no longer have HDFS, but you still store the data in different data stores, right? Databases, files, even streams or something else. So the first thing we do is bring in the semantic layer. That means we take the pieces of data definitions from the different sources, and then we build what we call a multi-dimensional data model for the user. So this is the first thing: we unify the data into a multi-dimensional data model, a unified data view, right? And like the open source version, we use Spark to process all of it, and we persist the calculated results, the cubes, the indexes, and everything else to the cloud storage natively. We do not store to HDFS anymore in the cloud. We store directly to S3, directly to Azure Blob Storage. That removes a lot of overhead, right? And that's still not enough: we introduce one more thing we call the AI augmented engine. Even though the two layers below are great, you still need people who know the data, know the technology, have a skill set for Scala, Spark, whatever, and also understand the business. You actually cannot find those people, and they would be very expensive, right? So we introduced what we call the AI augmentation engine to automatically build those selected models. I will introduce the details later. Then, yeah, we already support most of the popular cloud vendors: Azure, AWS, Google, and also Alibaba. On top, we offer a SQL interface, an ANSI SQL interface, to all the consumers, from BI tools, dashboards, and open source notebooks to machine learning. We can speed up data preparation and also serve huge-scale data consumption from our side. And I also put WeChat there: we can enable a WeChat application to interact with your cloud data. Okay, it's very easy. So this is the architecture of our cloud, and this is our mission: to simplify data analytics over the cloud.

So let me talk about the details. The first one we call Spark native. In the latest version of Kyligence Cloud, we removed the Hadoop overhead in the cloud. We read the data, we build the cubes and the indexes, we store the data, and we even serve the queries, all relying on Spark. No Hadoop, no MapReduce, no YARN, nothing. Only Spark, natively, and the cloud. This is very simple, and it brings great benefits. I can give you an example: now you can provision one cluster in several minutes. For example, on AWS, it can definitely take less than five minutes to give you a cluster, right? Actually a little bit longer sometimes, but still around five minutes.
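As a hedged illustration of the "no Hadoop, just Spark and cloud storage" point: a Spark job can persist pre-calculated results straight to object storage through the s3a:// connector, so only the thin client libraries are needed, not a running Hadoop cluster. The bucket name and credentials setup below are assumptions for the sketch, not anything from the talk.

```python
# Sketch: persist a pre-calculated cuboid directly to S3, no HDFS/YARN.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cloud-native-sketch")
         # s3a:// comes from the hadoop-aws client library; only the thin
         # client jars are needed, not a running Hadoop cluster.
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
         .getOrCreate())

# Hypothetical paths: read a locally built cuboid and write it to a bucket.
cuboid = spark.read.parquet("/warehouse/cuboids/sales_by_dt_region_product")
cuboid.write.mode("overwrite").parquet(
    "s3a://my-analytics-bucket/cuboids/sales")
```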
Compare that to our previous version, which relied on Hadoop, like EMR or HDInsight. It would take at least 30 minutes to provision the cluster, and the overhead was really, really huge, because when you provision a Hadoop cluster, it provisions YARN, ZooKeeper, MapReduce, a lot of things. And if you want to use Hive, oh my God, there are also the Hive services, like the Hive metastore service, a lot of service overhead. But now you do not need that. You just need cloud storage and our nodes. That's good enough, okay.

The second thing we call the semantic layer. I'm bringing my more than 20 years in the data warehouse domain here. We know that to govern and manage your data, the key is to first build a unified semantic layer, unified across your organization. One of my customers, China Construction Bank, came to us; they had spent more than 10 years on one thing: making sure every KPI, every metric, has the same definition across the organization, right? So when you mention, hey, this is GMV, everybody talks in the same language, with the same definition. And you have one single source of truth to find that data, that metric. This is what we call the semantic layer, right? And we can also serve very complicated calculated metrics, and even dimensions. You have the source data, and your analysts always serve the business, and they always want some metrics on the fly. They should not have to ask people to bring those metrics into the source data and reprocess everything; they should do it on the fly, just like when they are using Excel, where it's very easy to put an expression in. So we support what we call computed columns, calculated columns. And you can see they can be quite complicated: you can put in an exchange rate with a CASE WHEN, or a year-to-date, or something else, right? Okay, so that is the semantic layer, okay?

And how do we position ourselves against what we call the cloud data warehouse? We build the semantic layer and the OLAP on top of the cloud data warehouse. On the sourcing side, it's very easy to use a cloud data warehouse, say Redshift, Snowflake, or even S3, to land the data and transform the data. In that domain they are great, especially the MPP technology: you can write very complicated SQL to do the transformation, to do the cleanup, right? We cannot handle that; they are very good at it. But we can build the aggregates and indexes over that, plus the semantic layer. We can serve very high concurrency, we can guarantee the SQL response time, and we store all the data in cloud storage. So that is the positioning, and we can serve Excel and Tableau. And yeah, if you only have, say, hundreds of records, do not use us, it's overhead for you. But if you have millions of records, we definitely help a lot. And not only big data: we can also serve medium-sized, even small data sizes, when you want a semantic layer over them. It's very easy to use, and it's very easy to introduce this tool to your analysts.
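Going back to the computed-column example for a moment, here is a rough sketch, under assumed table and column names and made-up exchange rates, of what such Excel-style derived expressions look like when evaluated with Spark: a CASE WHEN currency conversion and a year-to-date flag, defined once in the model rather than baked into the upstream ETL.

```python
# Sketch of semantic-layer "computed columns" as Spark SQL expressions.
# Table name, column names, and the fixed rates are all illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("computed-column-sketch").getOrCreate()
orders = spark.table("hive_db.orders")  # hypothetical source table

orders_with_cc = (orders
    # currency conversion as a CASE WHEN, like the exchange-rate example
    .withColumn("amount_usd", F.expr("""
        CASE currency
            WHEN 'CNY' THEN amount / 7.0
            WHEN 'EUR' THEN amount * 1.1
            ELSE amount
        END"""))
    # a year-to-date flag that downstream YTD measures can filter on
    .withColumn("is_ytd", F.expr(
        "year(order_date) = year(current_date()) "
        "AND order_date <= current_date()")))
orders_with_cc.show()
```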
Those analysts have actually been unsatisfied over the past decade, because everybody wants those people to be scientists, to be programmers, but they actually only like SQL and Excel, right? They don't like programming. So this is where we try to bring our added value to the cloud data warehouse.

So let me show a demo, okay. This is a demo of how we can pull from Snowflake and build a semantic layer over it, okay. So yeah, just a little bit. You can see, here we load the data into Snowflake; it is MPP data, right? Then we natively connect to Snowflake, we just provide the credentials, and we can pull all the metadata back. And then we can build the semantic layer; you can see the semantic layer for Snowflake. Okay, it's very easy. You can see it builds, this is a star schema, I think, yeah. And this part I will talk about later; we can serve not only Snowflake. Yeah, and this is a complicated calculated column, okay. So you can build this one, and you can serve visualization very well using our built-in engine, what we call the agile BI tool, and also serve Excel or other commercial BI tools. I will skip over this. So you can see it's very good: we can easily build the semantic layer, and you can build the cube to speed up the queries, and also the index.

But like I mentioned before, how many people do you need to do that, right? So I can tell another story that's very interesting. I have a very good friend. He was very excited last year; he came to me and said, hey, I bought a Tesla Model 3. And he invited me over and showed me the autopilot, boom, hands off, right, very fancy. Oh my God, it's from the future. And I asked him one question: after you drive to the office in your Tesla Model 3, what happens when you open up your laptop? That's the problem, okay. The car industry has already changed: autopilot, and even the Tesla factory, there are nearly no humans there, right, only a few managers. So the problem comes back to this question. The reality is this. I told my friend: the only thing you've changed is that you have a fancy Mac laptop, but your scripts are still from 10 years ago. The ETL jobs, the SQL scripts, the databases, right? They are many, many years old. You cannot do that at scale when your business is growing. And yeah, I'm from China; the economic development there is very good, and a lot of businesses are growing very fast. How can you find enough skilled people? You cannot, right? Because the business grows exponentially, but the supply of people does not. So the only way is automation. That is what we believe: we believe automation is the key. And this also comes from Gartner; last year they said augmented analytics is the future of data and analytics.

How do we bring this idea to the product? We actually released our first augmented analytics capability about one and a half years ago, before Gartner released that report, okay? But at that moment we just called it auto modeling. Think about it: in a bigger organization, everybody is consuming the data.
You already have a lot of reports and analytics systems, so those behaviors, those logs, are already being tracked. It's like when you go to Amazon or eBay and do a search or buy something, it recommends something to you, right? For example, you buy a car, and next time you go to eBay, it knows, and it will recommend things you might buy for that car. Right, this is what we call a recommender system. It's very good on the commercial side. But what about the enterprise side? Actually your boss, your business users, only care about a handful of metrics or data. Those signals are already there, and we can use the same approach: learn from the history and automatically identify what we call the most valuable data, the hot data. And we can also learn what the cold data is. For example, one of our customers said they produce more than 10,000 reports every year, but nobody dares to say, hey, those reports should be deleted. Now they know, because of the usage, because we know the behavior of the users. And the system can continually learn from that. Okay, so that is the algorithm behind it. We can import the SQL from your Oracle, from your Greenplum, from your Hive. And we also have a push-down capability, we call it smart push-down: at the beginning, if there is no history for you, we can just plug into your application and let your users use it; the more usage, the more logs we have, and the more we know about the datasets. For example, at the beginning you say, hey, I want to build a 10-dimension data model, and you just ship it to your users. Actually, they only care about, say, three dimensions and another five metrics; after several months you have collected a lot of logs, and the system can recommend, hey, here is a better way to reduce that model. Or it can help identify, hey, here is another hot dataset, but the SQL durations there are still very long, say 10 seconds or maybe minutes, and the system will ask you, hey, do you want to speed that up? If yes, just click one button; that's good enough. Okay, so yeah.

And this is what we call the rule-based engine. You can define the rules. We collect all the SQL and use machine learning to analyze it, so we can find, say, the SQL queries that take more than three seconds and speed them up. Or, as one of my customers, China Merchants Bank, said: every SQL query my big boss touches, we have to ensure it is sped up. Something like that. You can define those rules inside our platform.

And let me show a demo. Okay, so these SQL queries we captured from the system, or they may be imported, okay? And we can accelerate them, with rules behind all of that, okay? And after they are accelerated, you can directly drag and drop from any of your BI tools, and it's very fast. For example, previously the Tableau and Greenplum combination was very slow. We plug in there, run one pass to generate all the SQL, and we can accelerate it automatically, okay? So yeah, this also happens in the cloud.
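Here is a deliberately simplified sketch of what such a rule could look like: scan a query log and flag SQL patterns that are both frequent and slower than a threshold. The log schema, thresholds, and function names are all invented for illustration, and the real engine also mines dimensions and measures out of the SQL text with machine learning, which is omitted here.

```python
# Sketch of a "speed up SQL slower than 3 seconds" rule over a query log.
# Log format, thresholds, and names are hypothetical.
from collections import defaultdict

QUERY_LOG = [  # (normalized_sql, duration_seconds)
    ("SELECT region, SUM(amount) FROM sales GROUP BY region", 12.4),
    ("SELECT region, SUM(amount) FROM sales GROUP BY region", 9.8),
    ("SELECT COUNT(*) FROM users", 0.3),
]

def recommend_acceleration(log, min_seconds=3.0, min_hits=2):
    """Return SQL patterns that are run often AND are slow on average."""
    stats = defaultdict(lambda: {"hits": 0, "total": 0.0})
    for sql, seconds in log:
        stats[sql]["hits"] += 1
        stats[sql]["total"] += seconds
    return [sql for sql, s in stats.items()
            if s["hits"] >= min_hits and s["total"] / s["hits"] >= min_seconds]

for sql in recommend_acceleration(QUERY_LOG):
    print("Candidate for pre-calculation:", sql)
```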
So now let me talk about another topic: high concurrency. So far so good: you've resolved the storage, you've resolved the human effort. And when you bring your data to the cloud, you definitely want to serve thousands of users all over the world, right? But if you are using one MPP technology, it cannot do that. One cluster can only serve a very low concurrency number. So how can you do that? One answer is, you need another cluster, and another cluster, but that actually doesn't scale. So what scales? We store the data, the generated cubes, indexes, and all the raw data, in S3, in Azure Blob Storage, in Google Cloud, okay? And we provision what we call computing nodes on top of that. There are two parts, okay? The first we call building: that means we process the data, right? When your data peak time comes, we can expand the cluster and use the most available resources to build the cubes and indexes, and after the workload is done, we scale back down, right? That is the first part. The second part is serving, the query side. At the beginning, maybe you only need one small node alive to serve at night. But, for example, Black Friday is coming, right? How many people will come? In that case you can provision a very large number of query nodes, but you do not need to copy any data; you do not need to provision a different cluster with its own copy of the data, right? So that is what we call high concurrency, okay? Our test results are very good, okay?

And yeah, this is what we call elastic scaling, okay? It will handle the peak time, you know? Most of the data actually comes as a batch at night, okay, in the night. So it can handle that: our system monitors the usage, and we have rules to expand the cluster, yeah, but within your quota, okay? And also we can use spot instances to reduce your total cost, okay? Yeah.

And another very important one is security. When you bring the data there, you do not want to generate one dataset for Germany, one dataset for Spain, one dataset for England, right? You want just one dataset, with a very good ACL applied on top of it. We support this; as a result, we support what we call cell-level ACL, so you can grant access at that level, okay? And we also support an SDK, so you can integrate these things into your own application, okay?
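As a hedged sketch of that "one dataset plus ACL" idea: keep a single shared dataset and inject a row-level filter based on who is asking, instead of maintaining one copy per country. The user-to-country mapping below is invented, and in the actual product the ACLs are configured in the platform rather than hand-rolled like this.

```python
# Sketch: one shared dataset, row-level access control per user.
# Paths, users, and the ACL table are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-acl-sketch").getOrCreate()
sales = spark.read.parquet("s3a://my-analytics-bucket/cuboids/sales")

USER_SCOPE = {          # hypothetical row-level ACL mapping
    "analyst_de": "Germany",
    "analyst_es": "Spain",
    "analyst_uk": "England",
}

def scoped_view(user):
    """Return the single shared dataset, filtered to rows this user may see."""
    country = USER_SCOPE[user]
    return sales.where(sales.region == country)

scoped_view("analyst_de").show()
```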
And yeah, Excel, always. It's very easy: you just download the ODBC driver, and it can connect to the server. By the way, since last year Power BI has embedded our connector in its releases, so if you're using Power BI Desktop, you can find a Kyligence connector inside. Yeah, it's a very good experience if you use it. The same goes for Tableau, Qlik, and the other BI tools. And one more very interesting thing, what we call the WeChat application. I'm not sure how many people here are using WeChat, right? Everybody wants to enable their boss to access the data, to collaborate on the data from mobile. But if you need an iOS or Android developer, it will not happen; it's too complicated. Now we actually have this built in. You just need to build your reports, and we have what we call Kyligence Insight. It can help you publish your data to your WeChat. And you know, WeChat is something like WhatsApp or Facebook Messenger; you can very easily collaborate between teams, right? So this is a very interesting capability. There is no need for any iOS development, you do not need to do any coding; you just need an analyst who can do it this way.

Last, I want to show you a case. This is actually one of the largest Microsoft SQL Server Analysis Services deployments, and we migrated it from on-premises to Azure. Okay, still on Azure. In the previous version, they had built a very complicated architecture because of the limitations they had: they had to build hundreds of different cubes, one main cube and hundreds of sub-cubes, and use a very complicated application to manage all of them. And while they refreshed the data, you know, it could not serve queries. This is one of the biggest financial services firms in the United States, and we successfully replaced this.

So, the last slide, the takeaways. We introduced the semantic layer, we use our AI augmented engine to simplify everything, and it is Spark native and cloud native. And also we can help you get high concurrency with a very low TCO. Okay, you can go to our booth out there. We will have people there, and we have our partner, Kimi, who will show a telecom demo. We will also have a workshop there at two o'clock this afternoon. Okay, so yeah, thank you very much.