Welcome everyone. We're back at the first Flink Forward conference in the U.S. It's the Flink user conference sponsored by Data Artisans, the creators of Apache Flink. We're on the ground at the Kabuki Hotel, and we've heard some very high-impact customer presentations this morning, including Uber and Netflix. And we have the great honor to have Xiaowei Jia from Alibaba with us. He's Senior Director of Research, and what's so special about having him as our guest is that they have the largest Flink cluster in operation in the world that we know of, and that the Flink folks know of as well. So, welcome, Xiaowei.

Thanks for having me.

Okay, so we gather you have a 1,500-node cluster running Flink. Let's sort of unpack that, how you got there. What were some of the use cases that drove you in the direction of Flink and complementary technologies to build this?

Okay, yeah, I'll explain a few use cases. The first use case that prompted us to look into Flink is the classical search ETL case. We basically need to process all the data that's necessary for the search service. So we looked into Flink about two years ago. The next use case is the A/B testing framework, which is used to evaluate how your machine learning models work. Today we use it in a few other very interesting cases. We use it to do machine learning to adjust the ranking of search results, to personalize your search results in real time and deliver the best search results for our users. We also use it to do real-time anti-fraud detection for ads. So these are the typical use cases we are doing.

Okay, this is very interesting, because with the ads and the one before that, was it fraud?

Ads is anti-fraud; before that is machine learning, real-time machine learning.

So for those, low latency is very important. Now, help unpack that. Are you doing the training for these models in a central location and then pushing the models out close to where they're going to be used for the near real-time decisions, or is that all run in the same cluster?

Yeah, so basically we are doing two things. We use Flink to do real-time feature updates, which change the features in real time, within a few seconds. For example, when a user buys a product, the inventory needs to be updated. Such features get reflected in the ranking of search results in real time. We also use it to do real-time training of the model itself. This becomes important during some special events. For example, on China's Singles' Day, which is the largest shopping holiday in China, it already generates more revenue than Black Friday in the United States. On such a day, because almost everything goes on sale for almost 50% off, user behavior changes a lot. So whatever model you trained before does not work reliably. So it's really nice to have a way to adjust your model in real time to deliver the best experience to our users. All these things are actually running in the same cluster.

Okay, that's really interesting. So it's like you have a multi-tenant solution; that sounds like it's rather resource-intensive. When you're changing a feature or features in the models, how do you go through the process of evaluating them and finding out their efficacy before you put them into production?

Yeah, so this is exactly the A/B testing framework I mentioned earlier. We also use Flink to collect the metrics, the performance of these models, in real time. Once this data is processed, we push it into our OLAP system so we can see the performance of the models in real time.
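To make the real-time feature update pattern described above a bit more concrete, here is a minimal sketch using Flink's DataStream API in Java: purchase events keyed by product are folded into a per-product inventory count held in keyed state, and each update is emitted downstream, where it could feed search ranking. The event shape, class names, starting stock, and the print sink are all hypothetical illustrations, not Alibaba's actual pipeline.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class InventoryFeatureJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical purchase events: (productId, quantityPurchased).
        // In a real job this stream would come from a message queue such as Kafka.
        DataStream<Tuple2<String, Integer>> purchases = env.fromElements(
                Tuple2.of("product-1", 2),
                Tuple2.of("product-2", 1),
                Tuple2.of("product-1", 3));

        // Key by product and keep a running inventory count in keyed state,
        // emitting an updated (productId, remainingInventory) feature on every purchase.
        purchases
            .keyBy(value -> value.f0)
            .flatMap(new InventoryUpdater(100))   // assume every product starts with 100 units
            .print();                             // stand-in for the feature store the ranking service reads

        env.execute("Real-time inventory feature update (sketch)");
    }

    public static class InventoryUpdater
            extends RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {

        private final int initialStock;
        private transient ValueState<Integer> remaining;

        public InventoryUpdater(int initialStock) {
            this.initialStock = initialStock;
        }

        @Override
        public void open(Configuration parameters) {
            remaining = getRuntimeContext().getState(
                new ValueStateDescriptor<>("remaining-inventory", Integer.class));
        }

        @Override
        public void flatMap(Tuple2<String, Integer> purchase,
                            Collector<Tuple2<String, Integer>> out) throws Exception {
            Integer current = remaining.value();
            if (current == null) {
                current = initialStock;
            }
            // Decrease inventory by the purchased quantity, never below zero.
            int updated = Math.max(0, current - purchase.f1);
            remaining.update(updated);
            out.collect(Tuple2.of(purchase.f0, updated));
        }
    }
}
```

Because the state is keyed and checkpointed by Flink, the same pattern scales out across a large cluster while keeping per-product updates visible within seconds.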
Okay, very, very impressive. So now, explain perhaps why Flink was appropriate for those use cases. Is it because you really needed super low latency, or that you wanted a less resource-intensive sort of streaming engine to support these? What made it fit that right sweet spot?

Yeah, so Search has lots of different products, and they have lots of different data processing needs. When we looked into all these needs, we quickly realized we actually need a compute engine that can do both batch processing and stream processing. And in terms of stream processing, we have a few needs. For example, we really need super low latency. In some cases, for example, if a product is sold out and you still display it in your search results, users click and try to buy it, they cannot buy it, and it's a bad experience. So the sooner you can get the data processed, the better.

So near real time for you means how many milliseconds does the...

It's usually like a second, one second, something like that.

But that's one second end to end, talking to inventory.

That's right.

How much time would the model itself have to...

It's very short.

In the single-digit milliseconds?

It's probably longer than that. There are some scenarios that require single-digit milliseconds. That's a security scenario. That's something we are currently looking into. When you do transactions on our site, we need to detect if it's a fraudulent transaction. We want to be able to block such transactions in real time. For that to happen, we really need a latency that's below 10 milliseconds. So when we were looking at compute engines, this was also one of the requirements we were thinking about. We really need a compute engine which is able to deliver sub-second latency if necessary, and at the same time can also do batch efficiently. So we are looking for a solution that can cover all the computation needs.

So one way of looking at it is that many vendors and customers talk about elasticity as in the size of the cluster. But you're talking about elasticity or scaling in terms of latency.

Yes, latency and the way of doing computation. You can view the security scenario as the most restricted latency requirement, and batch as the most relaxed version of the latency requirement. We want support for the full spectrum. It's possible to use a different engine for each scenario, but that means you have to maintain more code bases, which can be a headache. And we believe it's possible to have a single solution that works for all these use cases.

So okay, last question. Help us understand, for mainstream customers who don't hire the top PhDs out of the Chinese universities, but who have skilled data scientists, though not an unending supply, and aspire to build solutions like this. Tell us some of the trade-offs they should consider, given that the skill set and the bench strength is very deep at Alibaba and that's perhaps not as widely disseminated or dispersed within a mainstream enterprise. How should they think about the trade-offs in terms of the building blocks for this type of system?

Yeah, that's a very good question. We actually thought about this. Initially we were using the DataSet and DataStream APIs, which are relatively low-level APIs. Developing an application with them is reasonable, but it still requires some skill. So we wanted to make it even simpler, for example to make it possible for data scientists to do this.
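The sub-10-millisecond fraud scenario above, and the remark that the lower-level DataStream API still requires some skill, can both be illustrated with a small sketch. The rule below, flagging an account that issues more than five transactions within one second, is purely hypothetical, as are the class and state names; it shows the kind of hand-written keyed-state and timer code such a job involves, not Alibaba's actual detection logic.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input: (accountId, amount) transactions, keyed by accountId.
// Output: alert strings that a downstream system could use to block the transaction.
public class TxnVelocityCheck extends KeyedProcessFunction<String, Tuple2<String, Double>, String> {

    private static final int MAX_TXN = 5;
    private transient ValueState<Integer> txnCount;

    @Override
    public void open(Configuration parameters) {
        txnCount = getRuntimeContext().getState(
            new ValueStateDescriptor<>("txn-count", Integer.class));
    }

    @Override
    public void processElement(Tuple2<String, Double> txn, Context ctx,
                               Collector<String> out) throws Exception {
        Integer count = txnCount.value();
        if (count == null) {
            count = 0;
            // Clear the counter one second from now (processing time),
            // so the rule effectively looks at a rolling one-second burst.
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 1000);
        }
        count += 1;
        txnCount.update(count);

        if (count > MAX_TXN) {
            // The decision is made inline, within milliseconds of the event arriving.
            out.collect("Suspicious transaction velocity for account " + txn.f0);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        txnCount.clear();
    }
}
```

Even a toy rule like this needs explicit state descriptors, timers, and lifecycle methods, which is exactly the kind of boilerplate the Table API and SQL layer discussed next is meant to hide.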
So in the last half a year we spent a lot of time working on Table API and SQL support, which basically lets you describe your computation logic, or data processing logic, using SQL. SQL is used widely, so a lot of people have experience with it. We are hoping that with this approach it will greatly lower the threshold for people to use Flink. At the same time, SQL is also a nice way to unify stream processing and batch processing. With SQL you only need to write your processing logic once, and you can run it in different modes.

So okay, this is interesting, because some of the Flink folks say, you know, structured streaming, which is the table construct with DataFrames in Spark, is not a natural way to think about streaming. And yet the Spark guys say, hey, that's what everyone's comfortable with; we'll live with probabilistic answers instead of deterministic answers because we might have late arrivals in the data. But it sounds like there's a feeling in the Flink community that you really do want to work with tables, despite their shortcomings, because so many people understand them.

So ease of use is definitely one of the strengths of SQL, and another strength is that it's declarative: the user doesn't need to say exactly how to do the computation, only what they want to get. This gives the framework a lot of freedom in optimization, so users don't need to worry about the hard details of optimizing their code. It lets the system do its work. At the same time, I think deterministic results can be achieved in SQL; it just means the framework needs to handle late events and such things correctly in its implementation of SQL. When using SQL you are not really sacrificing that determinism.

Okay, we'll have to save this for a follow-up conversation because there's more to unpack there. But Xiaowei Jia, thank you very much for joining us and imparting some of the wisdom from Alibaba. We are on the ground at Flink Forward, the Data Artisans conference for the Flink community, at the Kabuki Hotel in San Francisco, and we'll be right back.
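For readers who want to see the "write your processing logic once, run it in different modes" idea from the Table API and SQL discussion in practice, here is a minimal sketch of embedded Flink SQL from Java. The table name, schema, and the datagen connector are placeholders; in a real job the source would be a Kafka topic in streaming mode or a warehouse table in batch mode.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class WriteOnceRunTwice {

    public static void main(String[] args) {
        // Flip this flag to run the exact same SQL in streaming or batch mode.
        boolean streaming = true;

        EnvironmentSettings settings = streaming
            ? EnvironmentSettings.newInstance().inStreamingMode().build()
            : EnvironmentSettings.newInstance().inBatchMode().build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Placeholder source table: a bounded data generator so the example
        // works in both modes. Swap the connector for 'kafka' or 'filesystem'.
        tEnv.executeSql(
            "CREATE TABLE purchases (" +
            "  product_id STRING," +
            "  quantity   INT" +
            ") WITH (" +
            "  'connector' = 'datagen'," +
            "  'number-of-rows' = '1000'," +
            "  'fields.quantity.min' = '1'," +
            "  'fields.quantity.max' = '10'" +
            ")");

        // The processing logic is written once as SQL and runs in either mode.
        Table salesPerProduct = tEnv.sqlQuery(
            "SELECT product_id, SUM(quantity) AS total_sold " +
            "FROM purchases GROUP BY product_id");

        salesPerProduct.execute().print();
    }
}
```

Only the EnvironmentSettings flag changes between the two runs; the table definition and the query stay the same, which is the unification of batch and streaming that the interview describes.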