 Live from San Francisco, it's theCUBE, covering Flink Forward, brought to you by Data Artisans. This is George Gilbert. We are at Flink Forward, the user conference for the Apache Flink community, sponsored by Data Artisans, the company behind Flink. And we are here with Shuyi Chen from Uber. Shuyi works on a very important project, the Calcite query optimizer, the SQL query optimizer that's used in Apache Flink as well as several other projects. Why don't we start with, Shuyi, tell us where Calcite is used and its role. Calcite is basically used in the Flink Table and SQL API as the SQL parser, query optimizer, and planner for Flink. Okay, so now let's go to Uber and talk about the pipeline or pipelines you guys have been building, and how you've been using Flink and Calcite to enable the SQL API and the Table API. What workloads are you putting on that platform, or on that pipeline? Yeah, so basically I'm the technical lead of the stream processing platform at Uber, and we use Apache Flink as the stream processing engine for Uber. We build two different platforms. One is called AthenaX, which uses Flink SQL, so it basically enables users to use SQL to compose their stream processing logic, and we have a UI, and with one click they can just deploy the stream processing job in production. When you say UI, did you build a custom UI, essentially like a business intelligence tool, so you have a visual way of constructing your queries? Is that what you're describing? Yeah, it's similar to how you write a SQL query to query a database. We have a UI for you to write a SQL query, with syntax highlighting and hints, so that even data scientists, and non-engineers in general, can actually use that UI to compose stream processing jobs. Okay, give us an example of some applications, because this sounds like it's a high-level API, so it makes it more accessible to a wider audience. 
So what are some of the things they build? So for example, our Uber Eats team uses the SQL API as the stream processing tool to build a restaurant manager dashboard. Restaurant manager dashboard. Okay. So basically the data, the logs, live in Kafka and get streamed in real time to the Flink job, which is composed using the SQL API, and then that gets stored in an OLAP database, Pinot. And then when the restaurant owners open the restaurant manager, they will see the dashboard of their real-time earnings and everything. Yeah. And with the SQL API, they no longer need to write the Flink job. They don't need to use Java or Scala code, or do any testing or debugging. It's all SQL. And then what's the SQL coverage, the SQL semantics that are implemented in the current Calcite engine? So it covers basic transformations, projection and windows, like hopping and tumbling windows, and also join, group by, and having, not to mention event time and processing time support. And you can shuffle from anywhere. You don't have to have two partitions with the same join key on one node. The data placement can be arbitrary for the partitions. Well, SQL is declarative, right? So once the user composes the logic, the underlying planner will actually take care of how to key by and group by everything. Okay, because the reason I ask is many of the early Hadoop-based MPP SQL engines had the limitation where you had to co-locate the partitions that you were going to join. It's the same thing for Flink, but the SQL layer takes care of that. Okay. So you describe what you want, but underneath it gets translated into a Flink program that actually does all the co-location. Oh, it handles it for you. Okay, okay. So now they don't even need to learn Flink. They just need to learn the SQL. Okay. Yeah. 
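The tumbling-window aggregation described above can be sketched in plain Python. This is a toy stand-in (the event data and 60-second window size are hypothetical) for what a Flink SQL tumbling-window GROUP BY is planned into by the engine:

```python
from collections import defaultdict

def tumbling_window_sum(events, window_size):
    """Group (timestamp, key, amount) events into fixed, non-overlapping
    windows and sum amounts per (window_start, key) -- roughly the shape
    of work a tumbling-window GROUP BY compiles down to."""
    windows = defaultdict(float)
    for ts, key, amount in events:
        window_start = ts - (ts % window_size)  # align to window boundary
        windows[(window_start, key)] += amount
    return dict(windows)

# Hypothetical earnings events: (event_time_seconds, restaurant_id, amount)
events = [
    (0, "r1", 10.0),
    (30, "r1", 5.0),
    (70, "r1", 2.0),
    (70, "r2", 8.0),
]

# 60-second tumbling windows
print(tumbling_window_sum(events, 60))
# {(0, 'r1'): 15.0, (60, 'r1'): 2.0, (60, 'r2'): 8.0}
```

In the real pipeline the user writes only the SQL; the planner produces the equivalent keyed, windowed Flink job, and the results land in the OLAP store for the dashboard to query.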
Now you said there was a second platform that Uber is building on top of Flink. The second platform is what we call the Flink as a service platform. The motivation is we found that SQL actually cannot satisfy all the advanced needs at Uber for building stream processing. For example, people want to call our RPC services within their stream processing application, or even chain the RPC calls, which is hard to express in SQL. And also, when they have a complicated DAG, like a workflow, it's very difficult to debug individual stages. So they want the control to use the native Flink DataStream API and DataSet API to build a stream or batch job. Is the DataSet API the lowest-level one? No, it's on the same level as the DataStream, so it's one for streaming, one for batch. Okay. Oh, DataStream, and then the other was Table. DataSet. DataStream and DataSet. And there's one lower than that, right? There is one lower-level API, but usually most people don't use that. Oh, so that's for systems programmers. Yeah. Okay. So tell me, what type of programmer uses the DataStream or DataSet API, and what do they build at Uber? So for example, in one of the talks later, there's the marketplace dynamics team. It's actually using the platform to do online model update, machine learning model update, using Flink. Basically they need to take in the model that is trained offline, do a group by, like by time and location, then apply the model and incrementally update it. And so are they taking a window of updates, and then updating the model, and then somehow promoting it as the candidate? Yeah, something similar. Okay, that's interesting. And so are these the data scientists who are using this API? Well, it's not really designed for data scientists. 
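The online model update pattern described above, group incoming events by time and location, then incrementally refresh each model, can be sketched in plain Python. Everything here is hypothetical (the bucket keys, the starting model, and using an exponentially weighted moving average as the "model"); it only illustrates the shape of the computation, not the team's actual algorithm:

```python
def update_models(models, events, alpha=0.1):
    """Incrementally refresh per-(time_bucket, location) models as events
    arrive -- a toy stand-in for online model update. Each 'model' here
    is just an exponentially weighted moving average of observed values."""
    for bucket, location, value in events:
        key = (bucket, location)
        old = models.get(key, value)       # seed new keys with the first observation
        models[key] = old + alpha * (value - old)
    return models

# Offline-trained starting point (hypothetical values)
models = {("hour_8", "SF"): 12.0}

# New streaming observations: (time_bucket, location, observed_value)
events = [("hour_8", "SF", 14.0), ("hour_8", "NYC", 9.0)]
print(update_models(models, events))
# {('hour_8', 'SF'): 12.2, ('hour_8', 'NYC'): 9.0}
```

In the Flink version, the group-by keys become a keyBy, the per-key model lives in Flink's managed state, and each Kafka event triggers the same kind of incremental update.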
Oh, so they're preparing the models offline, and then they're being updated inline on the stream processing platform. Yes. And so it's maybe data engineers who are essentially updating the features that get fed in and continually updating the model. Yeah, basically it's an online model update. As Kafka events come in, we continue to refine the model. Okay, and so as Uber looks out a couple of years, what sorts of things do you see adding to either of these pipelines, and do you see a shift away from batch and request-response type workloads towards more continuous processing? Yes, actually, we do see that trend. Before becoming the tech lead of the stream processing platform team at Uber, I was in marketplace as well. And at that point, we always saw there's a shift, like people would love to use stream processing technology to replace some of the normal back-end service applications. Tell me some examples. Yeah, for example, in our dispatch platform, we have the need to shard the workload, for example by riders, to different hosts to process, for example to compute ETA or compute some time averages, right? And this was done before in back-end services, using our internal distributed systems tooling to actually do the sharding. Yeah. But with Flink, this can be done very easily, right? So there's a shift, where those people also want to adopt stream processing technology, as long as it's not a request-response style application. So the key thing, just to make sure I understand, is that Flink can take care of the distributed joins, whereas when it was a database-based workload, a DBA had to set up the sharding, and now it's more transparent, more automated. 
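The sharding being discussed here is essentially key-based partitioning: every event for the same key is routed to the same worker, so per-key state can live locally on that worker. A minimal sketch (the rider IDs, host count, and hash function are all hypothetical, this is not Flink's actual partitioner):

```python
def assign_host(key, num_hosts):
    """Hash-partition a key (e.g. a rider ID) onto one of num_hosts workers,
    the way a keyBy shuffle routes records. Toy sketch only."""
    # Python's built-in hash() is salted per process for strings, so use a
    # simple stable polynomial hash instead, for reproducible routing.
    h = sum(ord(c) * 31 ** i for i, c in enumerate(key))
    return h % num_hosts

# All events for the same rider land on the same host, so per-key state
# (running averages, ETA windows) never has to be looked up remotely.
riders = ["rider-1", "rider-2", "rider-1"]
hosts = [assign_host(r, 4) for r in riders]
assert hosts[0] == hosts[2]  # same key -> same host
```

This is the piece back-end teams previously built by hand with internal distributed systems tooling; with Flink's keyBy, the routing and the co-located state management come for free.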
I think it's more about the support. Before, people writing back-end services had to write everything: the state management, the sharding, everything. Oh, it's not even database-based. Yeah, it's not a database, it's real-time. So they had to do the physical data management, and Flink takes care of that now. Yeah. Oh, got it. Some of these applications are real-time, so we don't really need to store the data in a database all the time. It's usually kept in memory and somehow snapshotted, but before, with normal back-end services, they had to do all of that themselves. With Flink, there is already built-in support for state management and all the sharding, partitioning, and the time window and aggregation primitives. It's all built-in, and you don't need to reimplement the logic and rearchitect the system again. So it's a new platform for real-time. It gives you a whole lot of services, a higher abstraction, for real-time applications. Yeah. Okay, all right, with that, Shuyi, we're going to have to call it a day. This was Shuyi Chen from Uber, talking about how they're building more and more of their real-time platforms on Apache Flink and using a whole bunch of services to complement it. We are at Flink Forward, the user conference of Data Artisans for the Apache Flink community. We're in San Francisco. This is the second Flink Forward conference, and we'll be back in a couple of minutes. Thanks.