Hello, I'm Yushi, a technical lead of Edge AI at Mercari. Today, I will talk about client-side ML design strategy for a C2C e-commerce application. This is today's agenda. First, I will talk about what client-side ML is, where it is used, and what its advantages are. Then, I'll share a tool stack for developing applications with client-side ML technologies, and I will discuss the key points for client-side ML production development. Finally, I'll talk about the next step for client-side ML.

Client-side ML is often called edge AI as well. Nowadays, machine learning is widely used in production, especially in web and mobile applications, and most ML features rely on server-side training and inference. In server-side ML, data such as photos, voice, and text is uploaded from user devices to servers, and ML models are applied to the data on the servers. In client-side ML, by contrast, you perform ML model inference on the user's device, and the data isn't uploaded to servers.

I will introduce several use cases of client-side ML. First, Google Assistant. In this application, by talking to your smartphone or smart speaker, you can request various things from the device. For example, to find out how to get to the airport, you can simply ask the device, "How do I get to the airport by public transport?" and the device shows the result.

Live Text was introduced in iOS 15. It allows you to easily copy and paste text from photos, such as those taken with your iPhone, without any additional operation.

Background replacement on Google Meet. I guess many people are using this; it is a client-side ML feature widely used all over the world. You can replace or blur your background in an online meeting.

Google IME. This is a familiar feature for Japanese users. It suggests appropriate words depending on the context you are typing.

Next, I will share the use case at Mercari. Mercari is a C2C e-commerce platform: customers can list an item on Mercari, and other customers can buy it. We released Mercari Lens (beta), which uses client-side ML technology. In this feature, by just holding your smartphone over an item, you can find the item's category on Mercari, similar items sold on Mercari, and the price range of those similar items. You can check the market price very easily, one item after another, since you don't need to take a photo or input any information.

Client-side ML can achieve new user experiences. It can achieve a more human-like interface: with Google Assistant and smart speakers, you can make requests by speech, the same as with a human, and more interactive experiences become possible. With Live Text on iOS, you can get text directly from an image, and with Mercari Lens, you can get information by just holding the camera over an item. It can also help personalize the application: with Google IME, the suggestions are personalized more and more by learning from your input without uploading the raw text. The suggestion model is trained on your device, so your privacy is protected.

Recently, many tools, SDKs, and libraries for client-side ML have been developed. When you create an application for Apple devices, you can use Core ML. With this framework, you can create an ML model on your Mac, easily convert it to a mobile-ready format, and use the model from Swift code.
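As a rough illustration of that workflow, here is a minimal sketch of converting a trained Keras model with coremltools. The model path and name are hypothetical, and the exact call depends on your model type and coremltools version.

```python
# Minimal sketch: convert a trained Keras model to the Core ML format.
# "my_model" and "MyModel.mlpackage" are hypothetical names, not an
# actual Mercari artifact; details vary by model and library version.
import coremltools as ct
import tensorflow as tf

keras_model = tf.keras.models.load_model("my_model")

# Convert to an ML Program, the Core ML model format for Apple devices.
mlmodel = ct.convert(keras_model, convert_to="mlprogram")
mlmodel.save("MyModel.mlpackage")  # load this package from Swift code
```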
When you create a cross-platform application, ML Kit is useful. Its documentation shows many examples of client-side ML applications, such as barcode scanning, face detection, OCR, translation, and more; many use cases are covered by these examples. When you need more advanced features, you can use lower-level open-source libraries like TensorFlow Lite or PyTorch Mobile. The development steps are similar for any of these libraries: you train your model on the server and convert it to a specific format for the runtime. Libraries such as TensorFlow Lite and PyTorch Mobile provide both a converter and a runtime, so you can use any model you like on the client side.

So now you can easily create your own application using client-side ML, at least as a demo. But is it also easy to ship such an application to production? Unfortunately, there are still difficulties in introducing client-side ML into production, so I will share the strategy we used for the development of Mercari Lens.

When considering production development, there are four key points to check: reason, size, time, and validation. You should consider them in this order, according to the phase of development.

The first point is reason. First of all, you need to consider whether client-side ML is truly required. As we discussed, client-side ML can enable new experiences, and it has the potential to improve UX dramatically. However, developing and operating it requires additional engineering costs. For example, tuning the model and library for various devices is needed. The available user logs are limited, because the data is not uploaded and inference is performed on user devices, so analyzing logs is difficult. The model has to be converted to a special format and run on a wide range of environments, so testing ML models becomes difficult. And if the ML model is bundled into the application binary, an application update is required for every model update; to avoid that, you need special infrastructure for distributing ML models. So if server-side ML is enough for your use case, you should choose the server side.

Then when should we use the client side? Several experiences cannot be achieved without client-side ML. There are three typical cases where client-side ML is inevitable. First, the application requires an extremely quick and stable response time for ML. With server-side ML, the response time is unstable due to the network connection, and it often exceeds one second; if the response time must be less than 50 milliseconds, client-side ML is required. Second, when the data managed by the application is privacy-sensitive and cannot be uploaded to a server, you have to choose the client side. The last case is ML inference in an offline environment; needless to say, server-side ML can only be used online.

Let's take our application, Mercari Lens, as an example. In this application, there are six steps for displaying the information: get an image from the camera stream, perform object detection on the image, track the detected object, crop the object region and search for similar items on Mercari, estimate the price from the similar items, and display the result. We should perform all steps on the camera stream and display the result as smoothly as possible.

At first, we thought: can all steps, except for getting the image and displaying the result, run on the server side? If possible, that would be the easiest way; the client would just call the server's API. However, it is difficult to move the object detection and tracking parts to the server side. As object detection and tracking are applied to every frame of the camera stream, these steps must be completed in about 33 milliseconds. And if all images in the camera stream were uploaded, there would be privacy issues. So these steps should be located on the client side.
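Here is a hypothetical sketch of the overall split we ended up with; the placement of the server-side steps is explained next. The function bodies are stubs, not Mercari's actual implementation.

```python
# Hypothetical sketch of the client/server split: detection and
# tracking run on-device every frame, while the server is called only
# when a new object appears. All functions below are stubs.

def detect(frame):                # on-device: deep-learning detector
    return "object-1"             # stub: pretend we found an object

def track(frame, obj):            # on-device: lightweight tracking
    pass

def search_similar_items(obj):    # server-side API call
    return ["item-a", "item-b"]   # stub result

def process_stream(frames):
    current = None
    for frame in frames:                 # ~30 frames/sec, so the
        obj = detect(frame)              # on-device steps get ~33 ms
        track(frame, obj)                # per frame
        if obj is not None and obj != current:
            current = obj
            # Server-side search runs once per newly detected object,
            # not once per frame.
            print(search_similar_items(obj))

process_stream(range(5))
```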
For the step of searching similar items, it should run on the server side. This step does not need to be performed for every frame; it is enough to call it only once, when a new object is detected. And since the variety of items sold on Mercari is unlimited, the search index becomes huge in order to achieve high search accuracy; such a huge search index cannot be located on user devices. As for the price estimation step, it is only called after the search step, and its computation cost is low, so either side would be fine. We located it on the server side.

The second key point is size. We have to be concerned about how much the application size will increase by introducing client-side ML. ML models and related libraries have to be stored on user devices, so the application size increases. It is known that increasing application size decreases downloads of the application, so you have to check whether the size increase is acceptable.

I listed the typical mitigation measures for this problem. You should choose lightweight models such as MobileNet, and try further tuning of the model architecture for size reduction. Model quantization is also an effective measure: it can reduce the size of the model significantly. Finally, it is better to tune the library size as well; there may be room for tuning build options.

In our case, we compared several object detection models designed for mobile use, such as SSD MobileNet V2, SSDLite MobileNet V2, and SSDLite MobileNet V3 Small. In general, there is a trade-off between parameter size and accuracy. For example, MobileNet V3 Small is extremely small, but its accuracy is low compared to the MobileNet V2 series. We found that the accuracy of MobileNet V3 Small was not enough for our use case, so we selected SSDLite MobileNet V2.

The default SSDLite MobileNet V2 model infers the location of an object in the image and its class, such as dog, cat, or bird. The classification needs a fully connected layer, and it is well known that the number of parameters in a fully connected layer increases significantly with the number of classes. In our use case, we don't need the classification, so we removed the fully connected layer used for classification.

Let's check the effect of removing the fully connected layer. If the model has a fully connected layer for 600-class classification, the model size is 49 megabytes; if we remove the fully connected layer, the size drops to 12 megabytes.

The parameters of the model are represented in Float32. The converter for TensorFlow Lite can transform them to Float16 or Uint8; this is called quantization. By quantization, the size is halved for Float16 and quartered for Uint8. However, the accuracy of the model often degrades because the parameters are truncated to Float16 or Uint8. In Mercari Lens, we selected Float16, because Uint8 did not give enough accuracy for our use case.

Summarizing what we did for size: the size of the default model we selected was 49 megabytes. We reduced it to 12 megabytes by removing the fully connected layer. Then, by quantizing to Float16, the final size is 6 megabytes, which was acceptable for the production release.
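For reference, this is roughly what Float16 post-training quantization looks like with the TensorFlow Lite converter; a minimal sketch, assuming a hypothetical SavedModel directory for the detection model.

```python
# Minimal sketch of Float16 quantization with the TensorFlow Lite
# converter. "ssdlite_mobilenet_v2/" is a hypothetical SavedModel
# path, not Mercari's actual artifact.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("ssdlite_mobilenet_v2/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # halve Float32 weights

tflite_model = converter.convert()
with open("detector_fp16.tflite", "wb") as f:
    f.write(tflite_model)
```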
The next key point is time. In this section, I will discuss how to make the application smooth. In video processing, images are input to the system 30 times per second, so if the ML model is applied to every frame, the inference time should be less than 33 milliseconds. If model inference can't be completed within 33 milliseconds, the next inputs can't be received and some frames are skipped; that is the cause of lagging.

In the previous section, I said that we removed the fully connected layer for size reduction. That also has a big positive effect on inference time. With a fully connected layer for 600 classes, the inference time is about 60 milliseconds; by removing it, the time is reduced to 32 milliseconds. So it looks like inference can be completed within the frame budget, but it's still not good enough: there are many pre- and post-processes around the ML inference, and in a real implementation, all processing should be completed within 33 milliseconds, including pre-processing, inference, and post-processing.

To resolve this problem, we introduced MediaPipe. MediaPipe is an open-source project developed by Google. It's a cross-platform, customizable framework for streaming media, and it provides samples such as face detection, hand tracking, object detection and tracking, and hair segmentation. We adapted the object tracking example for our application.

MediaPipe lets you represent the steps of inference as a pipeline. For example, get image, pre-process, object detection, post-process, and display result can be represented as one pipeline. The nodes in the pipeline can be processed in parallel based on their dependencies: the inference node depends on the pre-process node, and the post-process node depends on the inference node. So while the inference node is processing frame n+1, pre-processing can already run on frame n+2. In this way, the whole chain, including pre-processing and post-processing, does not need to complete within 33 milliseconds.

MediaPipe enables you to represent complex pipelines, which can be further optimized. In Mercari Lens, object detection and object tracking are performed on the client side. Detection receives an image as input, and tracking receives the image and the detected objects. Therefore, the pipeline can be defined so that the detection node depends only on the image node, and the tracking node depends on both the detection and image nodes. Object detection, being a deep learning model, is the most compute-intensive and time-consuming node. Object tracking, which uses traditional computer vision techniques rather than deep learning, is far less compute-intensive than detection.

To reduce computation cost, we can optimize the graph: the tracking node takes its own output as an input, and the detection node is called only a few times per second. In this pipeline, once an object is detected by the detection node, it continues to be tracked by the tracking node until the detection node updates it. With this optimization, the whole computation cost is drastically reduced while maintaining the smoothness of the UX.
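The idea can be sketched outside MediaPipe as well. Here is a hypothetical Python version of that graph optimization, with placeholder detect/track functions; it is a sketch of the scheduling pattern, not Mercari's MediaPipe graph.

```python
# Hypothetical sketch of the graph optimization above: run the heavy
# detector only a few times per second, and the cheap tracker on every
# frame, feeding its own previous output back in. detect() and track()
# are stubs, not actual MediaPipe nodes.

DETECT_EVERY_N = 10                # ~3 detector runs/sec at 30 fps

def detect(frame):                 # heavy deep-learning node
    return {"box": (0, 0, 10, 10)}

def track(frame, state):           # light classical-CV node
    return state                   # stub: carry the box forward

def run(frames):
    state = None
    for i, frame in enumerate(frames):
        if i % DETECT_EVERY_N == 0:
            state = detect(frame)        # refresh the tracked object
        else:
            state = track(frame, state)  # reuse the previous output
        yield state

for result in run(range(30)):
    pass  # display(result) in a real application
```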
The last key point is validation. This is the most distinctive part of client-side ML. As you may have already noticed, for client-side ML the inference environment is not fixed: it is the user's device, and the inference time depends on the device's spec. If the device is high-end, the inference time is short, and vice versa. Therefore, benchmarking and validating on real devices are needed.

To resolve this problem, we have been developing a UX validation platform called Jetfire. On this platform, we can validate model inference time, streaming inference performance, and search accuracy. I'll show you a demo.

Like this, you can check the model inference time on real devices. This is an example with Android devices such as the Pixel 3 and the Galaxy S9. Smartphones have several computational units, such as CPU and GPU, and you can check the performance of each computational unit as well. For example, these are the CPU and GPU results for each of the example Android devices. According to these results, the GPU performs significantly better than the CPU, so let's focus on GPU performance. The Galaxy S9 is much faster than the Pixel 3, but even the inference time of the Pixel 3 is 29 milliseconds, which is under our frame budget, so the inference time looks acceptable.

You can also check the streaming inference performance on real devices. This is the tracking result on a Google Pixel 3, and this is the tracking result on a Google Pixel 5. On this platform, you can validate that tracking works properly on every device.

And this is the validation of search accuracy. These are example items sold on Mercari, with their estimated prices and estimated categories. This is the category estimation accuracy for each category, and this is the price estimation score for each category. With this platform, you can validate the search accuracy. In short, the platform enables you to validate that your application provides the expected UX on a wide range of user devices.
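Jetfire runs these checks on real devices, but the core latency measurement is simple. Here is a minimal sketch of timing single TensorFlow Lite inferences; the model path is hypothetical, and on-device benchmarking follows the same pattern.

```python
# Minimal sketch of measuring single-inference latency with the
# TensorFlow Lite interpreter. "detector_fp16.tflite" is hypothetical;
# this is the same pattern a device-side benchmark would use.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="detector_fp16.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

times = []
for _ in range(50):
    interpreter.set_tensor(inp["index"], dummy)
    start = time.perf_counter()
    interpreter.invoke()
    times.append((time.perf_counter() - start) * 1000)

print(f"median inference: {sorted(times)[len(times) // 2]:.1f} ms")
```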
So far, I have been talking about client-side ML for native mobile applications. From here, I will talk about a project that is currently under development: the next project is for the web.

WebAssembly, called Wasm, is a cross-browser binary format that brings near-native execution speed to the web. Recently, with many major browsers supporting Wasm, client-side ML in mobile browsers has been getting a lot of attention. The main merit of the web is that customers can use the feature without installing the application, so we can reach customers who haven't created an account or installed the application yet.

Both TensorFlow Lite and MediaPipe can be compiled to Wasm binaries, so we believed we could migrate from the native application to the web easily. We thought the application would run in the browser just by building TensorFlow Lite and MediaPipe into a Wasm binary. Unfortunately, we were wrong. There are two big issues. First, Wasm threads are not supported on mobile browsers. In Mercari Lens for native applications, the processing pipeline uses multiple threads, so our pipeline cannot run on mobile browsers as-is. Second, WebAssembly SIMD is not supported on iOS browsers. SIMD is a type of parallel processing, and it improves the performance of machine learning inference; without SIMD, inference performance degrades drastically. So we needed to modify the architecture of the application.

In the architecture for the native application, all processing is inside MediaPipe, and TensorFlow Lite is used internally. In the new architecture for the web, we introduced TensorFlow.js and redesigned the pipeline for the web. TensorFlow.js is an open-source library that enables high-performance machine learning inference in web browsers. It supports both WebGL and Wasm backends, so it can achieve high-performance inference with WebGL instead of Wasm. The deep learning inference for object detection is performed by TensorFlow.js. The object tracking is still performed in MediaPipe, with the algorithm re-implemented so that the tracking pipeline runs on a single thread.

Finally, we could migrate the application to the web. We replaced TensorFlow Lite with TensorFlow.js, kept part of the pipeline on MediaPipe, and re-implemented the tracking algorithms. Now we have achieved native-like performance in web browsers.

This is the conclusion of my presentation. Recently, many tools for client-side ML have become available, so you can create demos easily. But when you think about introducing client-side ML into production, there are four key points: reason, size, time, and validation. For the web, there are still many limitations compared with native applications; however, the situation has greatly improved recently, and we are looking forward to future developments in Wasm-related technologies. Thank you for listening.