Hi, thanks for coming to our session. Today, we are going to talk about acceleration techniques for image preprocessing and their effect on machine learning systems. Before beginning our talk, please let us introduce ourselves. I'm Kyosuke Hashimoto, and I'm a research engineer at Hitachi. My current interest is developing AI solutions with open source software such as Kubeflow and public clouds such as AWS and Microsoft Azure. Hi, I'm Masahiro Ito. I work at Hitachi as a software engineer. I'm responsible for leveraging open source software related to big data and AI. I'm developing big data and AI solutions for customers who are going to build enterprise systems. I've been developing data lakes and data pipelines using Apache Hadoop, Spark, HBase, Kafka, and more. Now, I'm focusing on developing MLOps solutions using Kubeflow.

Okay, this is the outline of our talk. Our presentation is in five parts. First, we will introduce our business and why we focus on image preprocessing with deep learning. Next, we would like to talk about the evaluation scenario for image preprocessing. In parts three and four, we will introduce our evaluation results. Finally, we will discuss the evaluation results and give a brief summary of what we have covered.

Okay, let's move on to the introduction. To begin with, I'm going to talk about our motivation to focus on deep learning. Deep learning technology is widely used for processing various data, such as images, text, and audio. These data processing technologies are applied in various industries, such as healthcare, manufacturing, security, and automotive. Our company, Hitachi, has a lot of products and services in these industries. Image processing in particular is used in such industries. For example, image processing is applied to medical image analysis, action recognition, biometrics, and self-driving cars. Let's talk about production machine learning systems with deep learning.
A production machine learning system contains various components, such as data preprocessing, model training, serving, and model monitoring. Model training with deep learning is just a single part. Among these parts, data preprocessing is an important part of the machine learning system, because it is required in both the model training and serving phases.

This figure shows the data processing flow during the training and serving phases. In the training phase, data scientists create a machine learning model. In this phase, data scientists extract features from raw data. Data scientists develop this data preprocessing logic through feature engineering. The extracted features are used to train models. By speeding up data preprocessing, data scientists can perform more experiments to improve their models. In the serving phase, the machine learning system provides predictions using a trained model in a production environment. In this phase, the model requires the same format of input data as during training, so we need to deploy the same data preprocessing logic as in the training phase. Therefore, speeding up data preprocessing reduces both experiment time and serving time. That's why we focus on speeding up data preprocessing.

I'd like to talk about open source deep learning frameworks and data preprocessing libraries. There is a lot of open source software in this category. The lower left graph shows the popular frameworks and libraries. As you can see, TensorFlow and PyTorch have become very popular deep learning frameworks. PyTorch has an image preprocessing library named TorchVision. OpenCV is a popular computer vision and machine learning library, so OpenCV is often used for image preprocessing. In this presentation, we focus on TorchVision and OpenCV. I'll explain the evaluation outline for TorchVision and OpenCV for image preprocessing. In order to speed up image preprocessing, we applied acceleration techniques to these libraries and evaluated their effects.
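To make the point about sharing one preprocessing logic between training and serving concrete, here is a minimal sketch. All names here are hypothetical illustrations, and the preprocessing body is a trivial stand-in for real image operations:

```python
# One definition of the preprocessing logic, imported by both the
# training script and the serving handler, so the model always sees
# input in the same format. (All names are hypothetical.)

def preprocess(pixels, size=4):
    """Scale 0-255 pixel values to [0, 1] and pad/crop to a fixed length.

    A real system would operate on image tensors with TorchVision or
    OpenCV, but the contract is the same: identical logic in both phases.
    """
    scaled = [p / 255.0 for p in pixels]
    # Pad with zeros or crop so every sample has the same shape.
    return (scaled + [0.0] * size)[:size]


def train_step(raw_batch):
    # Training path: features are extracted with the shared logic.
    return [preprocess(sample) for sample in raw_batch]


def serve(raw_sample):
    # Serving path: the very same function, so no train/serve skew.
    return preprocess(raw_sample)
```

Because both phases call the one function, any change to the preprocessing logic automatically applies to training and serving alike.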
We compared the performance of TorchVision and OpenCV using acceleration techniques in the serving and training phases. In the serving phase, we evaluated the performance of image preprocessing only; this evaluation does not include inference by the model. First, we checked the baseline performance. Then, we evaluated the performance of image preprocessing when applying acceleration techniques. In the training phase, we evaluated the combined performance of data preprocessing and model training.

Next, we are going to explain our evaluation scenario. In order to evaluate preprocessing performance in general, we needed to find a good sample scenario. We thought that an evaluation scenario for a deep learning application should cover general image preprocessing techniques, as shown in the left table. The left table shows the classification and examples of image preprocessing techniques. We can classify the preprocessing techniques into three classes: emphasizing valuable features, removing noise from valueless features, and augmenting features.

We chose MLPerf for evaluating deep learning system performance. MLPerf is a benchmarking project for machine learning systems that was launched in 2018. Cloud companies such as Google and Microsoft, as well as hardware vendors such as Intel and NVIDIA, joined the project. We selected object detection, provided by MLPerf, as the evaluation scenario. Please look at the left table. MLPerf covers the general deep learning techniques shown in the table. We selected object detection because it covers general preprocessing techniques and is widely used in various areas such as autonomous driving and cancer detection. Please look at the right figure. It shows the image preprocessing needed for object detection.
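As a rough illustration of the three classes in the table, here is a toy sketch on a small fake image. The specific operations we picked for each class (normalization, border cropping, flipping) are our own hypothetical examples, not the MLPerf pipeline:

```python
import numpy as np

# Toy 8x8 RGB "image" so the sketch is self-contained.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)  # H x W x C

# 1. Emphasizing valuable features: per-channel mean/std normalization,
#    so the model sees zero-centered, unit-scale inputs.
mean = image.mean(axis=(0, 1))
std = image.std(axis=(0, 1)) + 1e-8
emphasized = (image - mean) / std

# 2. Removing valueless features: crop away a border carrying no signal.
denoised = emphasized[1:-1, 1:-1, :]  # 8x8 -> 6x6

# 3. Augmenting features: horizontal flip to create an extra sample.
augmented = denoised[:, ::-1, :]
```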
As you can see in the figure, it covers all classifications of general image preprocessing, such as emphasizing features, augmenting images, and deleting unused features, as shown on the previous slide.

Now we are going to explain how we evaluated the effect of accelerating preprocessing in the serving phase. We assumed that we needed to process as many images as possible in production. We used the following hardware and software for the evaluation. Please look at the top left table. This is the evaluation environment we used: a machine with multiple CPU cores. Next, please look at the bottom left table. This is the parameter set. We changed the number of processes from 1 to 20 to observe the effect of parallelism in production. Also, we iterated the preprocessing 1,000 times. Finally, please look at the right table. This is the software specification. We selected the open source software and versions shown in this table.

This figure shows the total process we evaluated. We compared the performance of TorchVision and OpenCV using acceleration techniques. Our evaluation scope included reading image files, preprocessing them, and writing image files. To begin with, we applied parallelism as an acceleration technique. We compared the preprocessing time and the average CPU usage during preprocessing with TorchVision and OpenCV. We could increase the number of processes by passing arguments to the MLPerf code. Please look at the left figure. It shows the preprocessing time; the x-axis represents the number of processes. As you can see in the figure, preprocessing time was reduced by increasing the number of processes. Please look at the right figure. It shows the average CPU usage; the x-axis represents the number of processes. As you can see in this figure, the CPU usage of OpenCV was much higher than that of TorchVision, and it reached almost 100% when we increased the number of processes to 4.
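The parallelism technique above can be sketched as follows. This is not the MLPerf code; the worker function is a CPU-bound stand-in so the example is self-contained:

```python
import multiprocessing as mp

def preprocess(image_id):
    """Stand-in for a CPU-bound preprocessing step (resize, normalize, ...).

    We just burn some CPU and return a deterministic fake result; real
    workers would call TorchVision or OpenCV on actual image files.
    """
    acc = 0
    for i in range(10_000):
        acc = (acc + image_id * i) % 97
    return (image_id, acc)

def run(num_processes, image_ids):
    # A pool of worker processes preprocesses images in parallel; more
    # processes keep more CPU cores busy, which is what reduced the
    # preprocessing time in the measurements.
    with mp.Pool(processes=num_processes) as pool:
        return pool.map(preprocess, image_ids)

if __name__ == "__main__":
    results = run(num_processes=4, image_ids=list(range(20)))
    print(len(results))  # all 20 images preprocessed
```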
Next, we compared the total time of preprocessing and file output with TorchVision and OpenCV. Please look at the left figure. It shows the total time of preprocessing and outputting files. As you can see in the figure, the total time stayed steady even when we increased the number of processes. Also, we compared the average CPU usage during preprocessing and file output. Please look at the right figure. It shows the average CPU usage; the x-axis represents the number of processes. As you can see in the figure, CPU usage stayed at approximately 80% in the OpenCV case. That means we could not make the most of the CPU after we added file output.

Now, let us discuss why parallelism did not improve overall performance after we added file output. As you can see in this figure, after finishing preprocessing, every single thread stores its preprocessed images in a result queue, where they wait for the file output procedure. In this architecture, however, file output is a heavily loaded procedure that takes much more time than preprocessing. The result queue therefore fills up and degrades the throughput. Thus, we decided to increase the number of result queues and observe the performance.

We increased the number of result queues to be equal to the number of processes, made the file output procedure parallel, and compared the total time of preprocessing and file output in both the TorchVision and OpenCV cases. Please look at the left figure. It shows the total time of preprocessing and outputting files as the number of queues increases; the x-axis represents the number of result queues, which is equal to the number of processes. As you can see in this figure, the total time was reduced in both the TorchVision and OpenCV cases by increasing the number of queues for outputting files. Please look at the right figure. It shows the comparison of total time when we changed the number of result queues from 1 to 10, and also preprocessing only. Please note that the number of processes is 10 this time.
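The bottleneck and its fix can be sketched with a toy model. We use threads and in-memory queues for brevity (the evaluation used processes), and the preprocessing and file-output bodies are stand-ins:

```python
import queue
import threading

def run_pipeline(num_workers, num_queues, num_images):
    """Preprocessing workers push results into bounded result queues;
    one writer thread drains each queue. With num_queues=1 the single
    writer throttles every worker once the queue fills; with
    num_queues == num_workers, file output also runs in parallel."""
    queues = [queue.Queue(maxsize=4) for _ in range(num_queues)]
    written = []
    lock = threading.Lock()

    def preprocess_worker(worker_id):
        for image_id in range(worker_id, num_images, num_workers):
            result = f"image-{image_id}"          # stand-in for preprocessing
            queues[worker_id % num_queues].put(result)

    def output_writer(q):
        while True:
            item = q.get()
            if item is None:                      # sentinel: no more work
                return
            with lock:
                written.append(item)              # stand-in for file output

    workers = [threading.Thread(target=preprocess_worker, args=(i,))
               for i in range(num_workers)]
    writers = [threading.Thread(target=output_writer, args=(q,))
               for q in queues]
    for t in workers + writers:
        t.start()
    for t in workers:
        t.join()
    for q in queues:
        q.put(None)                               # stop each writer
    for t in writers:
        t.join()
    return written
```

Matching the number of result queues (and writers) to the number of processes is what let the measured total time approach the preprocessing-only time.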
As you can see in this figure, by increasing the number of result queues, the total time became close to the time for preprocessing only.

Next, we compared the CPU usage after increasing the number of result queues. Please look at the left figure. It shows the average CPU usage of preprocessing and file output as the number of queues increases; the x-axis represents the number of result queues, which is equal to the number of processes. As you can see in this figure, in the TorchVision case, CPU usage increased as we increased the number of queues for file output. On the other hand, in the OpenCV case, CPU usage stayed at approximately 100% as we increased the number of queues for file output. The same trends appeared when we compared the average CPU usage after increasing the number of result queues against preprocessing only. In both cases, CPU usage became close to the preprocessing-only level as the number of queues for file output increased. But in the OpenCV case, when the number of processes is 10, CPU usage reached approximately 100%, which implies that even if we increased the number of result queues and processes further, we could not get any more CPU resources, so the total time might not improve much. Thus, we can say that we should use TorchVision for preprocessing when we can run many processes. On the other hand, when the number of processes is limited, we should use OpenCV.

Next, we are going to evaluate the effect of accelerating preprocessing in the training phase. In the training phase, we assumed that the data scientists had a high-performance computer with multi-core CPUs and GPUs for preprocessing and training. We used the following hardware and software for the evaluation. We adopted Ubuntu this time, since it is better for utilizing the latest open source software. Also, please note that we changed the number of processes from 1 to 12, the same as the number of CPU cores, where we can expect the best processing time. The software specifications are shown in the right table.
This figure shows the evaluation scope in the training phase. We compared the performance of the combination of data preprocessing and model training. For the data preprocessing library, we used TorchVision and OpenCV in CPU mode, and for the model training framework, we adopted PyTorch in GPU mode. Our evaluation scope includes reading files, preprocessing images, and training models.

We compared the total time and resource usage of preprocessing on CPU in parallel and training on GPU with TorchVision and OpenCV. Please look at the left figure. It shows the total time of preprocessing on CPU and training on GPU. Please note that the number of GPUs is 1. As you can see in this figure, the total time stays steady even when we increase the number of processes from 1 to 12. We can explain why this trend happens by observing the resource usage in the right figure. The right figure shows the resource usage of preprocessing on CPU and training on GPU. As you can see in this figure, GPU usage for training was much higher than CPU usage for preprocessing. Thus, we can say that the effect of accelerating preprocessing was weak, since training took much more time than preprocessing.

In order to improve the whole performance in this case, we increased the number of GPUs for training and observed the total time of preprocessing and training with TorchVision and OpenCV. Please look at the figure. It shows the total time of preprocessing on CPU and training on GPU in both the OpenCV and TorchVision cases. Please note that the number of processes is 1. As you can see in this figure, the total time was reduced after we increased the number of GPUs for training. From this figure, we can say that for this use case, we should increase the number of GPUs for training rather than for preprocessing. Also, we can say that we should use the GPU for training only, since training consumes a lot of GPU resources and leaves no room for preprocessing.
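This bottleneck behavior can be sketched with a simple analytic model (the stage times below are made-up illustrations, not our measurements): in a pipelined system, each batch costs roughly the time of the slowest stage, so speeding up preprocessing barely helps while training dominates, whereas adding GPUs attacks the bottleneck directly.

```python
def pipeline_time(preprocess_s, train_s, num_batches):
    """Steady-state time of a two-stage pipeline: CPU preprocessing
    overlaps with GPU training, so each batch costs roughly the slower
    stage's time, plus the faster stage's one-off warm-up."""
    bottleneck = max(preprocess_s, train_s)
    return min(preprocess_s, train_s) + bottleneck * num_batches

# Made-up per-batch times: training dominates (0.9 s vs 0.1 s).
base = pipeline_time(preprocess_s=0.1, train_s=0.9, num_batches=100)

# Doubling preprocessing speed changes almost nothing...
faster_pre = pipeline_time(preprocess_s=0.05, train_s=0.9, num_batches=100)

# ...while doubling the GPUs (halving per-batch training time)
# nearly halves the total.
more_gpus = pipeline_time(preprocess_s=0.1, train_s=0.45, num_batches=100)
```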
Now, let us discuss how to accelerate preprocessing performance in both the serving and training phases; in other words, how we should design the preprocessing system in each phase. In the serving phase, parallelism and queuing enabled us to reduce the whole inference elapsed time. On the other hand, in the training phase, we can say that we need to discuss whether we need to accelerate image preprocessing or not, according to the purpose of the system.

The table below shows the direction for designing a preprocessing system in the training phase. We think there are two design patterns for preprocessing systems. In the first, preprocessing and training are executed continuously, as in this case. The advantage of this design is that we can accelerate the total time. Whether we need to tune preprocessing depends on the situation: we need to investigate whether preprocessing or training degrades the whole performance. The appropriate purpose for this design is training models each time on different data, such as incremental learning. In the other pattern, preprocessing and training are executed in different systems. The advantage of this design is that we can save resource costs. In this case, we should tune preprocessing performance, since we can accelerate the whole performance by accelerating both preprocessing and training. The appropriate purpose for this design is training models on the same data, such as parameter tuning.

In conclusion, we evaluated acceleration techniques for image preprocessing and their effect in a deep learning application. From this research, we got new insights into how to utilize multiple open source software components to get better performance in a deep learning application. From the serving point of view, techniques such as parallelism and queuing could accelerate image preprocessing, and that could accelerate the whole inference performance.
On the other hand, from the training point of view, we need to judge whether to apply acceleration techniques according to the purpose of the system. This is the end of our presentation. Thanks for joining us. If you have any questions or comments, please leave your feedback via text chat. Again, thank you very much.