Hello, welcome to our session today. In this session, we will introduce our lessons learned and insights from combining big data and AI/machine learning technologies, through an actual project in the Nordics in 2018 to detect a kind of poisonous weed. We utilized mainly two kinds of open technologies in this project: one is distributed processing platform technology, namely Apache Hadoop and Spark, and the other is the deep learning framework TensorFlow. We will also introduce some design tips for combining two heterogeneous technologies in this session. It would be great if you take away something useful for you.

My name is Naoto Umemori from NTT Data Japan. My background is mainly IT infrastructure, distributed computing based on open source technologies, machine learning infrastructure, and so on. The co-speaker is Masaru Dobashi, who also belongs to NTT Data Japan. He has a wealth of experience designing and implementing commercial systems using open source software such as Apache Hadoop, Spark, and Kafka. This talk has two sections. The first section will be presented by myself. In it, I introduce our journey of utilizing deep learning for an image detection use case: the project background, motivation, challenges, and the architecture and data pipelines we implemented. The later section will be presented by my co-speaker, Masaru. He will share our lessons learned from the consideration of architecture design. The later section will be slightly higher-level topics; however, we will be happy if it is useful for you.

Okay, let's move on to the first section. This section covers our actual project experience. Now it is time to explain an important proper noun: giant hogweed. I strongly believe most Japanese and Asian people have no idea what giant hogweed is. Giant hogweed is a highly toxic plant originating in the western Caucasus region of Eurasia. It has spread across central and western Europe, and there are sightings of giant hogweed reported from North America too. The sap of giant hogweed is phototoxic and causes phytophotodermatitis in humans, resulting in blisters and scars. Landowners in Europe are obliged to eradicate giant hogweed from their land due to its toxicity and invasive nature. However, it is a very cumbersome process for the landowners to find and remove giant hogweed from large areas of land, because they do it manually.

So the objective of our project was to realize a system that automates the detection of giant hogweed by using drones, machine learning, distributed computing, and a visualization dashboard. In particular, the system has three phases. The first is to collect aerial images taken by drone; this process replaces the cumbersome manual process of hogweed seeking by humans. The second is to store and pre-process the taken images in a data lake, to train the model with a Python script using TensorFlow, and to infer the hogweed with a Spark application that calls TensorFlow. The last is to utilize the resulting data, and especially to visually plot the geo-coordinates where giant hogweed is likely to be onto a map. This is the overview of our project and its end-to-end data flow.

We had to tackle several challenges in this project. Among them, these were three symbolic challenges in applying deep learning to a big data platform to process a large number of images. The first challenge is the data volume. Our customers in this project are local governments in Denmark. Each local government manages a wide area of land, and they must find the giant hogweed in that land. Their total land area is around 3,317 square kilometers, and the agricultural rate of Denmark is about 62.01%, so the estimated target land area is approximately 2,000 square kilometers. It means the drones have to fly over 2,000 square kilometers taking pictures. Then, by our estimation, the total data volume is over 200 terabytes.
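As a rough back-of-the-envelope check, the numbers work out like this. This is a minimal sketch in Python; the per-square-kilometer imagery footprint is a hypothetical figure we back out from the 200-terabyte total, not a number from the project:

```python
# Back-of-envelope estimate of the survey area and the data volume.
total_land_km2 = 3_317        # land managed by the local governments
agricultural_rate = 0.6201    # agricultural share of Denmark's land

survey_km2 = total_land_km2 * agricultural_rate
print(f"area to survey: {survey_km2:,.0f} km^2")   # ~2,057 km^2

# Hypothetical drone-imagery footprint per square kilometer, chosen
# only so the result is consistent with the ~200 TB figure above.
gb_per_km2 = 100
print(f"estimated volume: {survey_km2 * gb_per_km2 / 1_000:,.0f} TB")  # ~206 TB
```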
On the other hand, I guess most AI and machine learning projects run in a single-node environment, such as a single laptop, for quick data analytics, with trial and error to prepare models using tools such as notebooks. However, such machine learning toolsets and libraries may not be good at processing big data. How should data scientists tackle this kind of challenge? This is the first challenge.

The second challenge is preparing supervised data. As usual, lack of supervised data for training is a common story in the machine learning world. It is usually only possible for experts to prepare supervised data, and this is a very time-consuming process. So the challenge is: how do we make this process efficient? By the way, in our case, finding giant hogweed in an aerial image taken by a drone requires the specialized knowledge of biologists. These are sample images of giant hogweed and not. Could you tell which is the hogweed among the images of the positive class? I guess you can't. Unless you are an expert, you can't find the hogweed even among images classified in the positive class, and it would be impossible to do this manually for all the images.

The last challenge is the variety of specialties, meaning the integration of heterogeneous technologies and cooperation among people with different backgrounds. This is the well-known data science Venn diagram. In the case of our project, the domain experts were the biologists who determine the hogweed from aerial images. The people on the mathematics side were the model developers, who are familiar with machine learning frameworks such as TensorFlow. And the people on the computer science side were the model developers and data engineers who are especially familiar with distributed computing, such as Apache Hadoop and Spark in our case. So the challenge was how we should choose proper technologies and toolsets that are good for each of them. This challenge may sound simple in words, but we found it very tough to work on in this project, because when we have different interests, we care about different things; sometimes it is not easy even to have a good conversation. Honestly speaking, communication between the model developers and the model operators was not good in our case.

These are the three challenges we faced. What do you think about our challenges? Do you identify with any of them? If you have any other challenges that you can share with us, I'd be grateful for your comments in the Q&A text chat. Now, considering these challenges, I'm going to show you the system architecture that we designed and implemented on the next slide. This diagram shows an overview of our system architecture. There are two points in this architecture. First, the AI/machine learning infrastructure integrates a mechanism for manual operation to prepare supervised data. Second, although the system basically consists of open source software, some enterprise products are also included. Now, let me explain how to read the diagram.
There are three major frames, from the left: upstream (in other words, the data source), data processing, and downstream. The data processing part in the middle of the architecture consists of four major elements, each of which roughly conforms to the lifecycle of machine learning: preparation, training, inference, and analysis. Analysis in this context means analyzing the inference results. The elemental technologies used in each element look like this. If we focus on the open source software part, we used TensorFlow as the library for training and inference. This project got started in 2018, and the facts that TensorFlow was really exciting at that time and that there were numerous examples of it in the world encouraged us to use TensorFlow. Apache Hadoop HDFS and Apache Spark are responsible for the data store and the processing engine that manages large amounts of data. For the inference part, we call TensorFlow from our Spark application to perform simple distributed inference processing. We considered using distributed TensorFlow; however, we finally decided to use Apache Spark to get scalability easily, because we were more familiar with Spark than with distributed TensorFlow at the time. For the data analysis part, we are using OpenStreetMap, and I'll explain later why and how we use it.

This is the architecture we designed and implemented. We adopted Amazon S3 as the data lake to accumulate massive data, and used Apache Hadoop and Spark to process the data. For the machine learning part, we used TensorFlow for training and inference. In particular, we also used the Apache Spark framework for inference processing to get the benefit of scalability easily. This architecture has four data pipelines: the supervised data preparation pipeline, the training pipeline, the inference pipeline, and the data analysis pipeline. From the next slide, I'll show you the data flow of each pipeline in detail, one by one.

The first pipeline is data preparation. This pipeline is used to prepare supervised data labeled by biologists. In the initial state, there are a large number of raw images stored in Amazon S3. In the first step, an application loads the raw images from Amazon S3, splits each raw image into two-by-two tiles, and stores the split images back into Amazon S3. In the second step, an application pushes the file paths of the split images stored in Amazon S3 to the SAP HANA data mart. In the third step, the biologists use a manual labeling application, which runs on the SAP HANA platform, to prepare the labeled data. In the final step, the labeled data are stored into Amazon S3. This is the data flow of the data preparation pipeline.
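To make the first step of this pipeline concrete, here is a minimal sketch of the two-by-two splitting, assuming Pillow and local file paths in place of the actual S3 I/O; the function name and tile-naming scheme are illustrative, not the project's code:

```python
from pathlib import Path
from PIL import Image

def split_into_tiles(src: Path, out_dir: Path) -> list[Path]:
    """Split one raw aerial image into 2x2 tiles and save them."""
    img = Image.open(src)
    w, h = img.size
    tile_w, tile_h = w // 2, h // 2
    out_paths = []
    for row in range(2):
        for col in range(2):
            # Crop box: (left, upper, right, lower) for this tile.
            box = (col * tile_w, row * tile_h,
                   (col + 1) * tile_w, (row + 1) * tile_h)
            out_path = out_dir / f"{src.stem}_r{row}c{col}{src.suffix}"
            img.crop(box).save(out_path)
            out_paths.append(out_path)
    return out_paths
```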
The point of this data pipeline is the part marked with the red dotted line; I show you the detail of that application on the next slide. This labeling application has a user-friendly UI that non-engineers can operate intuitively. When the biologists judge that this is a giant hogweed, they select a cell on the image; for instance, selecting this cell means that giant hogweed appears in that part of the photo. The point is that this application does not require any specialized IT knowledge. Another point is that it brings the specialized knowledge of the biologists into the system easily. By providing such a data preparation mechanism in advance, we were able to prepare enough data for analysis efficiently. Domain experts often prepare supervised data outside of the system, but it is ideal for model developers, model operators, and domain experts to all use one integrated system. This UI tool is not just a UI tool for data preparation; it plays the role of a gateway for domain experts to integrate their specialized knowledge into the platform.

Next is the training pipeline. A training application reads the label data and split images from Amazon S3, calls the TensorFlow library to do the training, and stores the training results, which are the model and its parameters, into HDFS.

For the inference pipeline, we integrated TensorFlow and Apache Spark to realize distributed inference processing easily. In our approach, the Spark executors natively call the TensorFlow library to infer the result: is this giant hogweed or not? Then the inference results are stored into the SAP HANA data mart for data analytics use. Let me give you a little more information about the processing overview of the inference application. A diagram of the inference processing method is shown here. The inference application calls the Spark driver, and each Spark executor calls the TensorFlow library to perform the inference process. No synchronization happens, and the tasks are distributed in a very simple manner. Distributed inference may seem a bit complicated, but Spark absorbs the cumbersome parts of distributed processing and gives us a way to implement the application simply.
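To illustrate the pattern, here is a minimal PySpark sketch of executor-side inference, assuming a Keras SavedModel, a hypothetical Parquet dataset of pre-decoded tiles, and a hypothetical input shape; it shows the structure of the approach, not the project's actual code:

```python
import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession

MODEL_PATH = "/models/hogweed"   # hypothetical path, reachable from executors

def infer_partition(rows):
    # Load the model once per partition, on the executor itself, so
    # every task runs independently with no synchronization.
    model = tf.keras.models.load_model(MODEL_PATH)
    for row in rows:
        # Hypothetical input shape for the classifier.
        x = np.array(row.pixels, dtype=np.float32).reshape(1, 128, 128, 3)
        confidence = float(model(x).numpy()[0][0])
        yield (row.path, confidence)

spark = SparkSession.builder.appName("hogweed-inference").getOrCreate()
# Hypothetical schema: path (string), pixels (flat float array).
tiles = spark.read.parquet("/data/tiles")
results = tiles.rdd.mapPartitions(infer_partition).toDF(["path", "confidence"])
results.write.mode("overwrite").parquet("/data/inference_results")
```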
Next, the data analysis pipeline. Each inference result has a confidence and the geo-coordinate where the giant hogweed would be, like this one. We put this information on OpenStreetMap. A red icon means it is almost certainly a giant hogweed; specifically, it is an inference result with a confidence value of 90% or more. An orange icon means it is probably a giant hogweed; specifically, an inference result with a confidence value of 80% or more and less than 90%. And finally, a blue icon means that it is maybe a giant hogweed: the confidence score is less than 80%.

So, I've explained the system architecture and the data pipelines in this first section of the talk. What do you think? You may be able to see that the technologies used by each player are very different. Again, the biologists, as domain experts, leveraged the platform to prepare supervised data with a web-based UI that doesn't require IT skills. The model developers used TensorFlow to develop the model. The model operators built a distributed processing infrastructure using Apache Hadoop and Spark to get system scalability, and then performed inference processing using the models prepared by the model developers. By combining the inference results with the location information and visualizing them, we were able to achieve a use case of visualizing the habitat of a dangerous exotic plant species. This is an overview of the project. So what did we learn from this project? In the next section, we would like to share our findings with you.

Hello everyone. I'm Masaru Dobashi of NTT Data. In the later part, I will talk about the lessons learned from the consideration of architecture design in this project. This is a very famous picture and a common view of machine learning systems. As you know, the machine learning code is just one piece of a machine learning system, so how to wrap the machine learning algorithms and how to establish them as a stable system is essential. Basically, machine learning is not specific to big data, but we often use massive data to obtain more appropriate models. So consider: if a machine learning system meets big data, how would the machine learning system change? This slide is a consideration of what happens when we scale machine learning systems. The left side shows the original words, which were found in the original paper; the right side is our consideration of what each part of the machine learning system would change into. In most cases, we need to think carefully about the usability and scalability of the whole machine learning system when we scale out. So in this presentation, I will focus on machine learning applications with scalability. The first topic is the workflow and the second topic is the system architecture.

Okay, first I'll talk about the workflow. This is an architecture image of a machine learning application with scalability, based on our systems. The training process tends to consist of ad hoc or interactive processes for analytics; it prioritizes usability over system performance. On the other hand, the inference process often handles high workloads and may be subject to a severe service level, so it needs stability, sustainability, and scalability. There are also two phases in the machine learning lifecycle: the first is the model development phase and the second is the model operation phase. So first I will talk about these two aspects of the machine learning lifecycle. This is the abstract of the two phases: model development versus model operation. The executors and stakeholders may change between the model development phase and the operation phase; in some cases they may even differ at the company level, for example, when you outsource model development to specialized machine learning vendors. The left side is the model development phase. In this phase, defining the business KPI is important, and we often execute trial and error or prototyping of applications; mathematics and business logic matter most here. On the other hand, in the operation phase, the stable deployment of models into systems is important, and we often evaluate or monitor the inference results and KPI achievement; in this phase, business logic and computer science matter most. There are tools with high affinity for each phase, and the tools with high affinity for scaled machine learning systems vary from phase to phase. This is just an example, and the impression of each tool depends on each individual's or organization's culture. So how do we choose a toolset for each?

Okay, this is a consideration of design patterns of software choices: three patterns of the workflow regarding the combination of the model development phase and the model operation phase. As you can see in this figure, we can define three patterns, and I will briefly introduce each one. This is pattern number one. In this pattern, we select a toolset familiar to the model developers for both the model development and operation phases. The benefit of this pattern is that model developers have a high degree of freedom in developing models. This enables a quick and small start to develop applications, and the interactivity accelerates application development and debugging. On the other hand, this approach's weakness is that it is often difficult to achieve the quality needed to meet system infrastructure requirements.
Also, experimental code tends to remain in the applications, and it is hard to construct development teams who are well trained in enterprise architecting. This is an example of open source software suitable for this pattern: Metaflow, by Netflix. Metaflow is a human-friendly Python framework that helps scientists and engineers build and manage real-life data science projects. Metaflow provides a unified API to the infrastructure stack that is required to execute data science projects, from prototype to production. As you can see, this library provides an intuitively understandable API for Python users, including data scientists. In this data-scientist-friendly framework, we can specify steps and environmental parameters using a decorator-style programming paradigm. So if your team has infrastructure engineers who specialize in developing Python-based enterprise systems, you can leverage Metaflow's benefits.
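As a flavor of that decorator style, here is a toy Metaflow flow; the steps and artifact names are placeholders, not an example from the talk:

```python
from metaflow import FlowSpec, step

class HogweedTrainFlow(FlowSpec):
    """Toy flow: each @step is a pipeline stage, and attributes
    assigned to self are persisted as artifacts between steps."""

    @step
    def start(self):
        self.samples = list(range(10))   # placeholder for data loading
        self.next(self.train)

    @step
    def train(self):
        # Placeholder for actual model training on self.samples.
        self.score = sum(self.samples) / len(self.samples)
        self.next(self.end)

    @step
    def end(self):
        print(f"finished, score={self.score}")

if __name__ == "__main__":
    HogweedTrainFlow()
```

Decorators such as @resources can then declare per-step environmental parameters (CPU, memory) without changing the step logic itself.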
This is pattern number two. In this pattern, we use developer-friendly tools in the development phase and operator-friendly tools in the operation phase. This pattern's benefit is that the technology is easy to use for both model developers and operators, and we can achieve clear responsibilities for each side. There is a chance to remove experimental code during refactoring of the applications, and the benefit of scalability is included in the operation phase. But this pattern's weakness is the conversion cost: for example, the difficulty of refactoring, limitations on exporting and importing models, and so on. The degree of freedom is also a little low; it is somewhat difficult to manage the models and pass the training results to the inference pipeline, because the systems may be separated from each other. This is an example of a use case: implementing MLOps on Spark. This company used machine learning and deep learning technologies for various services, and used Spark as the platform to construct the pipeline, for example pre-processing ETL, training, predicting, and so on. Why did they choose Spark? They said Spark is well suited for ETL, and we consider that Spark supports traditional machine learning algorithms well. They also used Horovod to train machine learning models, and such a process can be chained with other processes. They used Petastorm, which converts datasets from Spark to a format readable by the machine learning libraries.

This is pattern number three. In this pattern, we select a toolset familiar to the model operators for both the model development and operation phases. This pattern's benefit is that it is easy to operate the systems and scale out, because the benefit of scalability is included in the infrastructure. On the other hand, the weakness of this pattern is that the model developers may find it a little painful, because the API and framework may not be well known to them, and it is sometimes hard to use state-of-the-art approaches. This may be because research projects tend to use a certain framework or library to prototype algorithms. This is an example of a use case: Twitter leverages Scala for feature engineering. Twitter has a feature store, and they use Scalding, a Cascading-based framework, to consume the offline feature data. They use Scala-based libraries to abstract the feature catalog, and the features are stored in Apache Hadoop HDFS. They chose Scala as the basic language for feature engineering, and we consider that Scala is used in several software products and systems for distributed computing and is well suited to platform engineers who have knowledge of the JVM. This approach is powerful when your team has machine learning engineers who are well trained in JVM-based system infrastructures. This is an example of open source software for this pattern: BigDL, by Intel. BigDL is a distributed deep learning library for Apache Spark. We used BigDL to train a simple neural network model on our Spark-based cluster in another project. As you can see, BigDL provides a simple API to define a neural network model, and it is easy to leverage Spark's features: scalability, stability, and the operational knowledge of the JVM-based architecture.

This is a reprint of the patterns of software choices. I talked about the three main patterns. Through this consideration, we have some lessons learned. First, it is difficult to create an ideal architecture that makes it easy for every stakeholder to work, while keeping in mind the division of responsibility between model development and model operation. Second, since the appropriate form varies from company to company and organization to organization, individual architectural study is necessary for your own team.

Okay, next I'll talk about the system architecture of this project. This is a reprint of the architecture and data pipelines of this project. We defined four data pipelines: preparation, training, inference, and analysis. This design was not so bad, but it was just version one, and we had some pain points with this architecture. That is why we redesigned the architecture and data pipelines; this is version two. We tried to realize both scalability and flexibility of model representation, using Spark as the framework to deploy all kinds of models. Here are some key points, which I will briefly introduce. First, we configured an evaluation process for each stage. Second, we monitored the deterioration of the inference results; managing and using the proper model is important, and traceability of the model lifecycle is also important. Also, we often use a wide variety of toolsets to deploy models, so that is a point to be careful about. Finally, scalability should be considered everywhere in the system. In this talk, we focus on the inference processes.

This is part of the internals of the inference pipeline. It is not enough just to feed in the data to be inferred and get the result; for example, the system needs to be designed to include a flow to evaluate the results. In this design, we use two kinds of data: the images themselves for the inference, and also images for sanity checking. We distributed the inference using Spark's features and did not depend on a model serving system. After obtaining the outputs of the inference processes, we stored several kinds of results in the data lake on HDFS, aggregated them, and exported the data to the visualization systems. Finally, data scientists can obtain the information needed to detect deterioration of the models.

Now I will talk about the inference process in more detail. The abstraction of the functions used in applications is very important. We call the inference functions, which are defined in the model driver, with the trained model; the model driver is defined in a class separated from the Spark application. The signature of the inference function is generalized to be independent of any particular machine learning method or algorithm.
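Here is a minimal sketch of what such a model driver abstraction could look like in Python; the class and method names are hypothetical, not the project's actual interface:

```python
from abc import ABC, abstractmethod
from typing import Any, Sequence

class ModelDriver(ABC):
    """The Spark application codes against this interface only, so how
    inference actually happens (a locally loaded TensorFlow model, a
    model serving system, another framework) remains swappable."""

    @abstractmethod
    def pre_process(self, raw: Any) -> Any: ...

    @abstractmethod
    def infer(self, batch: Sequence[Any]) -> Sequence[float]: ...

class TensorFlowDriver(ModelDriver):
    """One possible implementation that loads the model directly."""

    def __init__(self, model_path: str):
        import tensorflow as tf      # imported lazily, on the executor
        self._model = tf.keras.models.load_model(model_path)

    def pre_process(self, raw: Any) -> Any:
        import numpy as np
        return np.asarray(raw, dtype="float32")

    def infer(self, batch: Sequence[Any]) -> Sequence[float]:
        import numpy as np
        return self._model(np.stack(batch)).numpy().ravel().tolist()
```

Because the signature only deals in generic batches and confidence values, swapping in a serving backend means writing another ModelDriver implementation, while the Spark driver program stays unchanged.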
This is an example of the Spark application. In this case, the DataFrame holds metadata as well as images, to filter or select the required data. And as you can see in this figure, we defined the pre-processing, inferring, and post-processing methods externally from the driver program of Spark. This is a brief summary of the sequence of the Spark application: from the Spark driver application, we call the infer and infer-image methods, which are defined in the model driver, and the model driver is defined externally to the Spark driver application. This abstraction can represent various inference systems, which gives us several benefits; for example, you can use a model serving system instead of directly loading models in your application, and we can easily extend such features.

Next, I will talk about the internals of the inference pipeline and the output data. Detecting and storing the deterioration of confidence is important for the machine learning lifecycle, so we stored results with low confidence on HDFS separately from the other results. This kept the data-lake-centric architecture simple and allowed us to extract candidate data to be investigated and used for retraining. As a result, we were able to see at a glance the results that we need to pay attention to.

Okay, let me summarize this presentation. In the former part, we talked about our journey applying deep learning to giant hogweed eradication. In the later part, I talked about considerations of architecture design for adapting deep learning technologies to big data infrastructures, namely the workflow and the system architecture. Thank you for listening to our presentation.