Hello, everyone. My name is Masahiro Tanaka from University of Tsukuba. I will talk about Pwrake, which I am developing as a distributed workflow engine for eScience. First, let me introduce myself. I am the author of NArray. How many people in this room know about NArray? Thank you. Unfortunately, I am not going to talk about NArray today. I originally majored in astronomy, but now I am working on computer science as a research fellow at the Center for Computational Sciences, University of Tsukuba, since 2009. Let me introduce the Center for Computational Sciences. Its research fields are very wide. The computer science field includes high performance computing and computational informatics, and the computational science field includes particle physics, astrophysics, material science, life science, biology, and environmental science. The center also operates three supercomputers. By the way, do you know SC10, the conference on supercomputing? SC10 is held next week here in New Orleans. It is so large that more than 10,000 people participate, and the venue is very close to here. Our institute, the Center for Computational Sciences, University of Tsukuba, exhibits at SC10. Now for today's topic: Pwrake, a distributed workflow engine for eScience. I start with the introduction. Big science tends to be conducted under international cooperation. Typical examples are the LHC, the Large Hadron Collider with the ATLAS experiment, which is a particle physics experiment, and ALMA, the Atacama Large Millimeter Array in Chile, which is a radio astronomical telescope facility. Research in international collaboration requires geographically distributed computer resources. This slide shows an example from particle physics: grid computer systems are developed to share the data produced from lattice QCD simulations. The point here is that computer and network infrastructure is indispensable for scientists to access large-scale scientific data.
E-Science is a term for large-scale science which requires distributed computing. From Wikipedia: E-Science is computationally intensive science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing. The term was created by John Taylor, Director General of the United Kingdom's Office of Science and Technology, in 1999. In order to achieve eScience, distributed computing is a key issue. Distributed computing is also important due to the end of Moore's-law scaling, that is, the performance of a single core no longer increases. This topic is famous since Jim Weirich talked about it at RubyConf 2008 and Matz talked about it in his keynote at RubyConf last year. In order to make use of multiple cores, you need to parallelize your program. For parallel programming, scalability is a key issue. This is the well-known Amdahl's law, which must be considered for parallel computing. If p is the proportion of the program that can be made parallel, then 1 - p is the proportion that cannot be parallelized. If N processors are used, the speedup is expressed as S(N) = 1 / ((1 - p) + p/N), and this figure shows there is a maximum speedup of 1 / (1 - p) when p is less than 1. So in parallel programming it is important to maximize the parallel portion of the program. There are several parallel programming models. A recently popular model is MapReduce. In scientific computing, MPI and OpenMP are commonly used. In some cases, general thread programming is used, and new parallel programming languages are also emerging. These parallel programming models are not easy, but in contrast, independent processes are easily parallelized; that does not require parallel programming. Assume that you have a set of input files. In this case, independent processes can be parallelized without parallel programming. However, if the task dependency becomes complex, you need a workflow system. So let's move on to the next topic: scientific workflow.
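Amdahl's formula above can be checked numerically; here is a minimal sketch in Ruby (the method name is mine):

```ruby
# Amdahl's law: speedup S(N) = 1 / ((1 - p) + p/N)
# p = parallelizable fraction, n = number of processors
def amdahl_speedup(p, n)
  1.0 / ((1.0 - p) + p / n)
end

amdahl_speedup(0.9, 16)    # 16 cores give only 6.4x
amdahl_speedup(0.9, 1024)  # approaching the 1/(1-p) = 10x ceiling
```

Even with p = 0.9, thousands of cores cannot beat a 10x speedup, which is why maximizing the parallel portion matters.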
Scientific workflow is a description of scientific data processing and analysis. It is similar to the build task of a program; however, a scientific workflow tends to be more complicated. A workflow is often represented as a graph. A graph consists of nodes and edges. Tasks and files are represented by nodes. If a task reads or writes a file, the task and the file are connected to each other by a directed edge, which is indicated as an arrow. In this presentation, a task node is indicated by a circle and a file node is indicated by a rectangle. A workflow graph is known to be a directed acyclic graph, a DAG. A DAG is a graph in which each edge has a direction and there is no cyclic loop. Let me show an example of a scientific workflow in astronomy. This is Montage, a tool for producing a custom mosaic image from multiple shots of images. Montage is packaged as a collection of programs. This is a schematic diagram of the Montage workflow. The workflow is divided into three parts. The first part is the projection of an image into celestial coordinates. The second part is the correction of background brightness, and the final part is combining all the resulting images. Each part consists of several steps of tasks. Basically, each process deals with one image, so the processes can be executed in parallel. This is a graph of the Montage workflow. The direction of the graph is rightward. The leftmost nodes are input files and the rightmost node is the final output image. You can see that the graph of a scientific workflow is quite complex. In order to execute such a complex workflow properly, a workflow system is required. A workflow system needs to invoke tasks based on their dependencies, and it must assign each process to one of the available computers. If tasks are independent, the system executes them in parallel. So far, scientific workflow systems have been developed mainly for computing grid systems. The problem is that a computing grid system is hard for scientists to construct, so I think the grid system is not a daily infrastructure for scientists.
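Executing a DAG means visiting prerequisites before the tasks that need them. A toy sketch of this in Ruby, with hypothetical Montage-like node names (not the real Montage programs):

```ruby
# prerequisites of each node in a tiny workflow DAG
# (names are illustrative: raw images -> projection -> combine -> mosaic)
DEPS = {
  "mosaic.fits" => ["combine"],
  "combine"     => ["proj_a.fits", "proj_b.fits"],
  "proj_a.fits" => ["project_a"],
  "proj_b.fits" => ["project_b"],
  "project_a"   => ["raw_a.fits"],
  "project_b"   => ["raw_b.fits"],
}

# depth-first post-order gives a valid execution order:
# every node appears after all of its prerequisites
def execution_order(node, deps, order = [], seen = {})
  return order if seen[node]
  seen[node] = true
  (deps[node] || []).each { |d| execution_order(d, deps, order, seen) }
  order << node
end
```

A real workflow engine additionally runs independent branches (here, the two projections) in parallel rather than in this serial order.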
If you want to execute a workflow using a workflow system, you need to define the workflow in a language that the system recognizes. Many scientific workflow systems employ XML to define the workflow graph. The problem here is that humans cannot write such complex XML by hand, so it requires a program to generate the XML. As a workflow definition language, a Makefile is a good solution, since make is a domain-specific language to define task dependencies. Make has suffix rules, which define multiple tasks at once based on file extensions; this is useful to avoid redundancy. Another useful feature of make is skipping finished tasks according to the timestamps of created files. This feature enables resuming an interrupted workflow. There is an excellent make-based workflow system: GXP, a tool to manipulate cluster systems efficiently. It is written in Python. As one of its features, GXP includes a GNU-make-based workflow system. It enables parallel and distributed execution of workflows. At first, I used GXP make to execute workflows. However, I found that a Makefile is not enough to write a scientific workflow. This is because make is a build tool. Usually, a Makefile is written for a certain set of source files. If the source files are different, the Makefile will be a bit different, and the same Makefile is executed repeatedly. These features come from the fact that make is a build tool. But a scientific workflow has different aspects. The first aspect is that the same workflow is reused for different sets of input files. The rule feature solves this to a certain extent, but it is not enough, because scientific tasks are not always determined from file names. The second aspect is that the task dependency is determined not only from file names but also from parameters. A typical example is science data associated with geometry. Assume that I have a set of files and each file is characterized by a two-dimensional region. I have to process the overlapping area between two files.
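Rake offers the same two conveniences as make. A minimal sketch of a suffix-style rule in a Rakefile, with hypothetical file extensions (`*.in` to `*.out`):

```ruby
require 'rake'
include Rake::DSL

# analogous to a make suffix rule: build any *.out from the matching *.in;
# as a file task, it is skipped when the target is newer than the source,
# which is what allows resuming an interrupted workflow
rule '.out' => '.in' do |t|
  # t.name is the target (*.out), t.source is the prerequisite (*.in)
  File.write(t.name, File.read(t.source).upcase)
end
```

Asking Rake for any `*.out` target then synthesizes and runs the matching file task from the rule.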
In this case, the workflow graph looks like the right figure. However, such a dependency cannot be resolved from file names alone, so this case requires programming. The third aspect is the case where the entire workflow is unknown at the beginning of the workflow. This happens when the set of output files is determined only after a former task has finished. In such a case, dynamic task definition is required. Dynamic task definition can be done with make by dynamically creating a Makefile during the execution of a Makefile. I feel this is tricky because it requires another tool such as awk to create the Makefile. So a scientific workflow requires a powerful and flexible workflow definition language. You probably know the solution. What is it? It's Rake. Rake is a build tool like make. It has a powerful internal DSL: it provides the programming power of Ruby. As you know, a task definition in a Rakefile looks like this, and then the workflow graph is drawn like this. The programming power of Ruby provides quite powerful abstraction of task definitions. This example shows the definition of multiple tasks by writing them in a loop. What about the dynamic task definition I showed before? How do you write it in a Rakefile? To achieve dynamic task definition, I considered defining a task inside a task action. In this example, task B is defined in the action part of task A. However, this example does not work, since no task depends on task B. How to fix it? I use the invoke method of the Task class. Here, the task B object is assigned to the variable b, and then task B is invoked immediately after its definition. This example actually works well. This is one of the ways to achieve dynamic task definition. Another issue is parallel execution. Rake has a built-in multitask method. If you define your task with the multitask method, its prerequisite tasks are invoked in Ruby threads. However, the multitask method has a problem: it creates the same number of threads as there are prerequisite tasks.
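The dynamic-definition pattern described above can be sketched as follows (task names and the logging variable are illustrative, not from the talk's slides):

```ruby
require 'rake'
include Rake::DSL

$log = []  # records execution order, just for illustration

task :a do
  $log << :a
  # define task :b dynamically, inside the action of task :a
  b = task(:b) { $log << :b }
  # nothing depends on :b, so it must be invoked explicitly
  # right after its definition, or it would never run
  b.invoke
end
```

Invoking task :a runs its action, which both defines and immediately invokes task :b.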
If there are too many prerequisite tasks, the threads consume too many computer resources, and this causes system trouble. drake is a solution for this problem. drake has a feature that only a specified number of tasks are invoked in parallel. drake has another feature: all independent tasks are automatically parallelized. This means the multitask method is no longer necessary. drake's implementation is like this. At the beginning of execution, drake starts the specified number of worker threads. drake repeatedly traverses the task tree and gets one executable task. After drake enqueues a task, one worker thread dequeues it and executes it. When the task finishes, the worker thread enqueues a message to a queue, which is used to notify the timing of the next task search. However, drake is not enough for a scientific workflow system, because drake does not have the following functions: first, remote process execution, and second, dynamic task definition, since drake does not allow the invoke method. drake also has a performance issue. This is the summary so far. We need a powerful scientific workflow tool. For this purpose, the existing pieces are Rake, which is a powerful build tool, and drake, which enables parallel execution. The missing pieces are remote process invocation for distributed computing, and scalability, which is required for a workflow system. Our approach is to employ the Rakefile as a workflow definition language and to develop an extension to Rake, called Pwrake, for parallel and distributed computing. In addition, to achieve scalable I/O performance, which is an important issue for data-intensive workflows, we employ Gfarm, a wide-area distributed file system, which I will show you later. This is our work: Pwrake. Pwrake is a parallel and distributed workflow extension for Rake. You can access the repository at GitHub. Pwrake's first feature is the same syntax as Rake, which means that if a workflow is written for Rake, it also works with Pwrake.
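The fixed worker-pool scheme described above can be sketched with two Ruby queues; this is a minimal illustration of the idea, not drake's or Pwrake's actual code:

```ruby
NUM_WORKERS = 4          # only this many tasks run in parallel
task_q   = Queue.new     # executable tasks found by the scheduler
result_q = Queue.new     # finished-task notifications back to the scheduler

workers = Array.new(NUM_WORKERS) do
  Thread.new do
    # each worker dequeues one task at a time; nil is the stop signal
    while (job = task_q.deq)
      result_q.enq(job.call)  # run the task, then notify completion
    end
  end
end

# the scheduler enqueues executable tasks (here, 100 toy jobs)
jobs = (1..100).map { |i| -> { i * i } }
jobs.each { |j| task_q.enq(j) }

# wait for one notification per job, then shut the pool down
results = jobs.map { result_q.deq }
NUM_WORKERS.times { task_q.enq(nil) }
workers.each(&:join)
```

The key difference from multitask: the thread count is fixed at NUM_WORKERS regardless of how many tasks are pending, so resources stay bounded.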
The second feature is that Pwrake parallelizes all the Rake tasks defined by the task or file method. This feature is the same as in drake. The third is that Pwrake replaces the sh method to invoke remote processes via SSH. The final one is scalability. We employed SSH for remote process invocation because it is secure and the SSH port is probably accessible in most cases, and I implemented a simple SSH connection class due to performance issues. Here, let us take a look at how Pwrake tasks are parallelized. The implementation is based on worker threads, as in drake. However, Ruby threads use only a single core, not multiple cores, due to the GVL. The point is that processes invoked from the sh method do use multiple cores, and they are also distributed to remote hosts through SSH. The plot shows a performance evaluation with a workflow consisting of 1,000 empty tasks. We see a performance improvement over drake by a factor of 10. Now, some of you may be curious about file access in distributed systems. Our approach is to use a distributed file system. Distributed file systems provide not only file sharing but also consistent file timestamps, which are required for skipping tasks based on timestamps. Many distributed file systems also provide optimized I/O performance. I/O performance is important for data-intensive scientific workflows. The left figure shows the case of an NFS file system. In this case, files stored in a single storage are accessed from many cores simultaneously, so storage I/O becomes the bottleneck. Instead, as shown in the right figure, access to storage can be parallelized. This improves the scalability of the workflow. Therefore, a distributed file system is necessary for a parallel and distributed workflow system such as Pwrake. Among many distributed file systems, we employ the Gfarm file system. Gfarm is a wide-area distributed file system. It integrates the local storage of every computer node and provides a file system with a single global namespace. Gfarm is secure enough to integrate geographically distributed storage.
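The idea behind the sh replacement can be sketched as wrapping a shell command into an ssh command line; the method name and hosts here are mine, not Pwrake's actual API:

```ruby
# build an ssh command line that runs `cmd` on `host`
# ("-x" disables X11 forwarding, which remote batch jobs do not need)
def remote_command(host, cmd)
  ["ssh", "-x", host, cmd]
end

# a remote-aware sh would then execute something like:
#   system(*remote_command("node01", "mProjExec input.fits"))
```

Pwrake itself keeps persistent SSH connections rather than spawning one ssh process per task, which is the performance issue mentioned above.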
Gfarm is developed by Professor Osamu Tatebe at University of Tsukuba as open source software. Since Gfarm uses the local storage of each computer node, it has a unique feature: excellent I/O performance is achieved when it uses local I/O. In order to use local I/O, a process must be assigned to the computer node where the accessed file exists. This task assignment is a role of the workflow system, so I implemented locality-aware task assignment in Pwrake. This figure shows the diagram of locality-aware task assignment in Pwrake. The point here is that there are sub-queues corresponding to the host where the accessed file is stored. This is the result of a performance evaluation with the Montage astronomical image processing workflow, which I showed before. The horizontal axis is the number of cores and the vertical axis is the elapsed time of the workflow, both in log scale. The upper line is the result of the NFS case. In this case, the elapsed time increases even though the number of cores increases; you can see that NFS does not scale. On the other hand, in the Gfarm case, the elapsed time decreases as the number of cores increases. You can see that the workflow achieved excellent scalability by using Gfarm. The three different lines show different locality strategies. With locality-aware task assignment, a speedup of about 20 percent is observed. I will show you a demo of Pwrake here. This is a demo of the Montage workflow using a computer cluster at our laboratory, with a simple demo interface using WEBrick. The left images are the input images, scaled down to one tenth in size. This is the Rakefile describing the workflow, consisting of 250 lines. This workflow uses five nodes; each node has two cores. I push the start button here and the workflow starts. This is the graph of the workflow. Red nodes are the currently processing nodes. When a task finishes, the color of its node changes. Now, tasks are dynamically created. Again, tasks are created, and the created tasks are processed. It finishes, and I push here.
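The per-host sub-queue idea can be sketched like this; the class and method names are mine, a simplified illustration rather than Pwrake's actual scheduler:

```ruby
# per-host sub-queues: a worker prefers tasks whose input files
# are stored on its own host, and falls back to other work otherwise
class LocalityQueue
  def initialize
    @host_q = Hash.new { |h, k| h[k] = [] }  # tasks pinned to a host
    @any_q  = []                             # tasks with no locality
  end

  def push(task, host = nil)
    (host ? @host_q[host] : @any_q) << task
  end

  # called by a worker running on `host`
  def pop(host)
    q = @host_q[host]
    return q.shift unless q.empty?   # local task: uses local I/O
    @any_q.shift || steal(host)      # otherwise any task, or steal
  end

  private

  # no local work left: take a task meant for another host
  # (remote I/O is slower, but an idle core is worse)
  def steal(host)
    _, q = @host_q.find { |h, tasks| h != host && !tasks.empty? }
    q && q.shift
  end
end
```

A worker on hostA drains hostA's sub-queue first, then the unpinned queue, and only then takes tasks pinned elsewhere.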
You can see the result image. Pwrake is under development. Among future plans, we are studying executing workflows on geographically distributed systems. Fault tolerance is also a future plan. In conclusion, Rake is powerful enough to be used as a scientific workflow language. As a scientific workflow engine, we developed Pwrake, a parallel and distributed workflow extension for Rake. For scalable I/O performance, Gfarm, a wide-area distributed file system, is a good solution. Using Pwrake and Gfarm, good scalability was actually observed. That is all. Thank you.