Hello, good afternoon. Thank you for coming. I will talk about fighting against chaotically separated values with Embulk.

So, first of all, let me introduce myself. My name is Sadayuki Furuhashi. People call me Sada. I was originally born in Japan and moved to the U.S. four years ago. I founded a company called Treasure Data, which is located in Silicon Valley, and I'm one of the three co-founders of that company. I'm also an open-source hacker, and I have created several projects. One of them is called Fluentd. This is a log collection tool, like syslog, but it collects data in a structured format, which is JSON. How many of you know about Fluentd? Oh, okay. Then it's a very good time for me to introduce it to you. Another one is Embulk, which I will talk about today. This one is for batch-oriented collection of structured data. And another project is called MessagePack. This is like JSON as a serialization format, but because it's binary-based, it's faster and smaller. Do you know about the MessagePack project? Some of you. Okay, great. Thank you.

So today's talk is about Embulk. What's Embulk? Embulk is an open-source tool to load data in parallel. Loading data means moving data records from A to B. A can be storage, services, or NoSQL databases, and B can also be storage, databases, or services. The point is plugins: you can plug in the sources and destinations so that you can support a wider variety of storages. And the goal is to make data loading easy, because it used to be very painful.

The reason why it's painful is this. For example, you have a 10-gigabyte CSV file and you want to load it into PostgreSQL. In our company, Treasure Data, this always happens: people import gigabytes of data every day, and they want to analyze that data. But there is a problem. First you run a script and load data into PostgreSQL, but it fails, because there are broken records, the time format is not what was expected, or there are some unexpected lines. For example, one tool supports this time format but does not support that time format, which is the ISO format. Or some tool generates a CSV file with an unprintable character at the end (shown as a box), but another tool doesn't recognize it. So we need normalization to clean up that data. Then you run it again, and it probably fails again, because of another problem: for example, in a later record you find "inf" as data, but you want to convert it to "Infinity" so that it can become the float value infinity. So you fix it, and you need to retry again and again. Then maybe you find that some data was accidentally loaded twice. It's so painful.

Okay, but anyway, we succeeded in setting up the script today. Then you register the script with cron so that it loads data every day, or every hour. Then one day it somehow fails with another problem: for example, the file includes an invalid UTF-8 byte sequence, the script just stops, and the data is not there. You don't want to have to think about this. What we want is simply for the data to be loaded there. But because of the data formats, these problems happen, and we want to avoid them.

Another problem is this. Let's say you have gigabytes of data in many, many large files. Oftentimes scripts are slow, because you don't want to spend time optimizing the scripts. But done the naive way, if one file takes one hour, it will take one month to load everything. It never finishes. So that is also a big pain.
And there are a lot of data formats in the world: CSV, JSON, various log formats. There are several scientific data formats, like HDF5 or Hadoop SequenceFile. And there are also some new storages these days, like MongoDB and Elasticsearch, which were introduced only recently. But you don't want to rewrite your scripts every time you add one of these storages.

So the problems are: parsing files correctly is difficult, and even CSV comes in many variant formats; error handling is difficult; a transaction model, that is, idempotent retrying, is hard to implement, and without it you might accidentally import data twice; performance is hard to optimize; and there are many formats and many storages in the world.

As a company, we provide a cloud-based data management service. You import data, then you can query it using SQL or apply machine learning on it, and we push the results back to your system. That's the service we provide as a company. And what we found is that customers wanted to analyze their data quickly, but because of the data format problems and storage format problems, it took time to import the data. And we needed to write scripts for them every time, every single time. It's hard work, and we don't want to do that.

One tool, Fluentd, solved streaming data collection. This is another open-source project I started. It is used a lot, especially in Japan, by big companies like Nintendo; the Google Cloud Platform uses Fluentd internally, and Amazon Web Services also uses Fluentd. But this tool is for streaming data collection. We also need batch-oriented data collection: loading 10 gigabytes to somewhere in one shot. So that's a different problem.

So the solution is this. Again, Embulk is an open-source, plugin-based parallel data loader that makes data loading easy, a lot easier. Plugin-based: this is very important. Plugin-based means that data sources and data destinations are pluggable. You can write plugins, or download released open-source plugins from the web, and those plugins support many kinds of data inputs. And Embulk itself is a framework, a reliable framework that takes care of retrying, the connections between plugins, transactions, and parallel execution. Plugins don't have to think about that, because Embulk as a framework provides it.

So let's try a demo. Suppose Embulk is already installed. Then the command embulk example generates a sample configuration for you: embulk example ./demo generated those files. Let's see this file. It's a very typical CSV file, compressed with gzip. It also generated a configuration file that includes the path to this file. Then you run embulk guess. This command reads some contents from the file, guesses the format, and generates a proper configuration file for you. So this time, this CSV file is comma-delimited, and the quote character is a double quotation mark. The first line holds the column names, and there they are. The first column is id, whose type is integer; the types of the other columns are guessed like this. Then there are two timestamp columns, one is time and the other is purchase, with different timestamp formats. This means that Embulk guessed this timestamp format and that timestamp format; this one is actually the timestamp for June 27, 2015.
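For reference, a configuration guessed from a CSV file like this one looks roughly like the sketch below. It follows the shape of the embulk guess output for the bundled example; the path prefix is a hypothetical placeholder, and the exact columns depend on your file.

```yaml
in:
  type: file
  path_prefix: ./demo/csv/sample_     # hypothetical path to the generated sample files
  decoders:
  - {type: gzip}                      # the sample file is gzip-compressed
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','                    # comma-delimited
    quote: '"'                        # double quotation mark as the quote character
    skip_header_lines: 1              # the first line holds the column names
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}    # the second, shorter timestamp format
    - {name: comment, type: string}
out: {type: stdout}
```

Note the two different guessed timestamp formats: exactly the kind of detail that is tedious and error-prone to write by hand.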
So writing this configuration correctly by hand is hard, because you often get some of the settings wrong. But Embulk guesses them for you, so you can start from here. Next, you run embulk preview. This command reads some of the data and shows you how it will be loaded. This time it looks like this. Then, if you think this is okay, you run embulk run, and it actually loads the data. Yes, like this. So this time, as the output I used the stdout plugin, which just dumps everything to the console. But this is plugin-based again, so you can use other output plugins, like the PostgreSQL plugin. So let's use this: I set the type to postgresql, the host is here, and run embulk run. Yes. It outputs a lot of messages; that's because execution is parallelized and optimized. Then you can find the data in this table. And the column types are also set up correctly: those two timestamp columns are loaded successfully.

And a CSV file can be more complicated than this. For example, usually CSV files follow this quoting rule, but in some cases a CSV file uses this kind of escaping. And oftentimes a CSV file has comments, like "created by Sada". Yes, it's a comment. Yes, this happens; this always happens. But even in this case, Embulk can guess the file format for you. Yes, like this: the escape character is recognized, and the sharp sign (#) is the marker for comments. Then embulk run, or I should say embulk preview, will load the data correctly. Yeah. So the data is now here, and the quoting is parsed successfully. Or, if the data includes broken records, like this garbage, Embulk automatically detects them and skips them instead of failing the whole load. Yeah, it skips this line. What you want to do is load most of the data successfully and put the broken records aside so that you can take care of them later. So Embulk handles those problems.

Okay. So this is Embulk's plugin architecture. Embulk itself is a small framework, and most of the features are implemented as plugins. One kind is the input plugin, and one kind is the output plugin. There are also filter plugins, which convert the data structure, skip records, or skip columns. And there are guess plugins, which guess a single file's format and generate the configuration file. And within input plugins there is a further separation. If the data input is record-based, such as PostgreSQL, it just inputs records. But oftentimes you have file-based inputs, like CSV files. In this case, the input is divided into three stages. The first is the file input, like a local file, Amazon S3, HTTP, or FTP; those plugins read the raw data. The next stage is to decode it, like decompressing it using gzip or other compression formats. Then the final stage is to parse the format, using the CSV parser, the JSON parser, or parsers for other formats. Output is organized the same way: internally it has a formatter to format the file, then an encoder to compress or encrypt it, and then the file output plugin writes the data to the server.

These are examples of input plugins, such as Amazon S3 and MySQL. They are all released as plugins on the web, so I think you can find most of the plugins you need there. Parsers also have a lot of support on the web as open-source plugin projects. The good thing about plugins being open source is that you don't have to write everything from scratch. You can copy from those plugins and create your own, or if you find a small problem, you can contribute to those projects with a pull request.
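To make that layering concrete, here is a hedged sketch of a configuration that exercises every stage: a file input (Amazon S3) with a decoder and a parser on the input side, and a formatter plus an encoder in front of a file output. The bucket name, paths, and credentials are hypothetical placeholders, and option names can vary between plugin versions.

```yaml
in:
  type: s3                      # file input stage (embulk-input-s3)
  bucket: my-log-bucket         # hypothetical bucket
  path_prefix: logs/2015-06-
  access_key_id: YOUR_KEY       # placeholder credentials
  secret_access_key: YOUR_SECRET
  decoders:
  - {type: gzip}                # decoder stage: decompress the raw bytes
  parser:
    type: csv                   # parser stage: interpret the bytes as CSV
    columns:
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: path, type: string}
out:
  type: file                    # file output stage: write files locally
  path_prefix: ./archive/out
  file_ext: csv.gz
  formatter: {type: csv}        # formatter stage: render records as CSV
  encoders:
  - {type: gzip}                # encoder stage: compress before writing
```

Swapping type: s3 for type: file, or the output for type: postgresql, replaces only that stage; the rest of the pipeline stays the same.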
Then those small efforts pile up, and other users can take advantage of your efforts. The output side is also the same. A funny thing is that someone said, "I want to put the data into an Excel file," and he created a plugin. It's actually used a lot.

And there are filter plugins, which convert the records. One use is to filter columns out: say the input data includes a password column, and you want to remove it before putting the data into a cloud service. If a column includes JSON, you may want to flatten it into columns. Or converting a user-agent field, like a Firefox user-agent string, into browser name, OS name, and so on; that is also implemented as a filter. Or parsing query strings: this is very useful when you have big access log data and want to flatten the query parameters into fields before loading them into databases. Another useful one is the hashing filter plugin. With this plugin you can hash user names before loading them into databases, so that no one can recover the original data, for security. But you can still join the data across different sources.

One actual use case is like this: you use the Embulk PostgreSQL input plugin, which I created, and another filter plugin, a column filter created by someone else, to remove some columns. Then you apply an encryption filter so that you can encrypt the user ID, user name, and other sensitive information. Then you load the records into Elasticsearch using the Elasticsearch output plugin, which is written by yet another person. Use case 2 is loading a lot of log files stored on Amazon S3 into a cloud-based data analytics service. A cloud-based analytics service is something like Treasure Data, Google BigQuery, or Amazon Redshift. You can use Embulk to load data from S3 into those services.

Then, an interesting plugin type is the executor plugin. Embulk itself actually does not execute plugins; Embulk calls an executor plugin to execute plugins. So this is a sort of meta-execution plugin. Using the Embulk MapReduce executor plugin, the tasks are distributed across machines, and huge data is loaded using those distributed machines in parallel. Using this, you can load hundreds of gigabytes of data. As a company, we also provide Embulk as a service, so that you can call a REST API to run Embulk. Embulk then loads data from your databases, we process the data, and the results are pushed back to your database through Embulk.

The internal architecture is maybe a bit complicated, but I think this part is good to know. A plugin implements a transaction and tasks. A transaction is the overall control of one bulk load, so it has a begin and a commit. At the begin step, the plugin creates tasks. Then Embulk takes them and runs them in parallel. Once everything completes successfully, the commit stage is called, and this commit stage actually commits the data to the storage. So with this common API, you can create any kind of plugin. If a task fails, commit is not called; instead, the cleanup method is called.

So those tasks are executed in parallel. By default, Embulk uses a plugin called the local thread executor. It uses multiple threads to run tasks in parallel: the tasks are put in a queue so that threads can take those tasks and run them. Then the MapReduce executor uses MapReduce, which is distributed computing, instead of threads. The MapReduce executor also supports partitioning. Say you want to partition data per hour before loading it to HDFS; in this case, the Embulk MapReduce executor takes care of it. So you can sort the data by time and separate the files for each hour, so that you can skip unnecessary files when you analyze the data.
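As a rough sketch of what that looks like in a configuration file, you would add an exec section like the one below. This is based on my reading of the embulk-executor-mapreduce plugin, so take the exact option names as an assumption rather than a reference.

```yaml
exec:
  type: mapreduce               # run tasks on Hadoop instead of local threads
  config_files:                 # assumption: Hadoop client configuration files
    - /etc/hadoop/conf/core-site.xml
    - /etc/hadoop/conf/yarn-site.xml
  partitioning:                 # assumption: time-based partitioning options
    type: timestamp
    column: time                # the timestamp column to partition by
    unit: hour                  # one partition, and one file, per hour
# the usual in: and out: sections follow, unchanged
```

Removing the exec section falls back to the default local thread executor; the in and out sections do not change.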
So, the first release was February 2015, version 0.3. I added resume functionality there. It means that when a single task, or several tasks, fail out of many tasks, you don't want to retry from zero; instead, you want to resume. So I added support for that in the first release.

At 0.4, I added the plugin template generator, which means that embulk new with a plugin name will generate a template for developing a plugin. It's actually an already-working plugin, so you can modify the template to create your new plugin instead of writing it from zero. So it's easier to write. I also added incremental data loading. This means that you run the command once, and on the next run it loads only the new data, so you can schedule it to keep syncing data from one data source to another destination. Yes, and the Liquid template engine was added at 0.6, which works like this: in the YAML configuration file, you can include parameters, so that you can reuse one single file with different values taken from environment variables. This is useful for reusing a configuration across many use cases. And at 0.8, I added the JSON column type, which means that a column can hold JSON as schemaless data. It's similar to PostgreSQL's JSON type support.

In the future, I'm going to add some more things, like error handling plugins. Currently, whatever happens, Embulk skips the broken records, but that's not always the desired behavior. I want to make it customizable, so the idea I have is to provide a new plugin type called an error plugin. I expect error plugins that send a notification email if they find an error, or send a message to chat, or fail if the error ratio exceeds 0.6 and otherwise put the data into another file. Those plugins should become possible in the next release. There are some more interesting hacks, but I think I should take some questions instead. Please ask about those things if you are interested. Thank you very much. Do you have some questions?

Q: embulk guess will generate a table schema. If you have already defined your table schema, in JSON or on the web, how can you get Embulk to use that?

A: So the question is how to reuse an existing schema. If the data source is a database, you don't have to put the schema in the configuration file. If the source has no schema, like CSV, JSON, or MongoDB, Embulk requires that schema. Embulk doesn't have built-in support for reusing an existing schema, so some people created tools to convert those schemas into an Embulk configuration file.

Q: Can it read Excel files?

A: Read Excel files, good question. So there is a website that lists Embulk plugins. Here is the list of plugins; let's search for Excel. Yes, this plugin reads Excel files.

Q: You mentioned idempotency. Is that guaranteed across all the different sources and destinations?

A: So, idempotency is guaranteed within a single output. For example, if the destination is a database, it actually runs a commit to guarantee idempotency. But it doesn't guarantee idempotency between the input and the output: if you change the state of the input as well, it's not totally guaranteed. But most likely it works.

So, thank you very much.