OK, sorry for the delay; we were setting up a Windows environment. This is how I'm going to present, and I hope you can all see this screen. My topic today is building your Spark machine learning job online, like Lego. So what does this mean? We call it Spark Machine Learning Workflow Visualization. That may sound like a mouthful, but as I explain more and more, you're going to learn what each of those words means.

Before I start, I want to introduce a bit about myself and my company. My name is Hu Dawei, or you can call me David, and I work for Qbida as a system engineer. Qbida is an enterprise analytics company; we provide a one-stop enterprise analytics service. It's based on open source, and Hadoop, Hive, and Spark are the mainstream technologies we've been using. It's also data-source and platform agnostic, so you can connect basically any source of data and output it anywhere. These are our main features: we connect your data to Hadoop to leverage the power of Hadoop; from there, we start to discover your data, transform it, and figure out the valuable information in it; and at the end, we output the analyzed report so that you have a nice, intuitive web interface to understand what your data is really about.

All right, that's the intro, so let's jump into the topic: Spark Machine Learning Workflow Visualization. What does this mean? Basically, you can create, modify, and execute a Spark job with simple drag and drop through a website, and it all happens immediately. I will give you a quick demo first so you know how it looks, and then I will explain the details and the technology we've been using to support such a model.

OK, I will show you our cluster. As you can see here, this is a workflow. Everything is linked together in a directed acyclic graph, and the work flows from reading a data frame to eventually applying different machine learning algorithms and getting the result you want. This one I already have ready. You can change it like this, so all these parameters can be changed. You can even see a preview of the intermediate data sets, so after each step you will know what happened to your data and whether the workflow you built does what you intended. This is going to take a while because it's running an actual Spark job, so I will go back to the slides.

This is how the structure looks. As you can see, we read the data frame first, then filter some columns by inputting some conditions; by conditions, I mean you can just put in a SQL script. There is a SQL transformation node that accepts a raw SQL script. Under the hood it is all Spark; we wrap the Spark APIs into this nice, intuitive interface.

OK, here we have a couple of examples of how you can customize your nodes. Take this one, the SQL transformation. It requires two parameters. One is called the data frame alias; here you input "one". This creates a temp view in the Spark cluster, so the second parameter, the SQL script, can reference it from here, and Spark will know how this data frame is connected. And here's another example, the split. The split takes a data frame and splits it into two parts, based on parameters such as whether it's random, the ratio, and the seed.
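To make these two example nodes concrete, here is a minimal sketch of what they map to in plain Spark code, assuming a SparkSession named spark; the path, view name, SQL condition, ratio, and seed are illustrative, not taken from the actual product:

    // Minimal sketch of what the two example nodes map to in plain Spark code.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("node-sketch").getOrCreate()
    val df = spark.read.parquet("/data/transactions") // the "read data frame" node

    // SQL transformation node: the "data frame alias" parameter registers a
    // temp view, and the second parameter is a raw SQL script referencing it.
    df.createOrReplaceTempView("one")
    val transformed = spark.sql("SELECT * FROM one WHERE amount > 0")

    // Split node: one input data frame, two output data frames,
    // controlled by the ratio and seed parameters.
    val Array(train, test) = transformed.randomSplit(Array(0.8, 0.2), seed = 42L)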
The result preview, as I mentioned before, is for the intermediate results, and of course you can also show the final results. Maybe I can show you this part. You can see here we have many other nodes coming in from here; those are called categories, so you can combine different kinds of nodes from different categories to do what you want. If you know Spark, you know that normally you would go and write code against its APIs. Here you can just drag and drop, without writing a single line of code.

How does it work? As you saw previously, when you run this structure, it runs immediately, and even though you trigger it from the web, it actually launches a real Spark job in the back end, on the cluster. Secondly, each node is parsed into a Scala code snippet on the fly, so each node represents a piece of the Scala code; once you change the parameters, the Scala code changes too. Thirdly, the code is sent to the cluster by an HTTP POST request.

Now you might be wondering how this is possible. If you understand how Spark works, you would naturally have doubts that this could work at all. Why? Because currently there are two ways for you to run a Spark job: one is through the Spark shell, the other is spark-submit. The Spark shell lets you open a shell where you write your code in Scala, Python, or R. I don't know how many of you have used Spark hands-on. OK, not so many; I guess I should introduce Spark a bit.

Spark is a computing engine based on Hadoop. Hadoop provides a distributed file system, where you store your data across different servers. It's different from a relational database like Oracle or Postgres; those hold structured data. Normally, your data is not that structured per se: CSV, JSON, even images, that kind of stuff. There is value in it, but not every byte is critical, so you don't want to store it in a database, which is expensive and rigid. A file system is the natural place to put it, and HDFS, the Hadoop Distributed File System, is based on this idea. Now we come to Spark. Spark is a computing engine on top of Hadoop, which means it can read the data from Hadoop and do all the transformations, filtering, and other operations, just like what you would do to data in a database. That's the basic background on Hadoop and Spark.

So, two ways to run a Spark job: the Spark shell and spark-submit. With spark-submit, you need to write a Spark application first, a truly complete Scala app, pack it into a jar, and then submit it with spark-submit. Spark will do all the uploading, figure out the dependencies, et cetera. Here comes the limitation: both of these require you to program manually. It makes sense, right? If you want to use Spark to analyze some data, you need to write your code first and paste it into the Spark shell, or type it manually, or build it as an app and submit it from the cluster. The second path requires cluster access, which is actually quite limiting, because not everyone is an administrator.
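Going back to the node-to-snippet idea mentioned above: a minimal sketch of what "each node is parsed into a Scala code snippet" could look like for a hypothetical filter node with one parameter. The function name and the output_<orderId>_dataframe_1 variable convention follow the naming scheme described later in the talk; everything else is illustrative:

    // Hedged sketch: render one node's configuration into a line of Scala code.
    def filterSnippet(orderId: Int, parentVar: String, condition: String): String =
      s"""val output_${orderId}_dataframe_1 = $parentVar.filter("$condition")"""

    // e.g. filterSnippet(2, "output_1_dataframe_1", "amount > 0") produces:
    // val output_2_dataframe_1 = output_1_dataframe_1.filter("amount > 0")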
Not everyone is familiar with the server environment either, so this really limits the reach of Spark. It makes you wonder: is there a third way, one that gets full use out of Spark and also makes access to it easy, from anywhere?

Now you might be thinking of the third way. Naturally, if you're a developer, you know most request communication happens over HTTP. If only we could just use the good old HTTP request. It's easy to use and standard across devices: you can use your phone, your browser, or a tablet. It's easy to access: you can use any kind of tool, like curl, or software like Postman. And it's lightweight: whenever you want to make a request, you don't really need to add any dependency, because most languages and tools already have networking support built in. These are the good things about HTTP and the web. So what if we could combine Spark with a web interface? That's what Livy does.

Introducing Livy. Livy is an Apache project which is still ongoing, but it already has really nice, rich features you can use. It is a web service that enables easy interaction with a Spark cluster over a web interface. What does this mean? It means that through the API provided by Livy, you can submit your Spark job to the cluster in different ways. One is precompiled jars, which you may already have; that's good. Then there are code snippets, which is what we use. And there is also a Java/Scala client API, where you can write your Spark job embedded within your application and use the API to submit it. You can find all of this on their website; it's very detailed and well organized, so you can read about it later if you're interested.

The second feature is that Livy ensures security through secure, authenticated communication, so you don't need to worry about your data being leaked while it's sent over the network, and it's easy to integrate with the authentication frameworks in the Hadoop ecosystem, like Ranger, or to wire it up with your own service.

And certainly the most important one: most of the APIs are asynchronous requests. What does this mean? It means that once you issue an HTTP request to the cluster, it returns a response immediately, with a body indicating the current status on the cluster. It will not wait for your job to finish before giving you a response. If you think about it, that's quite reasonable: most Spark jobs run from minutes to hours, which is very time consuming. By doing this, you immediately get the response and can figure out the next step to take based on it.

All right, and here is the Livy infrastructure. As you can see, here is the HTTP interface, and here is the client, which you embed and integrate with your application. Sitting between the client and the cluster manager, which is most often YARN, is a REST server. Whenever a request comes in, the REST server sends the code or jar to the cluster manager and starts the new job, and it keeps fetching the latest status from the cluster; that's how you get information out of it. So now we know the infrastructure of Livy.
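Before moving on, here is a minimal sketch of what this HTTP interaction looks like from Scala, using the JDK 11+ HttpClient. The API shapes (POST /sessions with a "kind", POST /sessions/{id}/statements with a "code" field, GET to poll) are as described in the next section, and 8998 is Livy's default port; the host name, session and statement ids, and the example code string are illustrative:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    val client = HttpClient.newHttpClient()

    def postJson(url: String, body: String): String = {
      val req = HttpRequest.newBuilder()
        .uri(URI.create(url))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
      client.send(req, HttpResponse.BodyHandlers.ofString()).body()
    }

    val livy = "http://livy-host:8998" // illustrative host; 8998 is the default port

    // 1. Create a session; "spark" asks for a Scala session (the default).
    //    Livy answers immediately (asynchronous model) with the session id
    //    and a state such as "starting", not a finished result.
    println(postJson(s"$livy/sessions", """{"kind": "spark"}"""))

    // 2. Submit a code statement to session 0 (id taken from the response above).
    println(postJson(s"$livy/sessions/0/statements",
      """{"code": "val df = spark.range(10); df.count()"}"""))

    // 3. Poll the statement status by GET until its state becomes "available".
    val poll = HttpRequest.newBuilder()
      .uri(URI.create(s"$livy/sessions/0/statements/0")).GET().build()
    println(client.send(poll, HttpResponse.BodyHandlers.ofString()).body())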
How do we integrate this with our current design, the machine learning workflow? If you want to integrate with Livy, what you need is to enable the Livy service in your cluster: download Livy and set it up. Once it's set up, you will have a Livy URL and port where you can communicate with it. After that, you can send a code snippet to the Livy REST server with a POST request, and as we discussed, it's asynchronous, so we then poll the status of the latest executing session or statement.

OK, next let's check what the API looks like. It's actually quite simple. There are a lot of APIs, but these two are the most important ones if you want to use Livy. The first one creates a session. It's very simple: you POST to /sessions on the Livy endpoint, passing a body with a "kind" parameter. By default it will be Spark, meaning a Scala session; if that's what you want, you can omit it, but if you say "I want to start the session in Python or R", then you need to specify "pyspark" or "sparkr". All right. Once you post it, it immediately gives you a response whose body contains the ID of the session, which you can use in the second API. If you call that one, it gives you the status of the current running session: at the beginning, right after I create the session, it will be starting; once it's been created, it will show the session is ready, and you can do the following things.

OK, the second one, which our platform relies on heavily, is creating a statement. Similarly, you POST to the statements endpoint, and the body only requires a "code" field, where you pass the code you want to run. It's plain and simple: just a string, a string that represents Scala code. Or if you prefer Python, you can write Python code; but I'm good with Scala, so I chose Scala. Now, just like the first API, when you POST it returns a statement ID, which you can use with the same RESTful API to get the status of the code you last sent.

Yeah, so given all that, how Livy works and how our workflow works, you might have a clue how these things get combined. So let's review the whole picture here. This workflow is composed of different types of nodes. You can think of a node as a configuration that eventually produces a piece of Scala code. But how do we get from a visualized node to actual Scala code? If we review what we saw earlier, we can figure out the relationships between all the structures, and from there we can start to design our data structures.

As you saw earlier, a workflow consists of a sorted node sequence and metadata. The metadata is things like a description, the ID, or a name, just extra information; it doesn't really matter when you run the Scala code. A node has certain inputs and outputs with different types. Go back here and you can see. Let's take the filter-columns node as an example. It has an input; it doesn't say so explicitly, but the input is a data frame. We use different colors to indicate different types. A data frame, you can think of it as an abstracted table loaded into memory, gathering the data coming from Hadoop; it's the main abstraction on which Spark performs transformations and actions. So this node accepts an input, a data frame from a read-data-frame node.
And it outputs two things. One is also a data frame: I want a sub data frame, I do some filtering, then I output the result. Makes sense, right? It also outputs a transformer. What is a transformer? It's a reusable structure that you can apply to another data frame. Let's say this configuration, this filter-columns node, is so powerful, so useful, that I want to reuse it everywhere, but I don't want to create the same node all over the place; I don't want to duplicate it. Instead, you can connect the transformer to a different data frame and apply it with a transform node, like this one. That's how you reuse it.

Enough of those details. This is a workflow, and these are the nodes. Sorry, one more thing. We've been talking about two output types here, data frame and transformer, but for machine learning you also have trained models, like the fit node produces: once you feed it a data set, it produces a model. And there are others, like this one, a metric, which is just a single value that indicates how good your model is. All right?

The child nodes need to know the parent node's output type and position. What does this mean? Go back to the split example. It accepts a data frame from here and splits it into two. Now we have a node that produces two data frames, so a child node not only needs to know the parent node's output type, which is data frame, but also the position of the output, so they don't get confused. With this, you know: I need the parent with this ID, the type is data frame, and the output ID is two, say, because there are outputs one and two here. That way you can identify exactly which parent output feeds the child node.

OK. Each node needs a code generator. What does this mean? As you saw earlier, each node, how do I put this, based on its purpose, does all sorts of things: transformations, machine learning with different kinds of algorithms, and outputting the data to different places like Hive, HDFS, or S3. All sorts of purposes, so naturally they require different parameters. So how do we combine the code and the parameters to generate valid Scala code?

And here is our data structure design. Before I introduce it: anyone here know Scala? OK, not so many. All right, I think it's still OK to show how it works. A case class is a kind of class that only contains data; you can think of it that way, as a data structure, like a struct in C or elsewhere. Java doesn't have this, but Kotlin has data classes, so you know what it is, right? So we have a case class for the workflow: an ID, a name, and a list of nodes. You can see the type here is Vector. A Vector in Scala is an immutable array, you can think of it that way; once it's been created, you can never change it. And each of the nodes has this structure: it has an order ID, which indicates which one comes first and which comes second, and it has a name and a category. You can see one more thing here: we have different categories with different names, so the category plus the name identifies the actual node, and with those we can locate a node precisely in our code. OK, on to the code generator design.
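Here is a hedged reconstruction of the data structure design just described. The field names are paraphrased from the talk (order ID, name, category, outputs, parent outputs with type and position, plus the user-entered parameters), not copied from the real code:

    // Which parent output feeds this node: the parent's order ID, the position
    // of that output on the parent, and its type (e.g. "dataframe").
    case class ParentOutput(parentOrderId: Int, outputId: Int, outputType: String)

    case class Node(
      orderId: Int,                         // position in the sorted node sequence
      name: String,                         // concrete node, e.g. "sql_transformation"
      category: String,                     // palette category, e.g. "transformation_custom"
      outputs: Vector[String],              // output types, e.g. Vector("dataframe")
      parentOutputs: Vector[ParentOutput],  // links back to the parents
      params: Map[String, String]           // user-entered configuration
    )

    case class Workflow(id: String, name: String, nodes: Vector[Node])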
In our case, we call it a statement, similar to the "create statement" you saw in the Livy API, but ignore the name: you can think of it as a builder that takes a node as a parameter and outputs a string, which is valid Scala code. And because we have all these different nodes, we want a common interface to ensure all the nodes have this similar behavior: as long as a node implements this statement, it has to implement this snippet, so it will give out a code string. Here we have parent output types, self output types, param keys, and the snippet. The first three actually serve a validation role, because we read the data from the database and convert it into this kind of structure, so in the statement, in the builder, we want to validate whether the data is valid or not. Why? Because when we collect the configurations into the DB (we've been using MongoDB), we cannot really guarantee that the data input by the user is valid, so we double-check here. The last one is the one that matters, and it produces the actual code.

All right, before we leave this structure, yes? Audience: You have Vectors of strings, not Vectors of actual Spark types. You mean the output types, here? I call them Spark types. Actually, that's a very good catch. Right before this presentation, when I took this snapshot, I noticed that String is a bit too flexible. Is that your concern? Yes. So, a better refactoring would be to use objects; an object in Scala is quite like a class, but a singleton, and it's type-guaranteed. We could rely on the strong type system in Scala instead of these strings. String is too flexible, even though we do check whether it says data frame or something else. But yeah, that's a very good catch, thank you.

Audience: On the design, when you need to instantiate the different statements, instead of a separate class file with a huge statement for each kind of thing, it would be better to have a sealed trait: you implement every statement inside the same class file, and you prevent problems when you do a lot of matching, because the compiler can check the match is exhaustive. OK, I see. So your question is, based on the type, the category, and the name, how do we figure out which builder to use? Is that your question? Audience: For the statement class, why is it not a trait? Sorry? Audience: The statement class you wrote could be converted to a trait, and with a sealed trait the ordering of the code is better checked, because you must handle every case. You mean why it's abstract? Audience: Why not use a type-safe approach for these things? You are using the statement as the base for producing the code that will be sent to the cluster, so the most important part is for the statement to be as type-safe as possible. One problem is the param keys; you could use, for example, Enumeratum for a better, type-safe approach to the keys. Right, so the suggestion is a sealed trait Statement, where you define all the statements inside the same Scala file.
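For reference, a hedged sketch of the statement-as-builder abstraction being discussed, building on the Node case class sketched earlier; member names follow the talk's description, the rest is illustrative:

    // Common interface every node-specific code generator implements.
    // Declared as defs here; subclasses may implement them with vals,
    // a Scala feature the talk comes back to later.
    abstract class Statement(node: Node) {
      // expected types of the parent outputs, used to validate the stored node
      def parentOutputTypes: Vector[String]
      // the types this node itself produces
      def selfOutputTypes: Vector[String]
      // the configuration keys this node requires
      def paramKeys: Vector[String]
      // the one that matters: render this node into a valid Scala code string
      def snippet: String
    }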
OK, actually, I'm only showing part of the statement here. There is also a companion object, but the screen is too small, and I think it's not relevant. Maybe I'll finish this part first; actually, I'm not quite sure what your question is, so let me finish mine and we can discuss this design approach more afterwards. So here is the abstract class Statement, which is a builder: for different kinds of nodes, you have different kinds of builders inheriting from it.

OK, let me show you the code first. Previously you saw the data structure. Now you're wondering: once I get a node from the database, how do I find the right code generator, the right builder? One way to do it, because we have the name and category, is to write a big match-case, but that is going to keep growing as long as more nodes get added. Here you can see we are instead using reflection, which is a Java concept where you dynamically find the class at runtime. How do we do that? You can see here: based on the category and name, we derive the class path for that builder, and here we have the concrete statement. The following steps are to find that statement class and its constructor, and eventually create a new instance of that statement. You can see here I pass "this", which is the node itself; and you can also check, if you remember, that the abstract class Statement takes a node as a parameter. So in the node, I pass myself into the statement we just located from the name and category. Now we have a statement instance, the statement instance calls snippet, and it immediately returns the code. Make sense? All right.

OK, here is an example of how this code generation goes from raw data in the database to real, valid Scala code. Here I have a node, node number three; order ID 3, as you can see. It has the category transformation_custom, and we convert the underscore into a dot to match the Java package convention. Name and description you can ignore; object, just an object. Output is an array: this means this SQL transformation node outputs a data frame, as you can see here, which itself is a string. Then parent outputs: one node can have one, more, or no parents, so this is also a list of objects. The object structure is ID; parent ID, which is actually the order ID of the parent; and output ID, indicating the position of the parent output, like the example I told you about with the two outputs of the same data frame type, where you need to figure out which one is the correct one. So the output type is data frame. This is the raw data stored in Mongo; we've been using Mongo because it's quite convenient for this kind of thing.

Here is the SQL transformation statement, just for that node. As you can see, because it inherits Statement, we need to override those methods to make this node work. A cool thing about Scala is that when you define a method in an abstract class or a trait, in the subclass you can actually implement it not as a method but as a val. That saves you from repeated computation: whenever I need this, I don't need to recompute it; it's computed once and cached, and next time it just reads the value directly. And here's the snippet, the most important one. This is a template.
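Here is a hedged sketch of the reflection lookup just described, reusing the Node and Statement sketches from above. The package prefix is invented, and the real code would also map the stored snake_case name onto the CamelCase class name; this is simplified for illustration:

    // Locate the right builder at runtime from the category and name in Mongo.
    def buildSnippet(node: Node): String = {
      // e.g. category "transformation_custom" + name "SqlTransformation"
      // -> "com.example.generator.transformation.custom.SqlTransformation"
      val classPath =
        "com.example.generator." + node.category.replace('_', '.') + "." + node.name
      val clazz = Class.forName(classPath)
      // the Statement constructor takes the node itself as its parameter
      val ctor = clazz.getConstructor(classOf[Node])
      val statement = ctor.newInstance(node).asInstanceOf[Statement]
      statement.snippet // render the node into a Scala code string
    }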
So based on all those parameters, I take this parent data frame and the node's own output and produce code which is valid Scala. To answer your earlier question about the string part, here you can see. Yes? Of course. Sorry? The type. Yes. OK, so yes, Scala lets you define function types, of course. The reason I designed it like this is that in Scala you can go really deep and use all these nice features, but for me, I think this is the most understandable way to implement it. Forget that it's Scala code: even if you are a Java or Python or Ruby developer, you can implement this same idea elsewhere. Just because you can define a function as a type in Scala doesn't mean you should; I don't think you should use that just to show how clever the code is. For me, readability is the most important thing. That's my approach, and that's why I chose this one over that one. OK, these are all deep technical points, so if you have questions, we can discuss later.

Here is a simple example; the previous one was quite complex, and the final generated Scala application can run to hundreds of lines, so I created a small one. What it does is quite simple: read a data frame, do some transformation, and write into a Hive table; the write-data-frame node at the end goes to a Hive table. Here is the generated Scala code. As you can see, if I pasted all of this code directly into the Spark shell, it would run perfectly, because what Livy maintains, you can think of it as an online version of the Spark shell. You can see the names I define: this is the first node, the second node, and the third node. By giving the variables a standard format, I can reference parent nodes easily: all the data frame outputs have the same structure, starting with "output", then the order ID indicating the node, then the type, then the output position. So in the second node, as long as I know the parent's type, ID, and position, I can immediately derive where the parent node's output can be found. Here I read the data frame, select all from the transactions, find all the CDA rows, and write them into CDA transactions. That simple.

OK. Not all the ideas in this project came from us. There was an open source project called Seahorse that does similar things to what we did. We had been working in parallel, so we didn't know about each other, but once we found out, we were very excited, and we went to their website and checked out their repo. Same situation. What they did actually goes really, really deep, down into the JVM and the Hadoop platform; they redefined a lot of things. If you go to the project, it ran for almost two years with an enormous amount of manpower to develop. We wanted the same feature, and by studying Livy, I figured out we could do it in a much simpler way: we manually hand-code a template for every operation, so you can put them together however you want. If you want to see the full-featured version, you can go here and check it out. What we did is similar, but our way only cost about three months. The back end was all on me; the design and the back-end implementation took about two months.
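Returning to the small generated program described above, here is a hedged reconstruction of roughly what it looks like. The variable names follow the output_<orderId>_<type>_<position> convention from the talk; the table names, the view name, and the filter value are invented, and spark is the session a Livy Scala session provides:

    // Generated program: read a data frame, SQL transformation, write to Hive.
    val output_1_dataframe_1 = spark.read.table("transactions")

    // SQL transformation node: alias the parent as a temp view, run the script.
    output_1_dataframe_1.createOrReplaceTempView("transactions_view")
    val output_2_dataframe_1 =
      spark.sql("SELECT * FROM transactions_view WHERE currency = 'CDA'")

    // Write-data-frame node: persist the result into a Hive table.
    output_2_dataframe_1.write.mode("overwrite").saveAsTable("cda_transactions")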
So comparing one or two months to a full two years, I think that's quite an efficient approach, thanks to Livy. OK, takeaways. What did we learn from this project? First of all, the really impressive one is Livy. I think it opens up a lot of possibilities for how you run a Spark job. It really unleashes the power: previously you could only log into the server, write your code, and submit it; with Livy it's much more flexible and more accessible.

Second, during our implementation, because we needed to wrap most of the operations over the Spark APIs, we found that most of what you need has already been considered in the Spark API. The most impressive thing is how well designed and how performant it is, so every time you want to implement something of your own for your data, think twice and check whether Spark already has it. Go to the documentation and have a look; 99% of the time there is already something there.

And thirdly, immutable data structures, which is not a common concept outside of Scala, outside the functional programming world, but it's very important to know. Once you force yourself to think of data as immutable transformations, instead of "get the data, then modify it", your whole idea of programming can change. That's why, in the code I showed you, I only use immutable collections like Vector, Map, or Set, structures that once created can never change, and all my code uses val instead of var. I think that really helps guarantee that the generated Scala code has very few bugs. The generated Scala you see here is actually quite small; some of them can be close to hundreds of lines. Take one, for example: we have a node called handle missing values. It's a common operation in statistics: if you have some rows with null values and you want to do something about it, you can replace them with a custom value, or with the mode of the column, or the median of the column, that kind of thing. For each different strategy, you need to calculate the value to fill, and it takes a lot of effort to build that. This one is quite simple, but it serves the demo purpose.

All right, here are all the references I mentioned: the Apache Livy project, deepsense.ai and Seahorse, and a previous talk by our CEO, Britain; if you're interested, you can go watch it. All right, that's all. Thank you. Any questions?

Which part? Oh, this one. OK, yes, of course. Here are lots of methods. That's the statement; those are the APIs, the methods each child builder needs to implement. But there are also a lot of helpers, and this instance method, maybe I can show you right now. Is it taking a single string? Yes. OK, here it is; this one takes a single string, and it returns the sample. The param keys are just a list of strings, which guarantee that I need these two parameter keys: one is the alias value and one is the expression. And by doing that, here we have a map get. This get accepts a string, and using this string, it finds the value from the database. This get is actually a method defined in the statement. OK, here is some Scala code.
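What follows is a hedged reconstruction of roughly what is being shown on screen here, building on the Node and Statement sketches above; the key names and the SqlTransformation shape are illustrative:

    class SqlTransformation(node: Node) extends Statement(node) {
      // the first three serve as validation against what is stored in Mongo
      val parentOutputTypes: Vector[String] = Vector("dataframe")
      val selfOutputTypes: Vector[String]   = Vector("dataframe")
      val paramKeys: Vector[String]         = Vector("dataframe_alias", "sql_script")

      // the get helper being discussed: look a key up in the node's stored params
      private def get(key: String): String = node.params(key)

      // variable names follow the output_<orderId>_<type>_<position> convention
      private val parent = node.parentOutputs.head
      private val parentDf =
        s"output_${parent.parentOrderId}_${parent.outputType}_${parent.outputId}"
      private val selfDf = s"output_${node.orderId}_dataframe_1"

      // a def in the base class, a val here: computed once and cached
      val snippet: String =
        s"""$parentDf.createOrReplaceTempView("${get("dataframe_alias")}")
           |val $selfDf = spark.sql("${get("sql_script")}")""".stripMargin
    }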
I'm not sure you can follow, but I will try to explain. This get accepts a string, goes to the node's parameters, and gets the value. That simple: it returns the stored value. OK, sorry, next question.

Q: In your picture, you have a graph of operations. It looks like you could send nodes that have no relation to each other. A: That's an illusion. It looks like you could, but we actually have an order defined between them, so in reality it's still linear processing, node by node. Q: And you define the order manually? A: Actually, for that part we have two separate applications. This one is a back-end service; for defining the order and collecting the configuration, we have another Node.js app that does all that. So writing the configuration to the database and reading from it are separated into two different apps. That's the structure we have. Q: So the order is defined by the front end, by some logic there, or by user input? Does the user have to put a number on each node? A: No, no need. We have an algorithm that calculates which node should go first. Every time you add or remove a node, the order is regenerated, so you always get the correct one.

Q: If I have a whole pipeline defined, can you generate it as one pipeline instead of going one node at a time? Since you have a code generator, you could easily just generate a bigger piece of code. A: Right now it is one. I'm not sure it matters, because we generate the Scala code and send that Scala code to Spark. Q: But it's represented as a graph, and the steps in the graph can be aggregated by Spark directly. If you have simple modifications, such as fields being remapped, they are executed in the same stage inside Spark, so it's not required to aggregate them yourself. I thought each of those blocks, each node, was a separate Spark job. A: Well, it generates a single one. Yes, every node generates some piece of code, and together they produce one big file; that is your Spark job, compiled as one program.

Q: Since it runs remotely, how do you capture the errors, in any event you control? A: Actually, that feature is already provided by Livy. Whenever you make the GET status request, you can check certain keys in the response body, like status, log, and error message. So if the status says the job failed, you already know something is wrong, and you go to the log. It actually has pretty nicely structured log information; you can go there and check. Q: Only by polling? A: Yes, but since it's just an endpoint with a URL and port, you don't even need your own code to make the request. You can use existing tools to get the results, even a browser or an HTTP client like curl or Postman; you can get the log that way. All right, any more questions? OK, then that's it. All right, thank you.
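As an endnote to the Q&A above: the talk mentions an algorithm that recomputes the node order whenever a node is added or removed. A minimal sketch of such an ordering over the parent links, in the immutable, val-only style the talk advocates, using the Node shape sketched earlier; the real algorithm is not shown in the talk:

    // Topological ordering: repeatedly take the nodes whose parents are all
    // already ordered. Assumes the graph is a DAG, as the editor enforces.
    def topologicalOrder(nodes: Vector[Node]): Vector[Node] = {
      @annotation.tailrec
      def loop(remaining: Vector[Node], done: Vector[Node]): Vector[Node] =
        if (remaining.isEmpty) done
        else {
          val (ready, rest) = remaining.partition(n =>
            n.parentOutputs.forall(p => done.exists(_.orderId == p.parentOrderId)))
          require(ready.nonEmpty, "cycle detected: the editor should enforce a DAG")
          loop(rest, done ++ ready)
        }
      loop(nodes, Vector.empty)
    }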