Good evening everyone. My name is Wang Wei, and this is my colleague. It's a pleasure to show our project SINGA here at PyData; it's a good place to discuss some popular topics in data science. This project was started about one and a half years ago. We went into the Apache Incubator last year, we have released three versions, and now we have come to version 1.0. So today we're going to discuss deep learning and training systems, based on our version 1.0. I don't know how many of you already know deep learning. Deep learning is very effective for a wide range of applications, including computer vision, for example image classification and video recognition. In 2012, in the ImageNet competition, deep learning, especially the deep convolutional neural network, showed outstanding performance; it was much more accurate than the traditional algorithms in computer vision. After that, many people started working on deep learning and this area, and later convolutional neural networks and other deep learning models were applied to other areas outside computer vision, including NLP, natural language processing, for example machine translation and language modeling, as well as speech recognition and games, for example AlphaGo, which is very popular. But we are from a database group; we are not machine learning or computer vision people, so we don't try to improve the accuracy or propose more models for different applications. Our consideration is that we can say more about the training, because deep learning is big in terms of training data size.
For example, the ImageNet dataset has one million images, and the speech recognition models trained by Baidu and Google use very large audio datasets. Another aspect is that the model parameter size is very large. Once the model is very large, every training iteration costs a long time, so the training takes days to complete, even on GPUs. So our motivation for starting this project was to improve the efficiency of these systems, to make them run faster and complete the training faster. On the other hand, deep learning models are typically very difficult to understand: they have many layers, many different types of layers, and many hyper-parameters to tune, so it requires a lot of experience. Sometimes people in our lab actually stop training, because they find it very hard to tune the hyper-parameters; there are many of them to tune. Finally, one more aspect is memory. Besides speed, memory is another big problem, especially for large models. Although the model parameters alone might be hundreds of megabytes, if you also consider the activations, I mean the features of every layer, the gradients of each layer, and the dataset, it would easily use up all the memory of a GPU card. So these are the motivations of this project. Today my talk includes three parts. Since some of you may not know deep learning, I will first give a brief introduction. Then we will discuss some optimization techniques used in existing systems, from our database perspective. Finally comes the design and some examples of the new version of SINGA. So here it just shows that deep learning is very popular.
It is very hot and very effective, so many companies are working on it. It works on a wide range of data types: audio, image, and text. I will introduce a couple of deep learning models. The first type is called feed-forward neural nets, and I think it is the most popular and most widely used deep learning model. Here is a very simple, basic model; it was proposed, I think, 20 to 30 years ago and is called the multilayer perceptron. It generally has multiple layers: the input layer, the hidden layers (there could be multiple hidden layers), and finally the output layer, which gives the prediction. This model works in this way: the input data is forward-propagated to the output layer. For example, in this simple example the input feature is x1 to xn, and W is the weight matrix, so each layer does a linear transformation of the input feature, typically followed by some nonlinear transformation. Finally, you convert the feature into a set of probabilities, one probability per label, so you have a probability distribution over all the possible labels, and you select the top-five or top-one label as the prediction result. Another model that is even more popular is the convolutional neural network. It is a very effective tool to capture the local relationships of pixels, for example the pixels in images. In 2012, this model won first prize in the ImageNet competition. The convolutional neural network is a little bit similar to the multilayer perceptron in the sense that it is also a feed-forward, acyclic directed graph.
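Before going into the convolutional network, the multilayer perceptron forward pass just described, a linear transformation per layer followed by a nonlinearity, ending in a softmax over the labels, can be sketched in NumPy (a toy illustration, not SINGA code; the layer sizes and names are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x, weights, biases):
    """Forward pass of a multilayer perceptron: each hidden layer applies a
    linear transformation followed by a nonlinearity; the output layer
    produces a probability distribution over the labels."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)              # linear transform + nonlinearity
    logits = weights[-1] @ h + biases[-1]   # final linear transform
    return softmax(logits)                  # one probability per label

# Tiny example: 4 input features, one hidden layer of 5 units, 3 labels.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
biases = [np.zeros(5), np.zeros(3)]
probs = mlp_forward(rng.normal(size=4), weights, biases)
```

The prediction is then just `probs.argmax()`, the top-one label.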
It has multiple layers: the input, the convolution layers, sampling layers, more convolution layers, and so on, so it is also a directed acyclic graph, but the transformation is different from the multilayer perceptron. This transformation uses the convolution operation, as shown in this figure. The convolution operation uses a filter, also called a kernel; for images it is typically a square matrix, and it slides through the whole image, doing the convolution to transform the features. Another kind of deep learning model is the recurrent neural net. It is typically used for time-series or sequential data, for example machine translation or speech recognition. The data comes in sequence, so it models the sequential relationships of the data. Typically, when we train this model, we unroll it. This is the original model; it has a cycle, so after the first transformation the hidden feature is fed into the same unit again. But if we unroll the model, then we get a feed-forward model with no cycles.
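Going back to the convolution operation for a moment, the sliding-window computation can be sketched like this (a toy single-channel example with stride 1 and no padding; real implementations such as cuDNN are far more optimized):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a square filter (kernel) over the image and compute the dot
    product at each position: a "valid" convolution, stride 1, no padding."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the filter with the current window.
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2)) / 4.0        # a 2x2 averaging filter
feature_map = conv2d(image, kernel)   # output shape (3, 3)
```

Each output value summarizes one local patch of pixels, which is how the convolution captures local relationships.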
The unrolled model is exactly acyclic. There are other models as well, for example the energy models: the RBM, the restricted Boltzmann machine, and the DBM, the deep Boltzmann machine. These models were very popular from about 2006 until 2012. And recently, reinforcement learning has been combined with convolutional neural networks and other deep learning models, for example in AlphaGo, to solve other application problems. In the following slides I will use the feed-forward neural nets as the example to describe some concepts. As I said, we care more about the training, the time and the memory cost of the training. How to train a deep learning model can be formalized in this way: in f(x), x is the input, for example the pixels of an image, and y is the ground-truth label of that image. The deep learning model, the feed-forward model, just transforms the input x into another feature representation f(x); this final representation could simply be a probability distribution, a vector of probabilities. L is called the loss function; it measures the discrepancy between the prediction and the ground truth. Theta here denotes the model parameters, that is, the parameters in every layer; it is the collection of all the parameters. Given a training dataset, the objective is to minimize the average loss by changing the model parameters, so the training algorithm just finds an optimal set of parameters that predicts the ground-truth labels accurately. In this figure, the input label is the ground-truth label; it is inside the training dataset, together with the image.
The image is transformed into the prediction; this is called the forward pass, and the training also does a backward pass. So the training works like this. Typically, deep learning models are trained using the stochastic gradient descent (SGD) algorithm; there are some other variants of this algorithm. The original SGD algorithm works in this way: the model parameters are initialized in some way, for example following a Gaussian or uniform distribution, and then we read the data in a mini-batch manner, so the big dataset is partitioned into small mini-batches and fed into the model. The model then computes the gradients of the parameters with respect to the loss function, and once we have the gradients we can update the model parameters. This is a first-order gradient method: we update the model parameters in the direction that makes the loss smaller in the next iteration. We said that the training algorithm spends a lot of time because of the operations inside, for example, the convolution layers, or, in the MLP, the fully connected (dense) layers. There are many large matrix multiplications, and this kind of matrix multiplication is time-consuming, so in the end, training for one iteration of the whole model takes a long time.
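The mini-batch SGD procedure just described can be sketched as follows (a toy least-squares example rather than a neural network; all the names are made up for illustration):

```python
import numpy as np

def sgd_train(X, y, lr=0.1, epochs=50, batch_size=4, seed=0):
    """Mini-batch SGD for least-squares regression: partition the data into
    mini-batches, compute the gradient of the loss w.r.t. the parameters on
    each batch, and step against the gradient to reduce the loss."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=X.shape[1])       # random initialization
    for _ in range(epochs):                   # one epoch = one pass over the data
        order = rng.permutation(len(X))       # shuffle before partitioning
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            # First-order gradient of the mean squared loss on this batch.
            grad = 2 * xb.T @ (xb @ theta - yb) / len(xb)
            theta -= lr * grad                # update against the gradient
    return theta

# Recover a known linear model from noise-free data.
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
true_theta = np.array([1.0, -2.0, 0.5])
theta = sgd_train(X, X @ true_theta)
```

A neural network replaces the closed-form gradient with the backward pass, but the loop structure is the same.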
One SGD iteration for the whole model is time-consuming. This figure shows one example for the multilayer perceptron: the input data is fed through all the layers to compute the prediction, and then backwards to compute the gradient of each parameter inside each layer; these gradients are then used to update the model parameters. This is the general idea of the SGD algorithm for training deep learning models. We said the training procedure is time-consuming, and here are some statistics for some popular convolutional networks for image classification applications. A training epoch means scanning the whole dataset once. Typically, all of these models require passing over the dataset many times, and the size of the model parameters is not small either. OK, so these are the basics of training deep learning models. Next we will discuss some optimizations in terms of speed and efficiency for training deep learning models. A deep learning model consists of a set of layers, so we represent a model as a set of connected layers, but during training there are more fine-grained concepts: the training consists of a lot of small operations. These operations could be like this: for example, the input x, or some intermediate layer feature x, could be passed through the sigmoid function; x could be a matrix or a vector, and this x could then be used in different upper layers. There are matrix multiplications and other linear algebra operations. One opportunity for improving the speed of these operations is to parallelize them, to run them in parallel. For example, in this example here, there are two paths.
We can actually run them in parallel, because this operation and this operation have no data dependency. Once this operation is finished, we can run this one and this one in parallel. This is operation scheduling: if we can schedule operations which have no data dependency to run in parallel, then we may save time. This technique is used by some existing systems, like Theano, TensorFlow, and MXNet. Theano and TensorFlow are two popular deep learning software packages; they build the deep learning model statically, in the sense that the users specify each layer and connect the layers, and then the software generates a graph. Inside this graph, each node is an operation like this one, and the operations are connected based on their data dependencies; for example, this operation has a data dependency on that operation. Once this graph is constructed, they can analyze it to do the operation scheduling, finding paths like these and running them in parallel. MXNet, on the other hand, detects the data dependencies at runtime. All the operations are submitted to a coordinator, something called an engine; this engine receives the operations, detects the dependencies in real time, and then schedules these operations onto different execution units. We think there is still room for improvement in this operation scheduling. For example, most people use GPUs to train deep learning models, and inside one GPU there are multiple CUDA streams and kernels.
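The dependency-based operation scheduling described above can be sketched with a toy operation graph (illustration only; real engines such as MXNet's dependency engine are far more sophisticated):

```python
from concurrent.futures import ThreadPoolExecutor

# A tiny dependency graph: op "a" feeds two independent ops "b" and "c",
# which can therefore run in parallel, and "d" consumes both results.
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
ops = {
    "a": lambda: 2,
    "b": lambda x: x + 1,
    "c": lambda x: x * 10,
    "d": lambda x, y: x + y,
}

def schedule(graph, ops):
    """Run each operation as soon as all operations it depends on have
    finished; operations with no mutual dependency run concurrently."""
    results = {}
    with ThreadPoolExecutor() as pool:
        remaining = dict(graph)
        while remaining:
            # Operations whose dependencies are all satisfied are "ready".
            ready = [n for n, deps in remaining.items()
                     if all(d in results for d in deps)]
            futures = {n: pool.submit(ops[n], *(results[d] for d in graph[n]))
                       for n in ready}
            for n, f in futures.items():
                results[n] = f.result()
                del remaining[n]
    return results

results = schedule(graph, ops)   # d = (2 + 1) + (2 * 10) = 23
```

Here "b" and "c" are submitted in the same round, which is exactly the parallelism opportunity the talk points at.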
If we need to place these operations onto multiple CUDA streams, how to place them to maximize the throughput is a problem, and there may be opportunities to further improve the efficiency. As a runtime optimization, we can collect some statistics, for example the running time of each operation. Actually, in database systems we know that SQL queries are highly optimized using statistics and cost models, so similarly we could estimate the execution time of each schedule and then select the best one to do the real execution. As for hardware: typically the hardware we use is the GPU, especially NVIDIA GPUs, because of the cuDNN library, which can speed up the training of convolution layers and other layers by about five to ten times compared with self-implemented kernels, and maybe more than ten times faster than CPU implementations. Recently, some people have said that FPGAs may have the potential to speed up training in the sense of power consumption: it would be more power-efficient to run these models on FPGAs than on GPUs, and FPGAs could be used in small devices, in mobile devices. Intel is also trying to speed up machine learning algorithms using their own hardware. And OpenCL is a programming language.
OpenCL works on a wide range of hardware, including NVIDIA GPUs, AMD GPUs, Intel Xeon CPUs, and FPGAs. There are also some deep learning chips and instruction sets. Generally, many people are working on hardware optimization, especially for deep learning models. In addition to operation scheduling and hardware optimization, some people are working on distributed training, in the sense of using a cluster of machines or multiple GPU cards to speed up the training. For this, there are two types of parallelism for training deep learning models: one is called model parallelism and the other is called data parallelism. For example, this is the unrolled version of an RNN model; as you remember, the RNN model can be unrolled into a feed-forward neural network. Here we have two stacks of RNN layers: this is the first stack and this is the second one. The four units at the bottom correspond to four inputs, so these four inputs can be fed into the model at the same time, but there are data dependencies between the RNN units: this one has to be done first, and then this one; these two can run at the same time, and after these two finish, those two can run. So generally we can run this one and this one in parallel, this one and this one in parallel, and so on. This way, we can separate the two stacks of layers onto two workers, where one worker could be one GPU. But there will be data transfer between the workers if we partition the model across workers and run them at the same time; we will discuss the communication overhead later. This next one is data parallelism.
It is much easier compared with model parallelism. You just partition the whole training dataset into small partitions at the beginning, create the neural network structure in advance, and replicate this network structure across multiple nodes; this is called a model replica. Each node, or each worker, has the same set of model parameters. At every SGD iteration, each worker reads its own partition of the dataset and computes the gradients on that data; the gradients are then sent back to a centralized parameter server (there could be multiple server nodes), where they are averaged and then used to update the model parameters. The updated model parameters are then sent back to each worker. This is called synchronous distributed training. Another version is called asynchronous training: the gradients are sent to the server separately, and once the server receives one set of parameter gradients, it updates the parameters immediately, so the workers run asynchronously. Again, there is overhead from the communication and synchronization between the workers and the server and among the workers. As I said, you can partition the dataset evenly, without overlapping: for example, if you have one thousand images, you can partition them across 10 machines, and each machine will have 100 images. We mentioned that there is some overhead from distributed training. The first one is communication. We saw that model parallelism works for this model, because this model has a specific pattern: these layers can run in parallel. But some models, like this one, have only a single path.
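The synchronous data-parallel update described above can be simulated on a single machine like this (a toy least-squares sketch; `worker_gradient` and `synchronous_step` are made-up names standing in for the worker and server roles):

```python
import numpy as np

def worker_gradient(theta, X_part, y_part):
    """Each worker computes the gradient on its own data partition only."""
    return 2 * X_part.T @ (X_part @ theta - y_part) / len(X_part)

def synchronous_step(theta, partitions, lr=0.1):
    """One synchronous step: every worker sends its gradient to the server,
    the server averages them, updates the parameters, and broadcasts the
    new parameters back to all workers."""
    grads = [worker_gradient(theta, X, y) for X, y in partitions]
    avg_grad = np.mean(grads, axis=0)
    return theta - lr * avg_grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 2.0])
# Partition the data evenly, without overlap, across 4 workers.
partitions = [(X[i::4], y[i::4]) for i in range(4)]
theta = np.zeros(3)
for _ in range(200):
    theta = synchronous_step(theta, partitions)
```

With equal-sized partitions, the averaged gradient equals the full-data gradient, which is why synchronous training matches single-worker SGD step for step.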
It is not like this one, which has two paths you can run in parallel. For this one, if you partition the model into two parts, for example this is the original model and you partition this layer into two partitions, then there will be communication like this. It is fully connected, so there will be a lot of communication overhead; for this one the communication only happens here, but for that one there are four places. For data parallelism, if the model size is very large, it means that every time each worker will transfer a large set of gradients to the server, and the server will send back a large set of parameters. In terms of the communication challenges, the optimization techniques include compression. This is a direct solution: you compress the parameters or the gradients when you transfer them. Also, hardware like InfiniBand can be used, and hybrid parallelism is used as well. Hybrid parallelism means that for some layers you do model parallelism and for other layers you do data parallelism, so it can still save some cost, but it is much more complex. The synchronization overhead is another problem. For synchronous training, if you have many workers, the server has to wait for the gradients from all the workers, so the synchronization among all the workers would be a problem; typically we would use only two to three GPU cards on a single server.
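The gradient compression mentioned above can be illustrated with one common scheme, top-k sparsification, where only the largest-magnitude entries are transferred (the talk does not specify which compression scheme is used; this is just an example):

```python
import numpy as np

def topk_compress(grad, k):
    """Keep only the k largest-magnitude gradient entries; the worker sends
    their indices and values instead of the full dense vector."""
    idx = np.argsort(np.abs(grad))[-k:]
    return idx, grad[idx]

def topk_decompress(idx, values, size):
    """The server reconstructs a dense gradient with zeros elsewhere."""
    dense = np.zeros(size)
    dense[idx] = values
    return dense

grad = np.array([0.01, -3.0, 0.002, 2.5, -0.04])
idx, vals = topk_compress(grad, k=2)             # transfer 2 entries, not 5
restored = topk_decompress(idx, vals, grad.size)
```

The transfer size drops from the full parameter count to k indices plus k values, at the cost of dropping the small gradient entries.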
So Actually the workers run in asynchronous asynchronously, but Since they run different workers would have different running speed one worker would finish very fast and so Maybe worker A is on is working on the Five fifth iteration and the worker B is working on the 10th iteration So this this difference between the iterations is called a stillness So once the stillness is very large It means the difference the workers are working on totally different versions of the parameters So the training convergence would be a problem. Sometimes it may not even converge. So and it means that finally the model Diverge so the training diverge. So you cannot get a stable version of the model is also hybrid solution Some one is called the stillness bounded asynchronous training So typically once you found that once you find that some workers are very fast. It is five iterations Faster than other workers then you stop this worker and until other workers catch up Another one is called the training with backup workers. It means that So it is synchronous training But a lot For example for you have 100 workers once you the server receives The gradients from 90 workers then it will update the mod parameters So traditionally the server has to with all the gradients from all the all the workers It means 100 workers. So in this case once it receives the 90 Gradients from 90 workers, it will continue to update the mod parameters So you will not wait for all the workers Yeah, so next we discuss some techniques for the memory optimization So in our group there are many Students will work on the memory Memorated base. 
The memory of a training algorithm is used by the parameter values, the parameter gradients, the layer feature values, and their gradients, so typically it will be four times the parameter size, or even larger. But GPUs have a smaller memory size than CPUs. So what can we do to save memory? Some traditional solutions are to use a memory pool or garbage collection. In earlier deep learning software, maybe one or two years ago, the software would allocate the memory for each variable for the whole life cycle. For the feed-forward models, the data flows from the bottom to the top, so at some point the variables from the bottom layers will not be used again in this iteration and could be freed; but some systems keep these variables for the whole life cycle, so that memory is wasted. If you use garbage collection, for example in Python, which has this kind of technique, the memory is freed automatically. Another technique is swapping. We know that GPUs have limited memory but CPUs have larger memory, so we can swap variables between GPU and CPU. This is specific to certain models, especially very deep neural networks: the variables of the bottom layers will not be used for a long time, because we have to forward-propagate to the top layers, so these bottom-layer variables, for example their parameters, will only be needed again once the propagation reaches the top and comes back.
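The swapping idea can be sketched with a toy resident-set manager that keeps only a few tensors "on the device" and moves the rest to host memory (a pure Python illustration with an LRU eviction policy standing in for a real cost model; `SwapManager` is a made-up name):

```python
import numpy as np

class SwapManager:
    """Toy sketch of GPU<->CPU swapping: keep at most `capacity` tensors
    "on the device"; when there is no room, evict the least-recently-used
    tensor to host memory and swap it back in on demand."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.device = {}   # name -> tensor resident "on the GPU" (LRU order)
        self.host = {}     # tensors swapped out to CPU memory

    def get(self, name):
        if name not in self.device:                    # swap the tensor back in
            if len(self.device) >= self.capacity:
                victim = next(iter(self.device))       # least recently used
                self.host[victim] = self.device.pop(victim)
            self.device[name] = self.host.pop(name)
        else:                                          # refresh the LRU order
            self.device[name] = self.device.pop(name)
        return self.device[name]

    def put(self, name, tensor):
        if len(self.device) >= self.capacity:
            victim = next(iter(self.device))
            self.host[victim] = self.device.pop(victim)
        self.device[name] = tensor

mgr = SwapManager(capacity=2)
for layer in ["w1", "w2", "w3"]:      # three layer tensors, room for two
    mgr.put(layer, np.ones(4))
resident = set(mgr.device)            # "w1" was swapped out to the host
```

A real system would use the known per-iteration data flow, rather than LRU, to decide what to swap and when to prefetch it back.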
So in this case, the parameters of the bottom layers can be swapped to the CPU, and when we want to use them we swap them back to the GPU; this is called swapping. Another technique is called dropping variables. Because of the forward and backward propagation, after one layer finishes its forward propagation we can drop its variables, even though they will be needed later. We select some layers and drop their variables; we do not drop the variables of all the layers, just some of them. Later, when we want to use a dropped variable, we re-compute it. It is like the redo operation in failure recovery in databases: in a database you have the log, the transaction records, so you know how to re-compute a record at some time point based on checkpoint files or some other persistent storage. In this case, some variables are still in GPU memory, so you can use those variables to re-compute the dropped ones. This is called variable dropping. Actually, all of this can be done at runtime, because for every SGD iteration the computation pattern, the data flow, is the same; the order of the operations is the same. So after a few warm-up iterations, you know when each variable will be used again, and then you can schedule the dropping or swapping operations: select a subset of variables, ideally an optimal subset, to drop or to swap, and estimate when to swap the variables back from CPU to GPU. You can do some cost modeling to estimate the best time to drop or swap. [Audience] A question about doing this in Python:
with the benefit you get from doing what you just said with the variables and so on, would it be better to instead take certain parts of the program and implement them in C? Python is by definition not very efficient in handling certain operations; that's why some parts of Python programs get converted, like NumPy for example, where I know a lot of it is actually converted so it's closer to C. So there is a problem with Python, the interpreter problem. You try to take Python as it is, as an interpreted language, and do some magic with the variables for efficiency, and I was just thinking maybe it's better to take those crucial parts, implement them in a more low-level language, and then use that as a module. [Speaker] I think these optimizations are very specific to the deep learning models. If you want to use deep learning as a black box, then these optimizations are low-level optimizations that the users do not need to care about. Because we are developing the software, we think we should do these optimizations for the users, so the users do not need to care about these low-level details; we try to do this automatically, so users do not need to know the underlying low-level optimizations. I think it is like garbage collection in Python: Python does these things automatically, and the users do not need to know how the memory is managed. [Audience] Right, but it just sounds very complex the way you explain it. So then I thought, if it's in any case going to be complex, maybe it's better to just super-optimize those parts and then make it the black box. [Speaker] I think it depends on the area you are working in. If you just want to use deep learning models, you don't care about the
You don't care about the Level optimizations. I mean you don't want to do this in low-level optimizations, right? We want to use the other black box So what languages? So seeing us basically implement this Yeah, yeah, yeah, yeah, yeah, we'll have the Python binding or Python Rappers Yeah, here is an overview or comparison of the existing systems We we do not care of every aspect of them, but just the optimizations. We mentioned in the previous slides including parallelized operations I mean the schedule operation scheduling Yeah, we know that actually tensorflow and the Seattle They have the ability to do this kind of optimization, but we don't know that they they haven't show the Details of whether they have done it or not We know that the mxnet has done this one Yeah And for torch cafe currently, I think it is really hard to do this kind of this low-level optimizations Because they are based on not the tensor level. I mean the operation is away I mean We will talk about the tensor for it operations. So coffee and torch are based on More like layer level or so touch has a tensor operations Yeah, the execution plan tooling the company now has done the memory swiping some of them have done Yeah, but was down manually. Yeah, so we we think there could be some opportunities to do it automatically Yeah, generally, I will mean that there are some opportunities for optimizing The deep learning training systems from this aspect Consistency we mean the synchronous or asynchronous Yeah, so here is brief progress of singer. Yeah, so Prep at the initial at the initial stage We we think that distributed training would be very important for this deep learning models because deep learning models use a lot of memory and CPU and GPU so we wait for the first version we 0.1 to 0.3 We focused on distributed training, but later we found that Not everyone would use the distributed training especially for some data scientists and some Some data science experts. 
They just want to use deep learning as a black box. So for version 1.0 we redesigned some parts to optimize for this use as a black box and to provide an easier user interface for single-GPU training and standalone training. Let me give some details here. In the previous versions we had the NeuralNet abstraction, which is inherent to deep learning models. For example, we used this abstraction to create the feed-forward models; the RNN models can be unrolled into feed-forward models, and it also worked for the energy models. Then we could do partitioning for distributed training, with different partitioning methods and different distributed training approaches: synchronous, asynchronous, and hybrid training, which we implemented in the previous versions. We proposed a flexible architecture to construct different distributed training frameworks, synchronous and asynchronous, and the popular training algorithms. Here are some experiments we did before on the CPU and GPU machines. Generally, for some models, not every model, if you use more machines or more GPUs or CPUs, the speed increases. This line is one GPU, and not a very good GPU, so if you use more CPUs you can beat it. This one was trained on a single node in our cluster; here we noticed that the training was not very stable, due to the staleness we mentioned before: once you have more workers, the staleness becomes larger, and that affects the convergence. Later we did some more experiments using a GPU machine and a GPU cluster, and this figure shows training on a single node with one to three GPU cards. So we see that our system performs well, and
So we see that our system is performs well and It is compatible with other systems for single node single node multiple GPU cards But for the distributed the training I mean in a GPU cluster. So we found the problem For this model the Alex lead model is a very large model has a large set of parameters So it means that communication overhead is very large So this is the performance for a single worker and Immediately to see a single node and this one for two nodes and the phone notes We can see that the performance is bad for multiple nodes. So this is caused by the Communication so we tried another model or we compare For the TensorFlow the case is the same is the same. I will try another model. This model has a very very little parameter set but its Computation is very high the computation cost is very high So the communication overhead is slow is small but the computation cost is high So in this case if we do distributed training that we would have more The training that the performance would be better node node means one machine Yeah, so this is one machine and this for machine They use that they have the same configuration for the CPU and GPU Number one question here means the number of GPU cards One GPU cards will one worker will run on one GPU card Yeah, one one here for this experiment way around one work one machine with one GPU card and he is the The software stack for the new version. So We tried to make it general or extensible to to support the optimizations Techniques were mentioned before so we have the Low-level abstractions including device and tensor. So for device it is on general as abstract Which would support different hardware including the CPU GPU FPGA from other and Other hardware. 
How did we implement it? We have different concrete devices: the CPU device, a GPU device implemented with the CUDA programming language, and another implemented with OpenCL — OpenCL can also target devices other than GPUs. For the Tensor, we implement a separate set of math and linear-algebra operations for each device, so we can operate on tensors on different hardware in a uniform way. On top of these core components there are components specific to machine learning and deep learning models: Layer for the layers of deep learning models, Initializer for initializing the parameters of machine learning models, plus loss functions, metrics, and optimizers. There is also the IO part, which covers data reading and writing and network communication, i.e., message transfer.

With this new version we want to support general machine learning applications through the tensor abstraction, support different hardware, and do some runtime optimization for speed and memory. We will also continue to support distributed training from the previous versions.

The final part is the product: we are trying to build deep learning as a service. That is, we train the models, because typically companies and users want to use deep learning in their applications rather than run the training themselves; they want to use pre-trained models. The service has some popular deep learning models built in by default, so all the users do is provide their data, and we respond with the prediction results.
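The device/tensor layering described above — a tensor carries its data plus the device that executes its operations, and each device supplies its own kernels — can be sketched like this. The class names are hypothetical stand-ins, not SINGA's actual classes, and both backends are stubbed with numpy here:

```python
import numpy as np

class Device:
    """Abstract device: each backend supplies its own math kernels."""
    def matmul(self, a, b):
        raise NotImplementedError

class CppDevice(Device):
    # CPU backend: plain C++/BLAS in the real system; numpy stands in here.
    def matmul(self, a, b):
        return a @ b

class CudaDevice(Device):
    # GPU backend would call CUDA kernels; stubbed with numpy here.
    def matmul(self, a, b):
        return a @ b

class Tensor:
    """A tensor holds its data plus the device that runs its operations,
    so the same model code works on any hardware backend."""
    def __init__(self, data, dev):
        self.data = np.asarray(data, dtype=np.float32)
        self.dev = dev
    def __matmul__(self, other):
        return Tensor(self.dev.matmul(self.data, other.data), self.dev)

x = Tensor([[1.0, 2.0]], CppDevice())
w = Tensor([[1.0], [1.0]], CppDevice())
y = x @ w  # dispatched through the CPU device's kernel
```

Swapping `CppDevice()` for `CudaDevice()` changes which kernels run without touching the model code — the design goal of the Device/Tensor split.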
So it is just like a black box: we manage the training and everything else, you just give us the data and we respond with the result. It is exposed as an API — we provide a REST API.

[Audience] Would that run within a cloud service?

Yes, we would run the servers on Amazon or another cloud.

[Audience] The thing is that when you do data transfer across an API call, the cost is huge, especially when moving large amounts of data. So you kind of have to be within the same cloud — that's more practical, I think.

The training data set for a model is very large, but at runtime — after you deploy the model — the data arrives continuously and is much smaller than the training data. For image classification, for example, each request may carry only one or two images.

[Audience] Right, that is relatively smaller. So you're serving the trained models, not doing the training?

For the first scenario, yes: I have trained a good image classification model, and you just send me the images you want to label. The other case is that I know how to train an image classification model but I don't have the data; you have the data, so I train a model on your data. For that second case we would manage the data offline: you pass me the data in some way — for example, you upload it to S3 or another Amazon storage service — and then I fetch it. Since I can run the service on Amazon and access S3, that works.

[Audience] The reason you want to do it like this — is it because the training and configuration is an expert job that needs you or someone similar to set it up? Or is it because you want to keep the source code closed, or will it be open source?

SINGA itself is open source.
For this part, we think not everyone wants to tune hyperparameters or debug different models, so we want to make it easy and convenient for ordinary users.

Here are some applications; I will go through them quickly. The food project is a project in our group that does food image recognition: it recognizes the food in an image and responds to users with the nutrition facts, by combining the recognition with a knowledge base of nutrition facts. Another one is malware detection: we have a malware data set and run classification on it. Deep learning is definitely good at image classification, but we are not sure whether it is good for malware detection — we are just trying it. There is also healthcare data, for disease progression modeling and other time-series tasks, and a spatio-temporal application where we try to predict road congestion.

I will skip the rest. Here is the conclusion: deep learning is effective but hard to train and optimize. We discussed the optimization techniques used in existing systems and our own perspective, the database perspective — actually we didn't get to talk about that part in detail — and we are working on optimizing speed and memory, among other things.

Now, to show that it actually works, some simple demos. We built the core in C++, and for the model components — the layers, initializers, optimizers and so on — we use Python; we have Python bindings and wrappers. So the Tensor abstraction, Device, and Optimizer are all available in Python. I will show some simple examples here. The first example is very simple.
These are just some demos of using the Python interface to run SINGA functions.

First, a simple demo. I will generate some data: here is a boundary, and there are two sets of data points, i.e., two labels. We will build a simple model — a perceptron — to classify the data. First I set the boundary, then I generate the data for the two sets: one is the positive set and the other is the negative set, shown as the dots here, this one and this one. After that I initialize the model. The model has a linear transformation layer, which has a weight matrix that I initialize randomly and a bias vector that I set to zero.

Then I call the training function. It first converts the data: each data point has an x and a y value, so it is two-dimensional, and there are 30 data points, so the shape of the data is 30 by 2 — 30 rows and two columns. It converts the numpy array into a SINGA tensor; this is the data and this is the label. Then it sets the engine of the layer: as mentioned, each layer has different implementations — for CPU, for GPU using CUDA, or using OpenCL — and the engine selects among them. Here this means we use the C++ (CPU) version. We create one layer that does the linear transformation: it multiplies the input features with the weight matrix (the parameter) and adds the bias vector. After that we apply a nonlinear transformation, called the activation.
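The setup described above — 30 two-dimensional points, two labels split by a linear boundary, Gaussian-initialized weights, and a zero bias — can be sketched with plain numpy. The boundary, seed, and value ranges are illustrative stand-ins, and numpy arrays stand in for the SINGA tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

# 30 two-dimensional points; labels come from an assumed linear boundary.
x = rng.uniform(-1.0, 1.0, size=(30, 2))          # shape (30, 2): 30 rows, 2 cols
y = (x[:, 1] > 0.5 * x[:, 0]).astype(np.float32)  # two labels: 0 and 1

# Linear layer parameters, matching the initialization in the demo:
# Gaussian-distributed weight matrix, zero bias vector.
w = rng.normal(0.0, 0.1, size=(2, 1)).astype(np.float32)
b = np.zeros((1,), dtype=np.float32)

logits = x @ w + b  # the layer's linear transformation: features * weights + bias
```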
We use the logistic (sigmoid) function. So we construct these two layers, and then I initialize the parameters: the weight matrix follows a Gaussian distribution and the bias vector is set to zero. The optimization algorithm is SGD, so I set up the optimizer and the loss function. These are the necessary components for training a deep learning model: the optimizer, the loss function, the layers, and the parameter initialization.

Then I run five iterations. In each iteration I forward the training data through each layer — the dense layer, the activation layer, and then the loss function — and obtain a loss value. Then I backward the gradients from the top layer to the bottom layers and get the gradient for each layer; this one is the gradient for the parameters of the dense layer. Then I apply the optimizer to update the parameters: this is the parameter's gradient, this is the parameter's value, and this is the tag identifying the parameter. I do the update, and then I show the plot.

This is the original graph, the original data points; the different colors represent the different labels. Here, after the first iteration, you can see that some data points are misclassified — this point should be red, but it was misclassified. It's just a simple example; you may need to tune the hyperparameters to get all the data points colored correctly.
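The whole loop described above — forward through the dense layer, the sigmoid activation, and the loss; backward for the gradients; SGD update; five iterations — looks roughly like this in plain numpy. This is a self-contained sketch, not the SINGA API; the data, seed, and learning rate are assumed for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(30, 2)).astype(np.float32)
y = (x[:, 1] > 0).astype(np.float32).reshape(-1, 1)

w = rng.normal(0.0, 0.1, size=(2, 1)).astype(np.float32)  # Gaussian init
b = np.zeros((1, 1), dtype=np.float32)                    # zero bias
lr = 0.5  # SGD learning rate: the update step size mentioned in the talk

losses = []
for step in range(5):                        # five iterations, as in the demo
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # forward: dense layer, then sigmoid
    loss = -np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
    losses.append(float(loss))
    grad_logits = (p - y) / len(x)           # backward through loss + sigmoid
    grad_w = x.T @ grad_logits               # gradient for the dense layer weights
    grad_b = grad_logits.sum(axis=0, keepdims=True)
    w -= lr * grad_w                         # SGD update: apply gradients
    b -= lr * grad_b
```

Across the five iterations the recorded loss values generally decrease, which is what the loss plot in the demo shows.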
This is the ground truth, and this is the prediction of the model we trained; you can tune the hyperparameters here. This plot shows the loss values: generally we want to train the model so the loss gets smaller and smaller. You can see the first iteration, the second one, and so on — it decreases. And you can see there are many hyperparameters to tune: how to initialize the model parameters, for example with a Gaussian or a uniform distribution; the learning rate of the SGD algorithm, which is the step size of the update applied to the model parameters; and how many iterations to run. So this is a simple example. Currently we only use multidimensional arrays.

[Audience] What is this coefficient?

The weight decay is the coefficient of the L2 regularizer. The regularizer is another component; you can add a regularizer here, and currently we use the L2 regularizer.

[Audience] Can you implement another one?

Yes, it is customizable — you can provide another one.

[Audience] What about this one?

This is one layer, a built-in layer; here it could be sigmoid, tanh, ReLU, or some others. For anything else you would have to implement a new layer.

This next one is a different model — a shallow model with only two layers, of the kind used from around 2006 onward to initialize deep learning models. It is not a feed-forward model, and it uses another training algorithm to compute the gradients, so I will not show the data, just run it. Generally you will find the same flow of setting up the layers, but for this example we do the tensor operations directly — multiply and so on — and we have the optimizers and the loss functions as before. There is a plotting problem here, but I can show you some results.
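The weight-decay coefficient discussed above scales an L2 penalty that is typically folded into the gradient before the SGD step. A minimal sketch (the function name and default values are assumptions for illustration):

```python
import numpy as np

def sgd_update(w, grad, lr=0.1, weight_decay=0.01):
    """SGD step with L2 regularization ("weight decay") folded into
    the gradient: the penalty 0.5*wd*||w||^2 contributes wd*w."""
    return w - lr * (grad + weight_decay * w)

w = np.array([1.0, -2.0])
w_plain = sgd_update(w, np.zeros(2), weight_decay=0.0)
w_decayed = sgd_update(w, np.zeros(2), weight_decay=0.01)
# With a zero loss gradient, weight decay alone shrinks the parameters
# toward zero, which is exactly the regularizing effect.
```

Making the regularizer customizable then just means swapping the `weight_decay * w` term for another penalty's gradient (e.g. `weight_decay * np.sign(w)` for L1).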
Here is the weight matrix of that layer: this is the randomly initialized weight matrix, the first one; this is it after the first iteration; and this one after that. Generally it shows that the model is starting to learn some patterns; through the weight matrix you can see the patterns become clearer after you run more iterations.

Next I will show how to create a convolutional neural network — a simple way to construct the neural net using the Python API; Python is really good for these things. Previously we used C++, and the code was verbose; currently it is much better. Here we add the layers: this is a feed-forward net, as mentioned before. You add the layers, and this is just a function — this one adds multiple layers. Different layers have different hyperparameters, and you add them one by one in this way.

This model is already trained; it was trained over the Linux kernel code. The model is from other people — we just retrained it, using SINGA's Python bindings, to show that we can do it. After training we use the model to sample, i.e., to generate data. Let me run it again. What we do is generate 100 characters given some seed characters, like these. Because the model is trained over the Linux kernel code, the data it generates will look like Linux kernel code. This is called a language model: it captures the sequential relationships. We input the first four or five characters, and it generates the rest in this way. Here is the generated text — this one looks better; the first three lines are actually grammatically correct. That's all for this one.
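The character-level sampling described above — provide a few seed characters, then repeatedly draw the next character from the model's distribution — can be sketched as follows. `next_char_probs` is a stand-in for the trained language model (here a trivial uniform distribution, not the Linux-kernel model), and all names are illustrative:

```python
import numpy as np

def generate(seed, next_char_probs, vocab, n=100, rng=None):
    """Append n sampled characters to the seed, each drawn from the
    model's distribution conditioned on the text so far."""
    rng = rng or np.random.default_rng(0)
    out = list(seed)
    for _ in range(n):
        p = next_char_probs(out[-1])        # probability over the vocabulary
        out.append(rng.choice(vocab, p=p))  # sample the next character
    return "".join(out)

vocab = list("int ")
# Stand-in model: uniform over the vocabulary regardless of context.
uniform = lambda c: np.full(len(vocab), 1.0 / len(vocab))

text = generate("int ", uniform, vocab, n=100)
```

With a real trained model, `next_char_probs` would condition on the whole history (the RNN's hidden state), which is what makes the output resemble kernel code rather than random characters.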
I just wanted to show some Python code. Thank you.

[Audience] Are you running this on your own infrastructure?

Do you mean the platform, the hardware? It is on the cloud — Amazon. We run on Amazon and connect to it.

[Audience] Is this Dockerized?

Yes, Docker. We haven't published the Docker image — we just created this one yesterday, so we haven't published it yet. Maybe later.