I'm with the data science and machine learning community, together with Google Thailand. And I am Tu Pham. Can you turn on the mic? Okay. I'm Tu Pham, a Google Developer Expert on Cloud Platform and the CTO of Eway. We are an internet company working on online advertising, with the CPA, CPS, and CPI models, and we work with Lazada, Baidu, VHG, Combo, and a lot of other partners.

At first we worked purely as an advertising and affiliate company, but then suddenly we had the data, and we had the question of how to monetize that data. So we stored all the data in a data warehouse and looked for ways to build more businesses on top of it, with companies like eDoctor and Dyno buying our data. With this data we can run platforms for recruiting, social credit scoring, or healthcare. Currently we run in five countries, with about 200 terabytes of data and five terabytes of new raw data every day, coming from telcos, from mobile traffic, from internet traffic, and from social networks.

The next big thing we are bringing to the market is credit scoring. You may have heard of companies that build a credit score for every customer of their app, such as ride-hailing apps. That's what we are building: we have the big data, and with data mining algorithms we can identify users and do social user modeling. All the data we collect flows in through our advertising network and every other kind of network and partner.

This is a sample profile, my own profile, mined only from my social data. Number one is politics and socializing, and it is generated purely from my social activity: what I talk about on the internet, where I check in, who my friends are. Here is what I love to do in my free time. With the rankings you can even estimate how much money I have, from the star ratings of the hotels and coffee shops I visit. And then there is the credit scoring. Imagine you were a bank looking at a loan application and deciding whether it is fraud or not: what kind of information, what kind of data would you want in order to decide? You can see age, gender, number of active days, marital status, number of subscriptions, number of posts, number of friends. Every kind of data may correlate with your target, and you can mix it with offline data. The offline data comes from partner sources, and you can mix the online and the offline data together to produce a score for every customer you want.
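To give a flavor of the idea, here is a minimal sketch, not our production pipeline: it joins hypothetical online and offline fields into one profile per customer and scores a new application with a simple model. All field names, numbers, and labels are invented.

```python
# A minimal sketch of mixing online and offline data for scoring.
# All fields and values are hypothetical, for illustration only.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Online features mined from social / internet activity.
online = pd.DataFrame({
    "user_id":     [1, 2, 3, 4],
    "num_posts":   [120, 3, 45, 300],
    "num_friends": [400, 12, 150, 900],
})

# Offline features, e.g. from a telco or bank partner.
offline = pd.DataFrame({
    "user_id":    [1, 2, 3, 4],
    "age":        [34, 22, 41, 29],
    "is_married": [1, 0, 1, 0],
})

# Mix the online and the offline data into one profile per customer.
profiles = online.merge(offline, on="user_id")

# Toy labels: 1 = the past application turned out to be fraud.
labels = [0, 1, 0, 0]

model = LogisticRegression().fit(profiles.drop(columns="user_id"), labels)

# Score a new applicant: probability that the application is fraud.
new_applicant = pd.DataFrame(
    [[80, 250, 30, 1]],
    columns=["num_posts", "num_friends", "age", "is_married"])
print(model.predict_proba(new_applicant)[0][1])
```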
This next slide... who here is familiar with AI, or machine learning? Only a few. So I will play this video to explain.

From detecting skin cancer, to sorting cucumbers, to detecting escalators in need of repair, machine learning has granted computer systems entirely new abilities. But how does it really work under the hood? Let's walk through a basic example and use it as an excuse to talk about the process of getting answers from your data using machine learning. Welcome to Cloud AI Adventures. My name is Yufeng Guo. On this show, we'll explore the art, science, and tools of machine learning.

Let's pretend that we've been asked to create a system that answers the question of whether a drink is wine or beer. This question-answering system that we build is called a model, and this model is created via a process called training. In machine learning, the goal of training is to create an accurate model that answers our questions correctly most of the time. But in order to train a model, we need to collect data to train on. This is where we will begin.

Our data will be collected from glasses of wine and beer. There are many aspects of the drinks that we could collect data on, everything from the amount of foam to the shape of the glass. But for our purposes, we'll take just two simple ones: the color, as a wavelength of light, and the alcohol content, as a percentage. The hope is that we can split our two types of drinks along these two factors alone. We'll call these our "features" from now on: color and alcohol.

The first step of our process is to run out to the local grocery store, buy a bunch of different drinks, and get some equipment to do our measurements: a spectrometer for measuring the color, and a hydrometer to measure the alcohol content. It appears that our grocery store has an electronics hardware section as well. Once our equipment and the drinks are all set up, it's time for our first real step of machine learning: gathering data. This step is very important, because the quality and quantity of data that you gather will directly determine how good your predictive model can be. In this case, the data we collect will be the color and alcohol content of each drink. This will yield a table of color, alcohol content, and whether the drink is beer or wine. This will be our training data.

So, a few hours of measurements later, we've gathered our training data (and had a few drinks, perhaps), and now it's time for the next step of machine learning: data preparation, where we load our data into a suitable place and prepare it for use in training. We'll first put all our data together, then randomize the order. We wouldn't want the order of our data to affect how we learn, since that's not part of determining whether a drink is beer or wine. In other words, we want the determination of what a drink is to be independent of what came before or after it in the sequence. This is also a good time to do any pertinent visualizations of your data, helping you see whether there are any relevant relationships between different variables, as well as showing you whether there are any data imbalances. For instance, if we collected way more data points about beer than about wine, the model we train would be heavily biased toward guessing that virtually everything is beer, since it would be right most of the time. However, in the real world the model may see beer and wine in equal amounts, which would mean it would be guessing "beer" wrong half the time.

We'll also need to split the data into two parts. The first part, used in training our model, will be the majority of our dataset. The second part will be for evaluating our trained model's performance. We don't want to use the same data that the model was trained on for evaluation, since then the model could just memorize the questions, just as you wouldn't use the questions from your math homework on the math exam.
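A minimal sketch of that preparation step, using pandas and scikit-learn with made-up measurements, might look like this:

```python
# Sketch of the data-preparation step: shuffle the collected
# measurements, then hold part of them out for evaluation.
import pandas as pd
from sklearn.model_selection import train_test_split

drinks = pd.DataFrame({
    "color":   [520, 610, 545, 600, 530, 615],     # wavelength, nm
    "alcohol": [13.0, 4.7, 12.5, 5.1, 12.8, 4.9],  # percent
    "label":   ["wine", "beer", "wine", "beer", "wine", "beer"],
})

# Randomize the order so the collection sequence cannot influence
# training, then split roughly 70/30 into train and evaluation sets.
train, evaluation = train_test_split(
    drinks, test_size=0.3, shuffle=True, random_state=42)
```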
Sometimes the data we've collected needs other forms of adjustment and manipulation: things like de-duplication, normalization, error correction, and others. These would all happen at the data preparation step. In our case, we don't have any further data preparation needs, so let's move forward.

The next step in our workflow is choosing a model. There are many models that researchers and data scientists have created over the years. Some are very well suited for image data, others for sequences such as text or music, some for numerical data, and others for text-based data. In our case, we have just two features, color and alcohol percentage, so we can use a small linear model, which is a fairly simple one that will get the job done.

Now we move on to what is often considered the bulk of machine learning: training. In this step, we will use our data to incrementally improve our model's ability to predict whether a given drink is wine or beer. In some ways, this is similar to someone first learning to drive. At first, they don't know how any of the pedals, knobs, and switches work, or when they should be used. However, after lots of practice and correcting of mistakes, a licensed driver emerges. Moreover, after a year of driving, they've become quite adept: the act of driving and reacting to real-world data has adapted their driving abilities. We will do this on a much smaller scale with our drinks.

In particular, the formula for a straight line is y = mx + b, where x is the input, m is the slope of the line, b is the y-intercept, and y is the value of the line at the position x. The values we have available to adjust, or "train", are only m and b. There is no other way to affect the position of the line, since the only other variables are x, our input, and y, our output. In machine learning there are many m's, since there may be many features. The collection of these values is usually formed into a matrix denoted W, for the "weights". The corresponding values for b are gathered together and called the biases.

The training process involves initializing some random values for W and b and attempting to predict the outputs with those values. As you might imagine, it does pretty poorly at first. But we can compare our model's predictions with the output it should have produced, and adjust the values in W and b so that we get more accurate predictions the next time around. This process then repeats; each iteration, or cycle, of updating the weights and biases is called one training step. Let's look at what that means for our example. When we first start training, it's as if we drew a random line through the data. Then, as each step of training progresses, the line moves, step by step, closer to the ideal separation of the wine and the beer.
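A toy version of that training loop in plain NumPy (random W and b, predict, compare, adjust), with invented measurements:

```python
# Toy training loop: initialize W and b at random, predict, compare
# with the true labels, and nudge W and b a little each step.
import numpy as np

rng = np.random.default_rng(0)

# Features: [color (scaled), alcohol %]; label 1 = wine, 0 = beer.
X = np.array([[5.20, 13.0], [6.10, 4.7], [5.45, 12.5], [6.00, 5.1]])
y = np.array([1, 0, 1, 0])

W = rng.normal(size=2)   # the weights
b = rng.normal()         # the bias
learning_rate = 0.1

for step in range(1000):             # each pass is one training step
    z = X @ W + b
    pred = 1 / (1 + np.exp(-z))      # squash the line's output to 0..1
    error = pred - y
    W -= learning_rate * (X.T @ error) / len(y)  # move the line
    b -= learning_rate * error.mean()

print(np.round(pred))  # should now match y: [1, 0, 1, 0]
```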
Once training is complete, it's time to see if the model is any good, using evaluation. This is where the dataset we set aside earlier comes into play. Evaluation lets us test our model against data that has never been used for training, so we can see how the model might perform against data it has not yet seen. This is meant to be representative of how the model may perform in the real world.

A good rule of thumb I use for the training-evaluation split is somewhere on the order of 80/20 or 70/30. Much of this depends on the size of the original source dataset: if you have a lot of data, perhaps you don't need as big a fraction for the evaluation set.

Once you've done evaluation, you may want to see whether you can further improve your training in any way. We can do this by tuning some of our parameters. There were a few that we implicitly assumed when we did our training, and now is a good time to go back, test those assumptions, and try other values. One example of a parameter we can tune is how many times we run through the training dataset during training. We can actually show the data multiple times, and by doing so we can potentially reach higher accuracy. Another parameter is the learning rate. This defines how far we shift the line during each step, based on the information from the previous training step. These values all play a role in how accurate our model can become and how long the training takes.

For more complex models, initial conditions can play a significant role as well in determining the outcome of training. Differences can be seen depending on whether training starts with values initialized to zeros versus some distribution of values, and on what that distribution is. As you can see, there are many considerations at this phase of training, and it's important that you define what makes a model good enough for you; otherwise we might find ourselves tweaking parameters for a very long time. These parameters are typically referred to as hyperparameters. The adjustment, or tuning, of these hyperparameters remains a bit more of an art than a science, and it's an experimental process that heavily depends on the specifics of your dataset, model, and training process.

Once you're happy with your training and your hyperparameters, guided by the evaluation step, it's finally time to use your model to do something useful. Machine learning is using data to answer questions, so prediction, or inference, is the step where we finally get to answer some of them. This is the point of all this work, where the value of machine learning is realized. We can finally use our model to predict whether a given drink is wine or beer, given its color and alcohol percentage. And the power of machine learning is that we determined how to differentiate between wine and beer using our model, rather than using human judgment and manual rules.

You can extrapolate the ideas presented today to other problem domains as well, where the same principles apply: gathering data, preparing that data, choosing a model, training it and evaluating it, doing hyperparameter tuning, and finally, prediction. If you're looking for more ways to play with training and parameters, check out TensorFlow Playground, a completely browser-based machine learning sandbox. So those are the seven steps of machine learning.
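To show where these hyperparameters actually live in code, here is a minimal sketch with TensorFlow's Keras API, on the same made-up drink data. The learning rate, the number of epochs, and the weight initializer are exactly the knobs the video describes.

```python
# Hyperparameters as code: learning rate, epochs, and initialization.
import numpy as np
import tensorflow as tf

X = np.array([[5.20, 13.0], [6.10, 4.7], [5.45, 12.5], [6.00, 5.1]])
y = np.array([[1.0], [0.0], [1.0], [0.0]])  # 1 = wine, 0 = beer

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(
        1, activation="sigmoid",
        kernel_initializer="glorot_uniform"),  # the initial-condition choice
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # learning rate
    loss="binary_crossentropy")

# epochs = how many times the training data is shown to the model.
model.fit(X, y, epochs=100, verbose=0)
```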
Now, back to us. You may want to ask a question: if you and your competitor have the same kind of data, and the same amount of data, what makes the difference? The difference is the productivity of your data science team. So how do you raise the productivity of your data science team? You must have a workflow; you must have a working process.

Three years ago, when we had small data and a complex workflow, we had a lot of people doing management jobs like copying, pasting, and moving data around. We took what looked like the easy path: let every employee put the data in whatever kind of storage they felt comfortable with. So they put the data in CSV files, in MongoDB, in every kind of data storage. After that, our common task was to extract lots of data from all of it and turn it into value. But no one can deliver real value when they spend their time on ETL jobs. That was the problem. We tried to patch the process, but nothing worked, so we decided to redesign our flow.

Here was the very first flow of our team. Step one: remember where I saved the data. It's important; sometimes I forgot where I saved it. Then use a tool to extract the data, whether with SQL, Spark, or pandas. Then download the data: for security purposes I had to download it, drop some fields, and send it to the data science team, who would then train the model and deploy it. If something went wrong: back to step one. That was the problem. The productivity of our data science team was very low, and the accuracy of our models was low too.

So I realized that the machine learning code is only the small box in the center of this picture. You need the configuration system, the data collection, the feature extraction, the process management, the serving, the network, the infrastructure, the monitoring, everything, to bring an MVP product with data mining and machine learning to market. When you realize that something is wrong, step back and think about what is wrong and what you can do automatically rather than manually. At first this might slow your company down, but after a few months it is much, much faster. And it follows the "don't repeat yourself" principle.

So here are the principles we used to design the workflow: simple; don't repeat yourself; single responsibility; the ability to scale out; and keep the cost low. If you have a complex flow and you cannot split it into modules, you cannot keep the cost low, because you must scale out everything together.

Here is our flow. First, we have the load balancer, and we collect the raw data onto Compute Engine. Compute Engine converts the raw files to Parquet files and uploads them to Cloud Storage. From there, our data science team can use Datalab to read the data from Cloud Storage, train the models, and then deploy them again. No one needs to do that kind of manual job anymore. That is the full flow of my company, and the key to this flow is that there is only one data warehouse. We use Cloud Storage, only Cloud Storage.

Here is how it maps onto GCP: Compute Engine, Cloud Storage, Datalab, and the Machine Learning Engine; you can see the list. I will explain step by step why we use this model, why we use this kind of technology and not another.

Start with the load balancer and Compute Engine. What I love about Compute Engine is the high performance and low cost, with fast networking: you can see the figures for the network capacity inside Google's network. Because Google has its own private network, you get a very fast and reliable network. When you put your data in several regions, as I do, with one in Singapore and one in the US, another machine can pull the data from both regions to process it.

Then we convert the raw data to Parquet, because it gives high-performance queries and it is self-describing. You can see the comparison between CSV and Parquet; of course Parquet is much better. The key to the performance, the key to everything, is how the engine stores the data. Parquet is columnar: it stores the data in columns, not rows, so it can select only the columns you ask for. In SQL, in a row-oriented database, if you want to read one cell you must read through the whole block that stores all the information of the row and then pick out the cell. But in a columnar format you read only the column that stores the data. If you want to select "age", you get "age" only. It is much smaller and much faster.
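A minimal sketch of that conversion step with pandas (assuming the pyarrow engine is installed; the file names are placeholders):

```python
# Convert a raw CSV file to Parquet, then read back a single column,
# which is exactly what a columnar format makes cheap.
import pandas as pd

raw = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age":     [34, 22, 41],
    "gender":  ["m", "f", "m"],
})
raw.to_csv("raw.csv", index=False)

# The Compute Engine side: raw file in, Parquet file out.
pd.read_csv("raw.csv").to_parquet("events.parquet")

# A row format like CSV must scan whole rows; with Parquet we can
# pull one column: "if you want age, you get age only".
ages = pd.read_parquet("events.parquet", columns=["age"])
print(ages)
```

With gcsfs installed, the same pandas calls also work directly on gs:// paths in Cloud Storage.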
Why do we upload to Cloud Storage? Because it is fast, it is cheap, and it has storage classes. You can see here that there are four classes: regional, multi-regional, Nearline, and Coldline. For cold data, such as archive or backup data, you only need Nearline or Coldline. But data that serves customers every day needs to go into multi-regional or regional storage; that is a lot more expensive, but it is fast. The key to saving money is to understand your data model, understand what each kind of storage class offers, and know what you want as a company. We separate the data into multiple Parquet files and store them split by year, then by month, so we can select data easily and cheaply.

With Datalab, you can start from IPython, write Python code, run it, and pull the data from Cloud Storage, and you can run every kind of library, like scikit-learn, in your project; it runs on Compute Engine too. Once that works, you can choose TensorFlow or scikit-learn and move to the Machine Learning Engine to get scalability. There are two sides of machine learning on GCP: if you have your own data and want to train your own models with TensorFlow, you use the Cloud Machine Learning Engine; if you want ready-made services, such as image analysis or natural language processing, you use those APIs. Use the right side if it fits your problem; otherwise you have the left side.

So here are my tips. When you first create your system, principles are much more important than coding, than everything else. Design the system architecture, the data flow, the data model, and the data storage first. Then separate the real-time jobs and the batch jobs. And control the network and instance costs: read the dashboard every day, collect the metrics, and get alerts when something goes wrong. A lot of companies don't have a data-driven culture, and that is the problem. Sometimes they ask why their cost is so high. It is not high if you care about your data: if you care about your data, you talk about the data; if you don't care about the data and the platform you are running on, that is the source of the high cost. That is all. Thank you.