Is it on? Hi, everyone. OK, thanks for the introduction, and thanks for making it here even though I think it's still pouring rain. Today I'd like to talk about how you can create a Google App Engine project that supports a backend for high-volume data coming into your database.

A little bit about myself. I'm working with PropertyGuru right now as a senior software engineer. Before that I was in Malaysia; I came here around three and a half years back, and so far I'm liking Singapore, of course. I've been working with Google Cloud Platform for almost four and a half years. Before that, I co-founded an app called MyRent in Malaysia, and we had a Singapore version as well, called SGRent; there, too, we used Google Cloud Platform, and all the data and the whole backend lived there. One of the most interesting projects I did was related to machine learning and creating insights for people, and what I'm going to talk about today is a condensed version of what we did there: how we built a scalable backend, the problems we faced, and how we evaluated the technologies we were going to use. There is a whole range of storage technologies available on Google Cloud Platform, so which one do you choose? This case study will be quite useful for people who want to build an analytical kind of database, where a lot of data comes into the backend and from there they want to churn out insights and analytics.

A quick abstract: there are three things we have to do in this case study. We have to collect some data, feed this data to our backend, and at the same time keep a backup of it. By collecting, think of a scenario like Google Analytics, where you put a piece of code in your website, or a plug-in in your Android and iOS apps, and they keep collecting data and sending it to your backend. This data can be huge: when we were working on this project, for some clients we were getting something like 3 to 4 TB per day, and we were still a young startup, so we bumped into a lot of problems, which I'm going to talk about. So we collect data from multiple sources in different formats (websites, iOS apps, and Android), feed it into some kind of database, and at the same time back those files up into some cheap storage. Why cheap? Because we might already be spending a lot of money on this architecture by using technologies like Bigtable or BigQuery, so we wanted a backup that is not expensive, because we were still starting up.

So the case study is: store a huge amount of data at high volume. The requirements are that it should be easily scalable; that there should be no data loss, which was our primary concern, because with any technology you can have network failures or system failures, so we had to design it so it does not lose data; and that it should allow fast retrieval for crunching numbers and analytics. While studying this, we quickly identified that there would be a trade-off.
We had to choose between low latency and data consistency, and we settled for data consistency, because that was the most important thing for us. Can I quickly have a show of hands: who has worked with Google App Engine before? OK, still very few people, so I will cover the more basic stuff as well as we go.

These were the data stores we evaluated on GCP at that point in time; Cloud Spanner was not there yet as far as I remember, or maybe it was still in beta. So we had four main storage options: Cloud SQL, Cloud Datastore, Bigtable, and BigQuery. I'll quickly discuss their pros and cons.

Cloud SQL supports a standard query language, has primary and secondary indexes, relational integrity, and ACID transactions, and it's offered as a highly available managed service. It lets you pay without lock-in periods, so it works out quite cheap when you build a system. But at the same time it comes with all the performance limitations of any RDBMS like MySQL or PostgreSQL, and it does not scale well with huge data volumes. So it was definitely out as an option for us.

Then Cloud Datastore. It scales really gracefully from a small application as your usage keeps growing and growing. In my personal experience it's really good for building websites or mobile apps, because it gives you good read and write capabilities. But it still has limitations: it is optimized for smaller data sets, and we were looking for something that could take an even bigger data volume.

Bigtable could have been quite a good solution for us, because it gives you much lower latency than the other databases available, even lower than some of the AWS offerings. It's NoSQL, good for storing terabytes of structured data, and great for high-volume writes. However, it comes at a cost, and you might not want to spend that much money, because there is no scaling-down option: you have to spin up a minimum of three nodes.

Finally we came to BigQuery, and we found that this might be the thing for us, because it is a fast analytical database, and we needed a backend where we could run a lot of queries and churn out insights. It can run queries over terabytes of data in seconds. It lets you stream data, and streaming was a very good concept for us: you don't have to create batch jobs to insert data into your backend; you can literally stream data into BigQuery the same way you stream video. It was also quite economical as a long-term storage option, because the most expensive part of BigQuery is running queries; storing data is cheap. But it has consequences: small queries won't run as fast as you might expect (a SELECT of a few columns from a table in a regular RDBMS runs in milliseconds if not microseconds, but in BigQuery it might still take a second or so), and querying data can be expensive.

I quite recently saw this flow chart, and I think it can be quite useful for people who are just getting started.
It helps you make a quick decision based on what kind of data you have and what options that leaves you. In our case study: is our data structured? Yes. Is our workload analytical in nature? Yes. That brings me down to two options: Cloud Bigtable or BigQuery. Now, thinking from a Google Analytics-like perspective, you might have a lot of data coming from your users, but you might not need it available straight away with very low latency, because you give people their insights and analytics after a few hours, or maybe a few minutes. So we don't really require low latency; as long as data comes into our system consistently, that works. BigQuery is definitely the winner in this case: it's economical, it provides great analytics for crunching data, we don't need low latency, and it supports high-volume writes, which I will talk about a bit later.

So you might quickly build up this kind of structure in your head: I will have a lot of front-end applications, they will call my Python backend, and it will stream-insert into BigQuery. But if you build your system this way, you have some limitations. It won't be scalable; you will have data losses if there are any errors; and there is a single point of failure at BigQuery: if some problem happens there internally, or some network problem happens, you might lose data, and there is no backup mechanism in that case.

So I came up with this slightly improved, still naive, version of the architecture, where you use task queues and at the same time back your files up to Google Cloud Storage. Google Cloud Storage is quite cheap, and you can store all your files there. As for task queues: if I had to choose one superstar of this architecture, it would be the task queue, because it really helps you design scalable systems. It is highly configurable; you totally govern how your data goes into the system, how fast (you control the velocity), and how frequently your data is processed, which we will see shortly.

These are the top three benefits of task queues. First, they take the heavy work out of your request/response life cycle: if you have a lot of requests coming in and you don't want your users to wait, you just put everything into the queue and Google App Engine takes care of doing the heavy work in the background. Second, we will be using push queues, which are integrated into App Engine's worker threads. That means you don't have to do any fancy coding: you just specify the configuration and a handler path, and Google App Engine automatically takes care of it; you don't pay for extra resources to use task queues, because they become part of your App Engine app. Third, they are configurable and scalable. A minimal sketch of this pattern in code follows.
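To make the push-queue pattern concrete, here is a minimal sketch on App Engine standard (Python 2.7 with webapp2): the front-end handler enqueues a task instead of doing the work, and the push queue POSTs it back to a worker path on App Engine's worker threads. The /collect and /worker paths and the payload fields are assumptions for illustration, not the exact code from the talk.

```python
# Minimal push-queue sketch (assumed paths /collect and /worker).
import webapp2
from google.appengine.api import taskqueue


class CollectHandler(webapp2.RequestHandler):
    def post(self):
        # Do no heavy work here: enqueue the payload and respond immediately.
        taskqueue.add(url='/worker', params={'data': self.request.body})
        self.response.write('queued')


class WorkerHandler(webapp2.RequestHandler):
    def post(self):
        # The push queue POSTs tasks here at the rate configured in queue.yaml.
        data = self.request.get('data')
        # ... heavy processing (e.g. streaming into BigQuery) goes here ...


app = webapp2.WSGIApplication([
    ('/collect', CollectHandler),
    ('/worker', WorkerHandler),
])
```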
Before you get into actual coding, there are a few things you need to do when working with Google App Engine and Google Cloud Platform. You create a project first, enable billing, and get access to your service account. By that I mean you get access to all the services available on Google Cloud Platform: you get the private key and convert it into PEM format, because the code I'm going to show you later works with a PEM key, and then you place your key in the project directory, or any other directory you like. Then you install all the dependencies, and you can download the Google App Engine Launcher, which is available for Windows and Mac, and I think for Linux as well. With it you can quite easily browse to the project, configure the ports, and you're pretty much ready to rock and roll.

This is the basic structure of every Google App Engine application. You have an app.yaml file, which holds all the configuration: the name of the application, which version it is, and other settings such as libraries. Then you have a cron.yaml file, which specifies any scheduled tasks you want to run every hour, minute, day, and so forth. queue.yaml is the configuration file for the task queue, and it is basically the only configuration you do for a task queue: a small piece of configuration that says how frequently your tasks run and at what velocity, and that's all. Then there is your source code, the entry point of your application, and a requirements.txt file listing the libraries you need.

I will just skim through these .yaml files; the comments in them say what they are doing. I think the most interesting part is how automatic scaling works. In automatic_scaling you can specify the maximum number of idle instances you want for this version (in Google App Engine you have app versions; here we have version one). My configuration allows a maximum of one idle instance, because you may want to tune how many instances you keep around based on your billing needs. Then max_pending_latency is the maximum time a request waits before App Engine starts a new instance, because an App Engine application scales automatically as the need arises. These two options let you control how aggressively your application expands. Then the handlers specify the starting point of your application, and the libraries section lists the libraries your application uses.

Similarly, cron.yaml is just a simple configuration for your scheduled tasks. In this case we have an hourly backup that hits the backup URL every 60 minutes, from 8 a.m. to 8 p.m., which means it runs every hour on the clock hour. The time now is 7:16, so it will run at 8 p.m. Then there is the queue file. I will spend a bit more time on task queues, because I feel this is quite important; it took me quite a while to understand how to really leverage them, and it's all about the details here. This is the queue configuration. Every Google App Engine project has one default push queue, and I'm going to use just that, but you can always create new queues and name them the way you want. A sketch of all three files is below.
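For reference, here is a minimal sketch of the three configuration files described above, using the values mentioned in the talk (one idle instance, an hourly backup between 8 a.m. and 8 p.m., a rate of 500/s with a bucket size of 100 and a retry limit of 50). The application name, script name, and /backup URL are assumptions for illustration.

```yaml
# app.yaml (sketch; application and script names assumed)
application: my-analytics-backend
version: 1
runtime: python27
api_version: 1
threadsafe: true

automatic_scaling:
  max_idle_instances: 1        # keep at most one idle instance for this version
  max_pending_latency: 30ms    # wait at most 30ms before starting a new instance

handlers:
- url: /.*
  script: main.app

# cron.yaml (sketch)
cron:
- description: hourly backup of the in-memory log buffer to Cloud Storage
  url: /backup
  schedule: every 60 minutes from 08:00 to 20:00

# queue.yaml (sketch)
queue:
- name: default
  rate: 500/s              # process up to 500 tasks per second
  bucket_size: 100         # up to 100 tokens available for bursts
  retry_parameters:
    task_retry_limit: 50   # retry a failed task up to 50 times
```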
So I'm just going to use the default queue here, and I'll specify the rate, which is 500 per second, the bucket size, which is 100, and the retry parameters. In a perfect world we would assume that every time a task goes from the queue to the database it works, but there are cases where things won't go the way you imagine. That's why you put in a safety net like this retry limit: if a task fails, it will keep being retried. I just put 50 here as a fairly arbitrary number; you can configure it however you want.

So what is the rate? The rate is how often tasks are processed on this queue: 500/s means it can process 500 tasks per second. The bucket size is the maximum number of tokens the bucket holds, so we have a bucket that can hold 100 tokens, and a task can execute every 1/500th of a second. It works on a token bucket algorithm, where a task can only proceed once it gets a token.

I will try to explain it with this timeline diagram, which I took from one of the Google Developers videos. Say we have a rate of one per second and a bucket size of five, and suddenly we have seven tasks to run. Our bucket size is five, so the first five tasks run straight away. Since our rate is one per second, we then have to wait one second to get another execution token (the green dot is the execution token). At the end of one second I get a token, so the sixth task runs; then I wait one more second and the seventh task runs. Similarly, if my rate is two per second, I get execution tokens twice as often: if you notice, at half a second I get one token and can execute from my bucket again, and I get another at the end of the second. Using this, you can configure the application around how frequent your data needs are, and tune its performance accordingly.

OK, enough talk. Let's look at the code. This is a sample application I created, with the app.yaml, cron.yaml, and queue.yaml files exactly as I just showed. Can you guys see the code? Zoom in; OK, I think this old version of PyCharm doesn't let me zoom for some reason, just give me a second and I'll quickly change it. Can you see it now? OK, cool. So this is the queue file we just discussed, and this is the main Python file. You import all the libraries you need. We're using a requirements file, where you specify all the dependencies you need, just like you would in a Node.js application, and then you run a pip install command into a directory: I installed all my dependencies into a lib folder, append it to the system path, and then I have access to all the libraries I need.

Next, I specify some retry parameters for Google Cloud Storage. If you are writing files to Google Cloud Storage, this is quite important, because a file write is I/O and can sometimes fail; with retry parameters, the write is retried if something goes wrong. A minimal sketch of this setup is below.
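Here is a minimal sketch of those two pieces, assuming the App Engine cloudstorage client library (GoogleAppEngineCloudStorageClient) and a lib/ vendor folder; the specific retry values are assumptions, not the ones from the talk.

```python
# Vendor third-party libraries into ./lib
# (installed beforehand with: pip install -t lib -r requirements.txt)
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), 'lib'))

import cloudstorage as gcs

# GCS writes are I/O and can fail, so retry them with exponential backoff.
retry_params = gcs.RetryParams(initial_delay=0.2,
                               backoff_factor=2,
                               max_delay=5.0,
                               max_retry_period=15)
gcs.set_default_retry_params(retry_params)
```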
Then there is your project information: your service account email (whenever you create a project on Google Cloud Platform you can create an authorization email, which you put here), the project ID, and the data set. You have to create the data set in BigQuery manually; I have not tried creating it from code, though I think there should be a way to do it with the BigQuery library. In our case we just defined a data set called test.

Then there is the schema file: the schema you want the table to have in BigQuery. I'll quickly open this file. In our case we want to store this kind of information: say a session ID, which is of type STRING and is a required field; a visitor ID, also a STRING; and then some hierarchical data, like a document record that stores a title and then a URL, which is another nested RECORD type. You specify these types, and based on them BigQuery automatically creates the table for you. This is pretty much what our JSON looks like, very simple; I specify it here. Then there is my private key, which I got from my service account and placed here.

This is the main collect function, the first part of our application, which collects the data posted to this API from some application. From the request I get all the information, such as the account ID. This application creates one table per account: think of it as having installed Google Analytics on your website and being given an ID like XYZ; I want to create a table for that XYZ and keep streaming all the information into it, and somewhere my data crunchers will churn out insights from that data. So I take that information, encode it into JSON, and immediately write it into a cStringIO object. cStringIO is Python's C-implemented string buffer library; it is quite high-performing, and I tested it with quite a lot of data and it worked really well, but you can always use another library if you want.

Then, if I have the account ID, I use my private key to get credentials for BigQuery through the SignedJwtAssertionCredentials function. Once I have the credentials, I can get the client, which is the BigQuery service object for my project ID. Then I do a quick sanity check on whether the table exists, because if you try to insert into a table that doesn't exist, BigQuery just fails silently and you won't know about it; you can only see it in the Stackdriver tool if you go looking. So I check whether the table exists, and if it does not, I create it with the client under the data set I specified (I'll go into BigQuery a bit later and show you how the structure looks). Then I print some information for myself, return the JSON data in the response just so I can test, and add the task into the queue with the data I have. A sketch of this collect flow is below.
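Here is a minimal sketch of that collect flow, assuming the oauth2client and google-api-python-client libraries of that era; the handler shape, helper names, service account email, and the nested schema field are reconstructed from the talk and partly assumptions.

```python
import json
import httplib2
import webapp2
from apiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials
from google.appengine.api import taskqueue

PROJECT_ID = 'my-project'   # assumption
DATASET_ID = 'test'         # data set name from the talk
SCOPE = 'https://www.googleapis.com/auth/bigquery'

SCHEMA = {'fields': [       # schema as described in the talk
    {'name': 'session_id', 'type': 'STRING', 'mode': 'REQUIRED'},
    {'name': 'visitor_id', 'type': 'STRING'},
    {'name': 'document', 'type': 'RECORD', 'fields': [
        {'name': 'title', 'type': 'STRING'},
        {'name': 'url', 'type': 'RECORD', 'fields': [
            {'name': 'href', 'type': 'STRING'},  # assumed nested field
        ]},
    ]},
]}


def get_bigquery_client(pem_key):
    # Authorize with the service account's PEM key.
    credentials = SignedJwtAssertionCredentials(
        'service-account@my-project.iam.gserviceaccount.com',  # assumed email
        pem_key, scope=SCOPE)
    return build('bigquery', 'v2', http=credentials.authorize(httplib2.Http()))


class CollectHandler(webapp2.RequestHandler):
    def post(self):
        account_id = self.request.get('account_id')
        data = json.dumps(dict(self.request.params))
        client = get_bigquery_client(open('privatekey.pem').read())
        tables = client.tables()
        try:
            # Sanity check: streaming inserts into a missing table fail silently.
            tables.get(projectId=PROJECT_ID, datasetId=DATASET_ID,
                       tableId=account_id).execute()
        except Exception:
            # One table per account, created under the data set.
            tables.insert(projectId=PROJECT_ID, datasetId=DATASET_ID, body={
                'tableReference': {'projectId': PROJECT_ID,
                                   'datasetId': DATASET_ID,
                                   'tableId': account_id},
                'schema': SCHEMA,
            }).execute()
        # Hand the heavy work to the push queue and respond immediately.
        taskqueue.add(url='/worker', params={
            'data': data, 'dataset': DATASET_ID, 'table': account_id})
        self.response.write(data)
```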
With the task queue you just add a URL and specify some additional parameters; here I pass the JSON data, the data set, and the table name. So let's look at what this worker route does, which is the most interesting part. I get all the information: the data, the data set name, and the table name. I again use my key to get the credentials, authorize them, get the BigQuery object, and pass it to this function called stream_row_to_bigquery. What this function does is create a row object with an insert ID, the row being the data, plus ignoreUnknownValues, a flag for whether to ignore unknown values or not. What I did here is create the insert ID as a unique ID, so that if something fails and is retried, the data is not duplicated: if you don't specify it, retries will duplicate the data, but if you specify an insert ID that is unique per row, BigQuery automatically avoids duplicating the data for you. Then I just use BigQuery's insertAll function, which is what performs the streaming of the data.

That's all our backend has to handle. It takes the request, dumps the data into the output buffer object for backup, creates a table if required, passes the work to the queue, and forwards the response. Our backend does no heavy processing; it just takes a request and puts it in the queue, so all the heavy work is handled by the queue and executed on Google App Engine's worker threads.

Now, we haven't talked about how the backup part is done. Since I've been writing all the information into this cStringIO object, the cron job that runs every hour hits this backup route. There I create a file name using the date and time, use the default bucket, write the last hour's log file by dumping the output object into it, close it, and renew the output object so that it can collect logs for the next hour. A sketch of the worker and backup routes follows.
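Continuing the previous sketch (reusing PROJECT_ID and get_bigquery_client from it), here is a minimal version of the worker and backup routes; the buffer name, file naming scheme, and handler paths are assumptions.

```python
import datetime
import json
import uuid
import cStringIO
import webapp2
import cloudstorage as gcs
from google.appengine.api import app_identity

OUTPUT_BUFFER = cStringIO.StringIO()  # in-memory log buffer (assumption)


def stream_row_to_bigquery(client, dataset_id, table_id, row):
    # A unique insertId lets BigQuery de-duplicate rows when a task is retried.
    body = {
        'rows': [{'insertId': str(uuid.uuid4()), 'json': row}],
        'ignoreUnknownValues': True,
    }
    return client.tabledata().insertAll(
        projectId=PROJECT_ID, datasetId=dataset_id,
        tableId=table_id, body=body).execute()


class WorkerHandler(webapp2.RequestHandler):
    def post(self):
        # The push queue delivers the parameters set by the collect handler.
        row = json.loads(self.request.get('data'))
        dataset_id = self.request.get('dataset')
        table_id = self.request.get('table')
        client = get_bigquery_client(open('privatekey.pem').read())
        stream_row_to_bigquery(client, dataset_id, table_id, row)


class BackupHandler(webapp2.RequestHandler):
    def get(self):
        global OUTPUT_BUFFER
        # The default bucket is usually named <project>.appspot.com.
        bucket = app_identity.get_default_gcs_bucket_name()
        filename = '/%s/logs-%s.json' % (
            bucket, datetime.datetime.utcnow().strftime('%Y%m%d-%H%M'))
        gcs_file = gcs.open(filename, 'w', content_type='application/json')
        gcs_file.write(OUTPUT_BUFFER.getvalue())
        gcs_file.close()
        OUTPUT_BUFFER = cStringIO.StringIO()  # renew the buffer for the next hour
```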
Right, let's try to run this program. I created a quick shell script that runs a curl command as many times as you specify in a for loop, so I'll quickly fire off multiple requests and then we'll see whether the data reaches BigQuery and whether a backup is created. I'll run this command, say, 100 times; in another tab I'll run it 50 more times (it's just echoing the response output to me); and at the same time, let's go crazy, 20 more. Right now it looks a bit slow because curl waits for each response; if you fire these from JavaScript or any other asynchronous client it is close to instantaneous, because you don't wait for the response. Now, when I go to Google App Engine I can go to the task queue: my default queue has tasks running, 17 completed in the last minute, one task in the queue.

And what does a task look like? We have Stackdriver logging, which shows what was done, and you can see the response was 200, so everything should be fine there. Now I'll go to BigQuery and we can see whether the data is there. I'll run this quick query. If you use SELECT * in BigQuery, it will tell you that the query will be charged quite a lot, so I always prefer to select columns wisely. So I run the query, and all the data is here. There is even a timestamp; let me order by timestamp, descending. If you notice, the time is all the same. Ah, I know why: we are adding the timestamp hard-coded here, so it shows the same value. But as you can see, the number of rows is increasing, so the entire backend is working as we expect.

Our cron task will run at 8 p.m., but we can always force it to run right now: I'll just open the backup route, and it should back up for me manually. OK, it should have worked; I'll go to Cloud Storage. We are using the default bucket, which is generally named projectname.appspot.com. The time is 19:32, and this is the file we just used for the backup; it backed up everything here. So this is what we expect our backend to do, and everything looks fine.

I think I will be sharing these slides somehow later on, and these are some references which I think are quite good for getting started. I hope this was useful for you guys. That's pretty much it from me, unless you have some questions. Thank you.

[Audience] Where exactly does that output buffer reside? In RAM?
[Rahul] Yes, it resides in RAM.
[Audience] Have you ever had it overflow or kill your instance?
[Rahul] I did not, because the RAM specified for it was quite high, and while I made the interval one hour for this demo, back then we renewed the buffer every 15 minutes. Another way to do it, and I think it's a very good question, is to base it on size: if the buffer grows by some amount X, you flush it. I think there is a library in Python for this; I don't have the exact name, I'll find it and share later, which lets you write logs on a time basis or a size basis, so you can change the approach and make it size-based.
[Audience] Do your App Engine instances ever die without warning, on some server somewhere? The reason I ask is that if one is gone, the backup is gone for that hour. Is that data loss an important problem or not?
[Rahul] It is important, obviously. Generally, in App Engine, once a node dies it spins up another one, because you specify automatic scaling: if one instance dies, another instance is spun up automatically. But it might still happen that for a few seconds, while it is switching over, something is lost.
[Audience] The RAM won't be carried over.
[Rahul] Right, the RAM object won't be carried over, that's true. I think that's the one limitation of this approach.
[Audience] I was just wondering whether there is any Google product that allows streaming storage, just like you streamed the data into the database: some way to stream file information, so you wouldn't need to maintain a RAM buffer. Do you happen to know of anything like that?
[Rahul] I don't really, no. Unless... I think you might have it, right?
[Devin] Hi everyone, I'm Devin. I'm actually on the Google Cloud team, on the technical side, based here in Singapore. The question is about RAM, right: could you run out of resources on a machine, and what alternative Google products are there? One product we have is called Dataflow. Dataflow is a different design pattern from what Rahul was presenting here: the concept is that you just build a pipeline, a script, just code, which lives in the cloud. As your inputs come in and increase, say a message queue with millions of queries per second pumping in, and then you get a burst, an elastic compute infrastructure spins up. Google will actually procure VMs and storage for you so that your pipeline is elastic. So you can build a global-scale streaming stack from the ground up, and that's Dataflow.
[Rahul] Devin, how would it work with my application? Does it work with Google App Engine itself, or is Dataflow separate? I've never worked with it before, so I'm curious.
[Devin] Yeah, good question. It's a different design pattern: with Dataflow you're building a batch and a stream pipeline, so you're building things a different way. Google created Dataflow in-house, from a product we had internally called Dremel and some other tools, and we open-sourced the model as Apache Beam. So if you're familiar with Java and Beam, that's what Dataflow is. You build up PCollections, which are wrapper objects that organize your entries on the fly, but the compute is behind the scenes, so you don't have to care about VMs at all. You just focus on your code, focus on those pipelines, get the data into a database at scale, and then do the really fun parts: building dashboards, machine intelligence, all the exotic stuff. Don't worry about the pipelines.
[Rahul] Okay, I'm going to check it out later. Any other questions, guys? Any last questions? I have stickers.
[Moderator] Okay, thank you, Rahul.
[Audience] Oh yeah, so how does the task queue actually help to scale up your infrastructure? Is it because you can control the rate limit and the bucket size limit?
[Rahul] How can the task queue help scale your infrastructure? The overhead is moved out of the main application, and it can be deferred, so you don't have to process all the requests immediately. Suppose you are working somewhere like a bank and there is a really long queue with a lot of people coming in: what you do is let part of the processing be handled by worker threads, which App Engine provides, and at the same time you control how they behave. That's the basic concept of why a task queue helps you build a scalable application.
Because if a lot of requests are coming in, you don't have to worry about how you will handle them: you can always put all the heavy load into the task queue, just serve the request, and quickly respond to the client.
[Audience] Thanks.
[Rahul] You're welcome.
[Audience] I want to ask a question related to what the gentleman explained about Dataflow, but I'm not using Dataflow. I'm using another approach, where I use a Pub/Sub trigger: when data arrives, the Pub/Sub trigger fires a Cloud Function. I want to know how reliable this mechanism is. Will every message get processed by the Cloud Function?
[Devin] Sure, good question, and maybe it also relates to the other gentleman's question. To Rahul's point, the whole concept of task queues is part of this bigger concept of messaging queues. The old way of building a stack at scale was to just buy machines, a bunch of bare metal, a bunch of VMs; you spin them all up, they generate a lot of data, and they all dump into a database independently. That worked, and it was probably the way you built things until maybe two or three years ago. Now that everyone is on the internet, with more and more people hitting your services, you have to decouple. The idea is decoupling from your VMs, because they die, they break; you can't really rely on them, or on the RAM inside them, to still be there when something comes in and needs to get out. This whole design pattern is about building messaging queues where your VMs can die and it's okay, because as long as they get the message out, the message lands on a data plane, almost like an electrical bus bar. Once the packet gets on there, it will eventually make it to the database and get stored. The key is: just get the data out, let it ride on that system, and land it in your database. That's a big part of this talk; this is the new way you build things. If you want to hit scale, the concept of messaging queues is helpful.
Now, your question about Pub/Sub. For those who don't know, Pub/Sub is essentially a messaging queue that Google provides. Google Search, Gmail, any Google product you use, you're running on top of Pub/Sub; it's the service we use ourselves, and it's also available as a cloud product. Your question is: you're already building on Pub/Sub, but you want to make sure a message that hits Pub/Sub lands at its destination at the right time. Pub/Sub guarantees different delivery patterns, one being at-least-once processing, I believe. There are different concepts in streaming, and through Pub/Sub you can get different kinds of delivery guarantees. But the general idea you can build around is that, as long as you integrate the Pub/Sub library and get the message out, Pub/Sub provides the guarantee that your message will eventually hit the message queue and eventually arrive at whatever services you have listening on the other end. You send things into Pub/Sub, Pub/Sub moves your messages around globally, and drops them off at whatever services are listening on the other end.
[Devin] Long answer, and it doesn't fully cover the Cloud Functions part; I can talk to you after. But Google Cloud Functions and Google Cloud Pub/Sub do integrate, and there is high reliability between the two. So you can build a function that uses Pub/Sub to then call another service: Google Cloud Storage, machine learning, some of the other things we have.
[Moderator] Okay, thank you Rahul, thank you Devin.