So he's going to be talking about BigQuery on Google Cloud Platform and how you can make it easier to get extreme analytics from it, so yeah, I'm going to let him take it away.

Okay, thank you everyone. I'm King Chup, an architect from Vietnam, and for the previous year I've been working as a backend developer, mostly on analytics and API development. I'm also a member of Open2VN, an open-source community in Vietnam where people share their knowledge and their passion for open-source software. It has been a long time since my last talk, so I'm a little bit nervous right now, but I hope I can give you my best.

So what is this presentation about? Over the last few days you might have heard about Google BigQuery and what it can do: people have put blockchain data into BigQuery, or run analytics on the GitHub data loaded into BigQuery. In this talk, though, I want to share my experience of using BigQuery to build a near-real-time analytics system for my company. BigQuery is not a database, it's a data warehouse, and it's not open source either; it's a commercial product from Google. But in the open-source spirit, I want to share this openly.

First, I want to share a little about my experience working with Google Datastore. We are a startup, and my boss wanted to build a product fast, one that is extremely easy to manage, so we decided to use Google App Engine, NDB, and Datastore for that purpose, and it's really fast.
I mean that we could write code, immediately deploy it to production, and make changes. We did need to handle some things around transactions and consistency, but we could overcome that easily. Then one day my boss asked, because we have customers, how many users are currently on our system. It sounds like an easy question; I just need to run a count on our database. But Datastore has a limitation here: when NDB does a count, it actually fetches all the data and counts it one by one; it doesn't support any aggregation. So at that scale we had to write very complex queries, because with NoSQL you have to write code, there are no relational joins like in a SQL database, and aggregation is a nightmare. Not only with Datastore: if you use MongoDB, writing a count or anything involving aggregation takes time, and even with MapReduce, writing that code is very difficult once relationships are involved. We could use Redis and caching for counters, but if the boss asks a question like "how many users in the system have more than three mutual friends?", that is very difficult to write as code. So we kept spending time writing Datastore reports, and at that point I gave up and looked for another solution for our system, and I came up with BigQuery.
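To illustrate why that kind of question hurts without joins or aggregation, here is a minimal sketch in plain Python (the user names and friendship data are made up for illustration): answering "who has more than N mutual friends with someone?" takes a pairwise scan in application code, whereas in SQL it is a single self-join with a GROUP BY.

```python
from itertools import combinations

# Toy friendship data (hypothetical): user -> set of friends.
friends = {
    "an":   {"binh", "chi", "dung", "em"},
    "binh": {"an", "chi", "dung", "em"},
    "chi":  {"an", "binh", "dung", "em"},
    "dung": {"an", "binh", "chi"},
    "em":   {"an", "binh", "chi"},
}

def users_with_many_mutuals(friends, threshold):
    """Users who share more than `threshold` mutual friends with someone.

    Without joins or aggregation this is a pairwise scan in application
    code: O(n^2) pairs, each requiring a set intersection.
    """
    result = set()
    for a, b in combinations(friends, 2):
        mutual = friends[a] & friends[b]
        if len(mutual) > threshold:
            result.update((a, b))
    return result

# In SQL (e.g. BigQuery) the same question over a
# friendships(user_id, friend_id) table is roughly one query:
#   SELECT a.user_id, b.user_id
#   FROM friendships a JOIN friendships b
#     ON a.friend_id = b.friend_id AND a.user_id < b.user_id
#   GROUP BY 1, 2
#   HAVING COUNT(*) > 3
```

The point is not that the Python is long, but that every new question from the boss means new code like this instead of one ad-hoc query.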
So the advantage of BigQuery is that people don't need to learn Python or MapReduce to write reports: anyone who knows SQL, even a non-engineer data analyst, can use SQL to get the data out, and it is extremely fast compared with Datastore because everything is handled by Google themselves; we just need to make sure we can get the data from Datastore into BigQuery. The cost is also very affordable: compared with a standalone SQL server, BigQuery charges only for how much data you scan.

But for us, the first question was how to get the data into BigQuery. In the previous presentations people talked about how to use BigQuery to query data, but the first thing you must do is get the data in; BigQuery is not like a database you write to directly, it is much more like a warehouse, so we need to load the data into it. In the first version of our system we used Cloud Datastore directly as the application database, and our application had a cron job that triggered a backup into Google Cloud Storage about once an hour. We configured object change notification on Cloud Storage, so it notified our application whenever there was a new backup file, and the application would then trigger an import from Cloud Storage into BigQuery. At that time we also had an open-source dashboard for visualization of the charts and the data. But we had a problem: loading data from a Google Datastore backup took about one hour for the whole process. The backup operation also hurt the performance of our application badly, because it costs reads on the Datastore and blocks other operations on our system, so we could not run it more frequently. And the boss wants to know the numbers right now, not from one hour before.
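The import step of that first pipeline can be sketched roughly as follows, assuming the current google-cloud-bigquery Python client (the bucket, file names, and write disposition are illustrative, and the `.backup_info` suffix is an assumption about the old Datastore Admin backup format):

```python
def gcs_uri(bucket: str, name: str) -> str:
    """Build the gs:// URI from an object-change-notification payload."""
    return "gs://{}/{}".format(bucket, name)

def is_backup_metadata(name: str) -> bool:
    """Datastore backups are loaded into BigQuery through their metadata
    file; the exact suffix depends on the export flavor (assumed here to
    be the old `.backup_info` style)."""
    return name.endswith(".backup_info")

def import_backup(project, bucket, name, dest_table):
    # Imports deferred so the helpers above stay dependency-free.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
        # Each load replaces the table, so queries always see one
        # consistent snapshot of the latest backup.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_uri(
        gcs_uri(bucket, name), dest_table, job_config=job_config)
    job.result()  # block until the hourly load finishes
```

The notification handler would call `import_backup` only when `is_backup_metadata` matches, ignoring the raw data files in the backup.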
So I had two ways: the easy way, finding another boss, and the hard way, finding a solution. At that time I did some research on how to load data into BigQuery in real time, and BigQuery supports an API to insert data row by row: when you have new data, you just push it into BigQuery and it is available immediately for query. Not for other tasks like export, but for query it is available immediately. So in our case, whenever our application makes a change to an entity in Cloud Datastore, we also publish an event to Google Cloud Pub/Sub. Pub/Sub guarantees that our message will be delivered, but it may be delivered twice or many times, because it has a retry mechanism. We built a data-streaming worker on Kubernetes, in Python; it is very simple code that inserts data into BigQuery. That brought another problem with real-time inserts: streaming supports insert only, we cannot update or delete any record in the dataset, because they restrict it to inserts for faster data loading. We also need to remove duplicated records, because Pub/Sub can send a message twice or many times due to retries on the network, and we had to resolve that. My worker is just simple code that pulls a message from Cloud Pub/Sub, parses the data, pushes it into BigQuery, and then acknowledges the message back to Pub/Sub. After that I found the solution: we just needed to combine loading data from the daily backup with streaming for the real-time data. There were some problems we needed to resolve first. We need to combine the two datasets: for example, for a query like getting the number of users on the system, we need to run it over both the real-time dataset and the backup dataset, and combine them together using a UNION ALL.
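The streaming worker I described is roughly the following sketch (the event payload shape, table name, and subscription name are placeholders, not the real ones): it pulls a message from Cloud Pub/Sub, reshapes it into a BigQuery row, streams it in with `insert_rows_json`, and only acknowledges the message after a successful insert.

```python
import json

def to_row(message_data: bytes) -> dict:
    """Turn a Pub/Sub message body into a BigQuery row.

    The payload shape here is hypothetical: we assume the application
    publishes JSON with entity_id / kind / updated_at fields.
    """
    event = json.loads(message_data.decode("utf-8"))
    return {
        "entity_id": event["entity_id"],
        "kind": event["kind"],
        "updated_at": event["updated_at"],
        "payload": json.dumps(event.get("payload", {})),
    }

def run_worker(project, subscription, table):
    # Imports deferred so the pure helper above stays dependency-free.
    from google.cloud import bigquery, pubsub_v1

    bq = bigquery.Client(project=project)
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(project, subscription)

    def callback(message):
        row = to_row(message.data)
        # Streamed rows are queryable almost immediately, but cannot be
        # updated or deleted, which is why deduplication has to happen
        # at query time.  row_ids also drives BigQuery's own
        # best-effort dedup of retried inserts.
        errors = bq.insert_rows_json(
            table, [row], row_ids=[row["entity_id"] + row["updated_at"]])
        if not errors:
            message.ack()  # only ack after a successful insert

    subscriber.subscribe(sub_path, callback=callback).result()
```

Because the ack happens only after the insert succeeds, a crashed worker causes redelivery rather than data loss, at the price of more duplicates to clean up later.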
We also need to deduplicate by ID and keep only the most recently updated version of each row, so that when data changes we drop the old row and keep only the new one. For deletions we have to wait about one day, until we have a full backup of the system. We also need a policy to expire the data, so the real-time dataset keeps only about one or two days of data; since it receives every change on the system, we keep it very small for faster queries. There is a best practice from Google for exactly this: how to query the real-time dataset and remove the duplication. And that is how we get all the data in near real time; I call it near real time, not truly real time.

As for the result, we use Data Studio for the visualization, the charts and the reports. Data Studio is still in beta, but it has a fancy UI, it is easy to use, and it has a lot of tools, so my boss loves it; we kept it as our dashboard and it replaced the open-source dashboard software we had before. So that is my experience of using BigQuery to build a near-real-time analytics system. This is just a lightning talk about how to use it; if you have any concerns, Google's documentation covers the best practices and how to build a full system. Now, does anyone have a question about using BigQuery? If you joined this session you already know about databases and how to build reports, and with two years of using BigQuery and Google Cloud Platform I think I can discuss it with you, except for the people actually working on Google Cloud Platform, compared to whom I have only a small experience.

Q: Did you have any duplication issues in your real-time stream?

A: Yes, we did have duplication issues with Pub/Sub.
Pub/Sub sometimes delivers a message twice instead of once. For the duplication, rather than removing rows from BigQuery, I wrote a SQL query that returns the most updated data: if there are two rows with the same ID, I take only the most recent one. That way we overcome the duplication problem and we also always read the most updated version of each row, instead of an old one.

Q: And on the other side, missing data?

A: Yes, we had some missing data too. It seems to be a problem on the BigQuery side rather than from Pub/Sub: sometimes the insert process has put the data into BigQuery and we can see it from the dashboard, but it is not immediately available to queries; it takes a few minutes for the data to become available. That is why I call this a near-real-time system, not a real-time system.

Q: And what were the error rates?

A: I don't remember exactly, but the duplication was only about two to five percent, which is not too much for our system. Is anyone else interested in BigQuery and how to use it in an application? Any questions?

Q: What would be an alternative to BigQuery, if you wanted to evaluate one?

A: As an alternative to BigQuery: before BigQuery I actually used Hadoop, with Hive, for big-data processing as a data warehouse, and Hive gives you SQL-like queries over it. Compared with that, BigQuery comes at a very affordable cost. With Hadoop and Hive we had to set up a cluster and keep it running for our data, but with BigQuery we only pay for how much data we process, and each time we run a query they do the caching and the optimization for us; with Hadoop or Hive we have to take care of that ourselves. The time is not guaranteed either: our data is not very huge, but Hadoop might take a few minutes to return a result, while BigQuery takes a few seconds. So I think the price is more affordable. People also mention Redshift, the service from Amazon.
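The deduplication query mentioned a moment ago follows the keep-the-latest-row pattern from Google's BigQuery documentation: union the streamed rows with the daily backup, then keep only the newest row per ID using `ROW_NUMBER()`. The dataset, table, and column names below are placeholders; the small pure-Python function mirrors the same "latest row wins" semantics.

```python
# Placeholder table names; a real project would parameterize these.
DEDUP_SQL = """
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY entity_id
                            ORDER BY updated_at DESC) AS rn
  FROM (
    SELECT * FROM `project.analytics.daily_backup`
    UNION ALL
    SELECT * FROM `project.analytics.realtime_stream`
  )
)
WHERE rn = 1
"""

def dedup_latest(rows):
    """Same semantics in plain Python: for duplicate entity_ids,
    keep only the row with the greatest updated_at."""
    latest = {}
    for row in rows:
        key = row["entity_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: r["entity_id"])
```

Running every report through a view like `DEDUP_SQL` is what turns an insert-only stream plus duplicated deliveries into a consistent, most-recent snapshot.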
In my opinion, compared with Redshift on AWS, BigQuery is more suitable for small startups: the data is not too big, and you don't need to manage any infrastructure to use the data-analysis service. With Redshift you need to think about how many worker instances you need to run the cluster, and you need to manage the data and the structure of the data in Redshift yourself. So from my point of view, BigQuery is better for a small startup in terms of cost efficiency.

Any more questions? Any more questions? Okay. Well, thank you.

Thank you for speaking.