Thank you very much. So welcome, everyone, and I really appreciate you spending some of your Saturday here with us. I can talk about stream analytics all day long, but we won't today; we'll keep it to 40 minutes. What I'm going to do in this session is actually two parts. The first part is part of a session that was run at San Francisco Next, where the product manager for Dataflow, Sergey, went through some of the new features. For the second part, we're going to run through a full end-to-end demo that was available at the San Francisco Next demo booths. So it is a bit tight on time, but we'll see how we go. Okay, so starting with stream analytics. What is stream analytics? For me, it is the continuous processing of data without having to materialize all of it first into a batch format. Now, it's a little bit of an oversimplification, but there are three broad categories of processing that I consider. The first one is transport. That is just taking a stream of data that's arriving, for example from IoT devices, and moving it to files or to your data warehouse, for example BigQuery. The next category is where you're actually doing some enrichment of that stream. So imagine you have a point-of-sale system in your stores, and it's sending information as you're making sales. With that, you have a product ID, and in the stream you join that product ID with the product name and then land that denormalized value into your tables. The final piece, which is the bit I get very passionate about, is doing in-stream analytics. That's where you're actually doing aggregations, computations, and statistical functions on the stream as it's happening, and connecting that to the different parts of the business. So with those three categories in mind, what we're going to do is first go through the options that you have to build out a typical streaming pipeline within GCP. First of all, let's think about the data sources. We mentioned IoT devices already; by that, imagine a factory full of sensors, where every machine is sending information continuously about its state. The other very common one is clickstream, and by clickstream we mean the events that are happening as, for example, users click through a mobile application: add to cart, et cetera. All of these events are clickstream being sent, and of course the same is true for web applications. So if you imagine this continuous stream of data flowing through, the first thing we need to do is absorb this data in some way. Some of the common tools: Apache Kafka would be a choice, as well as Google Cloud Pub/Sub, which is a fully managed publish-and-subscribe API. We'll talk more about that a little bit later. Once you've absorbed this data, you then need to do the processing, so the three categories I described. The way we're going to do that today is to use Apache Beam as the programming model to describe the pipeline, to describe the transformations we want to do on that stream, and then we're going to use Dataflow as the execution engine for the pipeline we designed. We could make other choices as well: we could use Apache Flink or Spark to run that programming model, and we'll go through that in more detail too. Downstream of this, we now need to land the data. Quite often, this will be into a data warehouse like BigQuery.
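For reference, here is a minimal sketch, in Beam's Java SDK, of that overall shape: ingest sale events from Pub/Sub, enrich them in flight against a small product lookup supplied as a side input, and land the result in BigQuery. The topic and table names, the one-row inline product table, and the placeholder parsing are hypothetical stand-ins, not code from this talk.

```java
import java.util.Map;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionView;

public class EnrichSales {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Reference data: product id -> product name. A real pipeline would read
    // this from BigQuery or a database rather than hard-coding one entry.
    PCollectionView<Map<String, String>> products = p
        .apply("ProductRef", Create.of(KV.of("p-123", "Organic bananas")))
        .apply("AsMap", View.asMap());

    p.apply("ReadSales", PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/sales"))   // hypothetical topic
     .apply("Enrich", ParDo.of(new DoFn<String, TableRow>() {
        @ProcessElement
        public void process(ProcessContext c) {
          Map<String, String> lookup = c.sideInput(products);
          // Placeholder: real code would parse the JSON event, pull out the
          // product id, and attach the looked-up name to the row.
          c.output(new TableRow()
              .set("raw_event", c.element())
              .set("product_name", lookup.getOrDefault("p-123", "unknown")));
        }
     }).withSideInputs(products))
     .apply("LandInBQ", BigQueryIO.writeTableRows()
            .to("my-project:retail.sales_enriched")  // assumes the table exists
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```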
Depending on the use case, you may actually want to split where you deliver the results. In other words, you may want to put them into your data warehouse, but you may also want to put them into something like Cloud Bigtable. One example would be time series, where we want the time series to go to BigQuery, but there are use cases where we want that time series data to also be available in a key-value store. Beyond that, obviously, there's the visualization and the analysis you can do with that data. So let's go through some of these components; I'll pick the Google Cloud Platform ones here. First, we have Google Cloud Pub/Sub. This is a global publish-and-subscribe API, and it is serverless. In other words, all I need to do to set one up is create a topic within the console. I don't need to do anything else; at that point, you can start sending messages into it. It will auto-scale and do everything it needs to do to deal with the load coming in from day one, whether you need to send it 100 messages a second or hundreds of thousands of messages a second. It can also act as a shock absorber for our system, because it will keep data for up to seven days. In other words, if there are any issues downstream in the processing, the absorbed data will stay in Pub/Sub for up to seven days for us to pick up later. The final thing to think about with Pub/Sub, which I find interesting, is this: we often use Pub/Sub as the ingestion layer, the first layer for the stream of data that's coming in. But you can also use Pub/Sub as the glue between your data analysis systems and other departments and parts of your business. For example, you can send alerts and use Pub/Sub as the way of getting those messages to those other systems. We'll actually see a bit of that in the demo. In terms of Apache Beam and Dataflow, I'll talk about Apache Beam first. Apache Beam became an open source project about three years ago, when the Dataflow SDK was donated to the Apache Foundation to begin it. Since then, we've had a really vibrant community. In 2018, I believe it was in the top three Apache projects by number of commits, and it was also one of the most active on the dev lists. Actually, if you like technical content, there are some really senior Google engineers who work on this, and they have a lot of good, interesting technical discussions on the dev list. If you have a cup of coffee and a Saturday morning to spare, it's worth reading through those. In terms of what Beam does: it is a programming model that allows us to describe the pipeline, and it works for both batch and stream. It has specific primitives that let us deal with the complexity that arises when you want accurate, low-latency results from streaming data. It comes with multiple runners and multiple languages; I'll go through that in a bit more detail, as this is one of the things folks like about Beam. The vision is to eventually have many languages that users can use. Java and Python are supported at the moment, and more languages are coming, so ultimately users will be able to pick the language of their choice to design and build their pipelines. The other part is the choice of execution engine. Once you've built your pipeline, you can choose, for example, Apache Flink to run and execute the processing, or Google Cloud Dataflow, or Apache Spark. The one we're going to talk about more today is Google Cloud Dataflow.
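As a small illustration of that separation, the runner is just a pipeline option in Beam, so the same pipeline code can be pointed at Dataflow, Flink, or Spark at submission time. The flags below are the standard Beam ones; the surrounding class is a hypothetical skeleton.

```java
// The engine is chosen by a flag at submission time, not by the pipeline code:
//
//   --runner=DataflowRunner --project=my-project --region=us-central1 \
//       --tempLocation=gs://my-bucket/tmp
//   --runner=FlinkRunner
//   --runner=SparkRunner
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerChoice {
  public static void main(String[] args) {
    // --runner is resolved from the command line, so no engine-specific
    // classes need to appear here at all.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // ... the same transforms run unchanged on any of the runners ...
    p.run().waitUntilFinish();
  }
}
```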
Dataflow is a runner that you can run your Beam pipeline on. It's serverless and fully managed. It has exactly-once semantics, which is very important when you want accurate stream processing. It deals with all of the things like windows, state storage, and shuffling for you; as a developer, there's a lot it does for me that I no longer need to care about. The reason I like it is that I just write my Beam pipeline code, design the pipeline, and submit the job. At that point, I don't need to do anything else; I just wait for my results to show up. Underneath the covers, it's actually doing a lot. Just to give you a 30,000-foot view, I'll walk through what happens once I submit my job. I write my code, I build my pipeline, I call p.run(), and off it goes. It submits the job, and at that point Dataflow schedules the work. It spins up workers, deploys my code to all of those workers, and also deploys the code it needs to do things like shuffle and windowed state on those workers. It auto-scales those workers to the right level, increasing and decreasing them based on the load coming in. It takes the inputs I've given it, for example several streams and a database to read from, shards that data, and moves it across the workers. If a worker fails, it moves the work off that worker onto another. If there is too much work on a single worker, it can migrate some of that work to other workers. All of this is happening, including niceties like this: when I've written log.info or log.error messages in my pipeline, it pushes all of that from every worker to a central location, which for us is Stackdriver, so I can go to Stackdriver and see all of the logs of what's happening in my pipeline. All of that is taken care of; all I did was call p.run(). I don't need to care about any of the stuff happening underneath the covers. When it first came out, it did have a very steep learning curve, and we've been continuously trying to improve that. Obviously, that kind of job is never done; there's always something else you can do to make it easier. The things folks have appreciated center on a couple of items: one is the creation of a job UI, and the other is SQL, which Jan mentioned at the start. On the job UI: if you recall the three categories of processing I mentioned, one of them is transport, where you're just taking a stream or some content and moving it somewhere else. You're not doing any transformations, enrichment, or calculation. Having to write code for that is kind of boring, tedious work, and that's why we have templates that let you submit a job without code, just by filling out a form. There are templates for the common things people want to move, for example Pub/Sub to BigQuery, and we'll show a quick demo of that in a moment. That makes life easier. As for SQL, it is now one of the languages supported in Beam, and obviously it's a very well-used and well-liked language. Think of it this way: I'm a Java person, and if I needed to do a join in Beam, I'd have to write all of that out by hand, which is pretty painful. If you were doing it in Scala, it's a little less code. But what we'd really like to write is just the simple join, three lines of code, and leave it at that. With SQL support, we now have the ability to write exactly that SQL within the pipeline.
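As a hedged sketch of what that looks like with Beam's SQL extension (SqlTransform, in the org.apache.beam.sdk.extensions.sql module), assuming this fragment sits inside a pipeline where `sales` and `products` are schema-aware PCollection<Row> values, with hypothetical column names:

```java
// The tuple tags become the table names the query can reference, which is
// what lets a few lines of SQL replace the hand-written join code.
PCollection<Row> enriched =
    PCollectionTuple.of(new TupleTag<>("sales"), sales)
        .and(new TupleTag<>("products"), products)
        .apply(SqlTransform.query(
            "SELECT s.sale_id, s.store_id, p.product_name "
          + "FROM sales s "
          + "JOIN products p ON s.product_id = p.product_id"));
```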
One of the things that was announced, currently in Alpha, is the ability to write this SQL in the BigQuery user interface itself. Rather than choosing BigQuery as the engine, you can change to Dataflow, and it will submit the job for you. Can I have a show of hands: who's used BigQuery before? Okay, a few folks. BigQuery is a petabyte-scale data warehouse in the cloud. Actually, I'm going to go through a quick demo of this. What we're going to do is take sales information coming from a point-of-sale system into a Pub/Sub topic. What I want to do is the second type of work, enriching that stream: I want to take that information and have the store, city, and location added to each piece of data for downstream processing. To do that, let's go over to the console. In this mid-center pane is where I would normally type my SQL; if I typed it and hit Run, I would be using one of the datasets available in BigQuery. What I'm going to do instead is switch this to use Dataflow. To do that, I go to More, then Query settings, and switch the query engine from BigQuery to Cloud Dataflow. This makes some subtle changes to the UI. Now, I'm sure you don't want to watch me type a lot of SQL, so I'm going to cut and paste the SQL I prepared earlier. What's different with this UI is that if I go to Add Data, where previously you would only see the datasets available to BigQuery, we now have a new option: Cloud Dataflow Sources. In the Alpha, that means Pub/Sub. So now we have a stream as a source. I could do a search, but I've actually set this up for us already, and we can see the Cloud Dataflow sources available within the UI. If I look at Sales Per Store, we can see there is a schema attached. On that topic I'm sending JSON messages, and I have defined the JSON schema in a YAML file and uploaded it to the system, saying: this is what the JSON looks like. Now the system knows what it's going to get when it looks at that stream, and it can use that within the SQL statement, including validating my SQL. So here, the SQL statement is using a Pub/Sub topic and joining it with a BigQuery table, and the statement is valid. I'm going to hit Create Cloud Dataflow Job. I'll just hit Next on this; it lets me say where I want the output table of this job to be written. Let's just call it demo set and leave it at that. Now, this will take a little while; it takes three or four minutes in the Alpha to get going, so to save us time, I started one this morning. This is the Dataflow front-page UI, which shows all of my jobs. The gray one is the one I just submitted, so that's queued up and will take a few minutes to start. The one I started yesterday is this DF SQL inventory job. So that's the monitoring interface for Dataflow: we can see the Dataflow job in the middle and some stats and metadata on the side. It looks simple, but if I click to expand it, we see what was interpreted from the SQL into the Dataflow pipeline. What it's done is translate that SQL statement into a pipeline DAG, a directed acyclic graph, and this is the interpretation of what it's doing under the covers.
So here we can actually see it's reading from the Pub/Sub topic and, further on down before the join, we can see the read from BigQuery. So that has allowed me to build a pipeline against a stream and a table in BigQuery, all without leaving a user interface that lets me write SQL. Okay, so let's move back to our slides. The next part is the general availability of Dataflow's Streaming Engine. I described earlier all of the steps that happen underneath the covers when I submit a job: Dataflow spins up workers, puts my user code on them, and puts the Dataflow service code on them for things like shuffling and windowed state. The new service, which by the way is optional (you do not have to use it), takes a large chunk of that Dataflow-service work away from the workers, leaving mostly just my user code running on them, and moves it into the Streaming Engine. There are some very direct benefits for a user who opts in. First, we get a much better picture of utilization, because it's now mostly user code on those workers, so our auto-scaling becomes much smoother. There's a blog post on the site about this, with graphs that show some of that effect. Second is better supportability: before, you needed to run an update on your Dataflow pipeline to get new features in these parts, the shuffling and the streaming; now, because it's all in the service, this is transparent to your pipeline, and you don't need to update anything yourself to get the benefit. And finally, there's less resource usage on the workers, so we're getting much more out of those CPUs. The service is optional and has a cost associated with it per gigabyte. It's currently available in four regions.
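For completeness, here is a sketch of opting in when submitting a Beam Java job. The exact mechanism has varied by SDK version, so treat the experiment name below as the documented flag at the time of this talk rather than a universal constant; the class around it is hypothetical.

```java
import java.util.Arrays;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StreamingEngineJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setStreaming(true);
    // Opt in to Streaming Engine. Passing
    // "--experiments=enable_streaming_engine" on the command line has the
    // same effect; newer SDKs also expose a dedicated option for this.
    options.setExperiments(Arrays.asList("enable_streaming_engine"));

    Pipeline p = Pipeline.create(options);
    // ... build the pipeline as usual; shuffle and windowed state are now
    // handled by the service rather than on the worker VMs ...
    p.run();
  }
}
```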
Okay, so now I'm going to move over to an end-to-end demo. I think if we were in a restaurant, this would be the point where they give us a glass of water to refresh our palates. And we go straight into demo mode. For this demo, we're going to use a retail company as our demo environment. Let's imagine a retail company with many stores, where the VP of sales has asked us to improve the sales process, specifically around couponing. Today, her sales results from the stores arrive at the end of the day, or the next day, because everything is batch processed. She wants real-time data so that she can make couponing decisions based on things that are actually happening. For example, a store has too much inventory of fresh produce, it's going to start deteriorating, and we need to move it; that would be one case. Or making sure people come to special events at the stores, things like that. The other part is targeted couponing. I'm going to go through the real-time one, and if we have time, we'll do the targeted couponing as well; that touches on some of the ML pieces. Here's the general pipeline we'll use. The point-of-sale system, which is the scanning of what's in the basket, sends data from the stores to Cloud Pub/Sub, which absorbs the data stream. Then we do the processing, and we're going to do three things with every element. The first thing is to send it every minute to Google Cloud Storage, as plain Avro files, for our data lake. This is raw, untouched, untransformed data; we're just going to keep a copy there. The second thing we're going to do is, without any windowing, stream it directly into BigQuery. There we use BigQuery's ability to absorb streamed data and keep tables fresh for our sales VP. The final piece is where we get into that third category of stream processing, where we do some basic analysis on the stream: we count the number of sales of fresh produce per store and send that back to Pub/Sub. The reason is that Pub/Sub then becomes the way of connecting our data pipeline to other parts of the business, for example the inventory department. The inventory department then has an up-to-the-minute, near-real-time view of what is happening in the stores and can start scheduling deliveries as the need arises. Okay, so on to the demo itself. As we mentioned, Pub/Sub is a fully serverless API, so showing you the creation of a topic would be a few clicks, not that interesting. What I'll show you instead is some of the monitoring statistics from Stackdriver. On the top left here we have the stats from my orders topic, and we're seeing around 200 orders per second coming into the system. Now, obviously our stores are super successful, because if you do the math, 200 sales a second is something like 10 million a day; these are really, really big stores. We've bumped the numbers up to make it more interesting. So there are 200 orders a second, and every order has multiple lines: every sale is a basket of, say, 6 or 10 items. So there's actually quite a decent volume of data. What's really interesting is the graph on the right. This graph shows how long each message stays in Pub/Sub before it's processed, and we can see the median is one second. So all of those messages are coming in and taking about a second before we pull them off with Dataflow. There are some graphs underneath around Dataflow's performance metrics, but rather than talk through those, I'm going to show you Dataflow again. So here we have the monitoring user interface. This area is the description of the pipeline: I've written this in code, and it has been interpreted into this series of transforms. As you'll notice, it is a DAG, so no cycles are allowed; everything flows downwards, and we have the three activities we were talking about. The first one, on the far left, is the writing to Avro: we apply a one-minute window and then push the files out as Avro. This is the raw, untouched data. In the middle path, there is no windowing, no batching, none of those pieces; we just stream it directly into BigQuery, and that's going in at a couple of hundred rows a second. Obviously, BigQuery and Dataflow can cope with much, much higher numbers, into the hundreds of thousands. And on the right-hand side, we're doing the stream analytics. Here we filter on fresh produce. If I click on that, we can see that since this started, 12 million orders have been processed by this line. What I'm doing in that piece of code is taking each order and expanding it into all of the lines it has, all the items in the basket, so it's now a bigger collection, and then filtering out every element that isn't fresh produce. So there's a massive reduction: we go from 12 million to 4 million. And it's actually much bigger than that when you remember the 12 million gets multiplied by the number of lines: if there were 10 line items per order, that's 120 million elements reduced to 4 million.
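Pulling those three branches together, here is a rough sketch of the pipeline shape; it is not the demo's code. To stay short, it writes text files rather than Avro, filters whole orders instead of fanning out to line items first, and fakes the JSON handling; the topic, bucket, and table names are hypothetical.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class RetailPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> orders = p.apply("ReadOrders",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/orders"));

    // Branch 1: raw, untransformed copy to the data lake in one-minute windows.
    orders
        .apply("Window1m", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply("WriteRaw", TextIO.write()
            .to("gs://my-bucket/datalake/orders")
            .withWindowedWrites()
            .withNumShards(1));

    // Branch 2: no windowing at all; stream every order straight into BigQuery.
    orders
        .apply("ToRow", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via(json -> new TableRow().set("raw_order", json)))
        .apply("WriteBQ", BigQueryIO.writeTableRows()
            .to("my-project:retail.order_with_lines")  // assumes table exists
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    // Branch 3: in-stream analytics; keep fresh produce, count per store in
    // ten-minute event-time windows, and publish the counts back to Pub/Sub.
    orders
        .apply("FreshOnly", Filter.by(json -> json.contains("\"category\":\"fresh\"")))
        .apply("KeyByStore", WithKeys.of((String json) -> extractStoreId(json))
            .withKeyType(TypeDescriptors.strings()))
        .apply("Window10m", Window.into(FixedWindows.of(Duration.standardMinutes(10))))
        .apply("CountPerStore", Count.perKey())
        .apply("Format", MapElements.into(TypeDescriptors.strings())
            .via(kv -> kv.getKey() + "," + kv.getValue()))
        .apply("PublishCounts", PubsubIO.writeStrings()
            .to("projects/my-project/topics/fresh-produce-counts"));

    p.run();
  }

  private static String extractStoreId(String orderJson) {
    // Placeholder: real code would parse the JSON event properly.
    return "store-unknown";
  }
}
```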
We then apply a 10-minute window and start counting. The reason we have to window is that in order to do a count or an average, you need to stop at some point, and we have to tell the system when to stop. What's important is that I'm using event time rather than processing time for this: it's not based on 10 minutes of wall-clock time, it's based on 10 minutes of the events' own time. It would take too long to cover that distinction in detail in this session. Next, I output this data back into Pub/Sub. So we can imagine that down here somewhere is our inventory department, picking up this information and keeping inventory fresh at our stores. Okay, now let's look at the BigQuery path. I'm back in BigQuery, and I'm going to use standard BigQuery, without the Dataflow engine. Here I'm looking at a specific table, which happens to be the order-with-lines table. This is the table Dataflow is pushing information into. The schema is the order, plus a repeated field: that repeated field is the order lines, in other words a collection we've set up. If I look at the details for the table, it's not particularly huge: about 340 gig of data and around half a billion rows. It is partitioned, and, more interestingly, at the bottom we can see that the streaming buffer is active. In other words, this table is getting constant updates from that streaming pipeline. I'm going to run a query here that does some enrichment of the data, and, more importantly, I've put in a predicate: where the time is less than 10 minutes ago. Let's run that query. Now, BigQuery uses a columnar storage structure, which means that if my select statement doesn't choose all of the columns, it doesn't have to process all of the columns; it prunes them off. Also, because I had that time constraint and it's a partitioned table, the system automatically ignores the partitions that aren't required. So instead of having to process all 340 gig, we only had to process 70 meg. And here we can see, in this column I added for convenience, the current time, 3:46:06, and we can see that West Palm Beach, which is one of the highest in number of sales in the last 10 minutes, had an update about three seconds ago. The other one, whose name I'd pronounce badly, so let's say Houston, was updated at 04. So this table has fresh, real-time data for all of the analysis. And I haven't had to set up a single system; I haven't installed anything. I've just written code for that entire pipeline.
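The same freshness-query pattern, sketched through the BigQuery Java client; the table and column names are hypothetical stand-ins for the demo's order-with-lines table.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class FreshSales {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Selecting few columns limits the columnar scan, and the timestamp
    // predicate lets BigQuery prune partitions that can't hold the
    // last ten minutes of data.
    String sql =
        "SELECT store_id, COUNT(*) AS sales "
      + "FROM `my-project.retail.order_with_lines` "
      + "WHERE order_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE) "
      + "GROUP BY store_id ORDER BY sales DESC";
    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
    result.iterateAll().forEach(row -> System.out.println(row));
  }
}
```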
Now that we have that data, we've solved the real-time problem. How are we doing for time? Are we okay? Okay, so I'll do the couponing piece. Now I'm using Data Studio, and with Data Studio we can visualize some of the information that's in BigQuery. I opened this page just before the talk, so the numbers at the top will be slightly off, but here we have a visualization of the sales in the last 10 minutes from our stores. We have the facilities data, so this ties in the city; we now have zip code and, therefore, lat/long locations. We also have loyalty cards: as you go to the supermarket and buy things, if you use your loyalty card, that ties the sale to you as a user, via your ID. So we have loyalty card information for those sales where the card was provided. And using the GIS functionality in BigQuery, we have another table that gives us the loyalty program's total lifetime sales for me as a user at stores within 20 kilometers of me; that's basically distance plus lat/long, which we can compute with GIS functions in BigQuery. Okay, so we've got lots of interesting data: data on our stores, plus the live, fresh data. This is useful as one of the inputs for building more interesting couponing. Next, let's see which item we should coupon against. In our stores, globally, across all sales ever, which was that half a billion rows, the top 20 items in a basket tend to be fresh goods, not surprisingly. And in the US, it tends to be bananas, organic bananas, et cetera; that's a very common purchase when people go to the grocery store. In this view, we've got some geographic filtering done on New York state. But essentially, this is an item that we now know most people want from the store. So it's useful that, earlier in that Dataflow pipeline, we were tracking fresh produce: we want to know we've always got goods like bananas available, as that's one of the common things people come to our stores to buy, and maybe it's something we can use in our couponing to bring people in. One way to start is by exploring the data with SQL. I'm not a data scientist, so let's begin with basic SQL. Let's look at, for a specific user in our loyalty program, what time of day they generally make purchases. If I run the query (actually, I ran it earlier, but let's run it again), what we see is a customer ID and these tuples of afternoon, F and afternoon, T, with F for false and T for true. Essentially, this user has bought that produce, bananas, et cetera, on some afternoons and not on others. All of this data together gives me some idea of when that user will buy something within a day. But what if I then want to add the frequency, the location, the weather, all of these other factors? You can keep writing more and more complex SQL to try and do this. Or, as one of my data scientist colleagues did, you can just use logistic regression to solve the problem. Now, because BigQuery supports ML within BigQuery itself, we don't have to export the data to another system, run tests, set up notebooks, et cetera. Within BigQuery itself, we can run the ML that we need. So here we have the ability to write, in SQL with these extensions, a CREATE OR REPLACE MODEL statement for a logistic regression model. And the other thing we're doing is the feature extraction, also in SQL: again, a very easy language for a lot of people, which makes this very accessible. This will go off and build a model. I'm not going to run it now, because it takes a few minutes, but essentially we end up with a model on the left-hand side here, this will-buy-banana model. And if we look at the details, we can see the number of iterations the training ran and some results.
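Since BigQuery ML models are created with plain SQL, the step just described can be sketched like this, issued here through the BigQuery Java client. The dataset, model, label, and feature names are all hypothetical, and the feature extraction shown is a simplified time-of-day bucket.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class TrainCouponModel {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Model creation and feature extraction are both expressed in SQL.
    String sql =
        "CREATE OR REPLACE MODEL `retail.will_buy_banana` "
      + "OPTIONS (model_type = 'logistic_reg', "
      + "         input_label_cols = ['bought_banana']) AS "
      + "SELECT "
      + "  customer_id, "
      + "  EXTRACT(HOUR FROM order_time) BETWEEN 12 AND 17 AS afternoon, "
      + "  bought_banana "
      + "FROM `retail.training_orders`";
    bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
  }
}
```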
The training curves are also visible in BigQuery itself; you can see graphs of the training runs. And when evaluating the model, we can tweak the score threshold. Again, I'm not a data scientist, so I won't attempt to explain all of these pieces, but essentially we have our confusion matrix showing when the model got it right, in terms of guessing whether a particular user is likely to buy that particular produce. Finally, we bring it all together. We can run a query against that model to give us a prediction of the folks who are likely to buy that produce. In this case, we use some bounds: smaller than 60% and bigger than 50%. The reason for smaller than 60% is that if they're most likely going to buy it anyway, why send them the coupon? And bigger than 50% is because, if there's a cost associated with sending coupons, you want to aim them at the right target audience. But that's entirely a business decision. I can run that and get some results, and then you can imagine running this against the real-time, fresh data: I can start making couponing decisions within the day, including tying it back to the inventory we have at different stores. And all of this we were able to do because we were doing everything in stream pipelines. Going back now: thank you very much. And again, I appreciate you all coming in on a Saturday to spend some time with us. Thank you.