Good afternoon everyone. A very warm welcome to this session. My name is Russell Nash, I'm a solutions architect with Amazon Web Services, and I'm delighted to be here today giving this talk to you. If you're trying to pick my accent, I was born in England and emigrated to Australia a few years ago. On any given day I identify either with being English or with being Australian, depending on who's doing better in the cricket. Today I'm English, after Australia's absolutely shocking showing in the first Ashes test.

At the end of this session we're going to have a bit of a Q&A and we've got a few t-shirts to give away. So please stick around and ask some interesting questions and we might be able to throw some t-shirts your way. Sorry. Oh, okay. Okay, just Q&A, no t-shirts.

We'll get to the agenda in a second, but I just wanted to give you a quick understanding of what this talk is going to be about. It's basically going to be a love story. It's a love story between big data and analytics and any scalable system that needs a lot of infrastructure, and it's a love story between those algorithms and those architectures and the cloud. A lot of the talks you've heard in the last couple of days have been about doing things at scale, and if you do things at scale you need a lot of infrastructure. And where's the best place to get a large amount of infrastructure? It's in the cloud. So essentially it's this fantastic, beautiful tale of two systems that intertwine really well together.

The actual agenda is pretty straightforward. I just want to say a couple of words about AWS, in case you're not familiar with Amazon Web Services, and then we're going to talk about the Lambda architecture, which some of you might have heard of.

And this is a very dark slide. So for those of you who haven't heard of AWS, we're a provider of cloud-based infrastructure and we've got global regions in many different places around the world, obviously primarily in the US, but also Europe and Asia Pacific. Any of our customers can use any of those regions to deploy their cloud-based infrastructure. Now what you will notice is that there's currently no blue box in this area. We're looking to change that. We recently announced that we're going to open up some data centers within India, which will give you a local region for you guys to deploy your infrastructure in. We're very excited about that and we're hoping our customers are as well. We're looking at some time in 2016 for that to kick off.

In terms of the services we provide, I'm obviously not going to go through all of this, but I just wanted to touch on a couple of points. At its most basic, we have compute. You can spin up instances, various flavors of Linux, various flavors of Windows, on various instance sizes depending on your workload, and then treat those machines as you would a normal on-premises machine. You can install on them whatever you like; you can do with them whatever you like. So as we go along, keep in mind that for pretty much anything you want to do, you can just spin up machines, which we call EC2 machines, and use them as you would normally.

But on top of that, we've also built a lot of platform services to give you that higher level of management, to take away some of the heavy lifting of actually getting these applications up and running. And in the analytics space, I'm going to talk about some of these.
We've got a managed Hadoop service, a real-time streaming data service, a SQL analytics database, and a NoSQL database as well. These, as I said, are designed to get you up a level, so that you're doing less management of the infrastructure and can spend more time doing the fun stuff. Now the other thing to note here is that, on top of this again, we have partners who will remove even that level of work and get you moving even faster. So we've got people like Qubole, for example, who run their platform on top of AWS to give you an even higher level of abstraction from the underlying management piece.

Now one of the points I wanted to make as we go through this is that there are obviously many, many different ways to do anything you want to do these days. There are lots of different open source options, there are Amazon services, there are all sorts of third parties. And sometimes we can get a little bit focused on the particular piece that we know, and we push that against other options. I wanted to make the point that it doesn't have to be that way. I saw a quote the other day that I thought summed this up quite well, and this was basically it. And it's true, right? We get very focused on the stuff we know. We say my database is better than your database, NoSQL is going to rule the world and everything else is dead. And that's simply not the case. As you will know, every single technology has its strengths and its weaknesses, and we certainly feel that. Whatever's the best infrastructure for our customers, we'll support that. We don't want to push you onto a particular technology that we don't think is right for your workload.

So let's get into the Lambda architecture. Now for those of you who looked at the name of the talk, you might have said, that sounds suspiciously like the Lambda architecture, why didn't he just call his talk "the Lambda architecture"? And the reason I didn't is that the word Lambda actually means a lot of different things now, and I didn't want to make it unclear what the talk was about. So let me cover off a couple of those. The first one is the Lambda architecture, which we're going to talk about. But confusingly, AWS has released a service called Lambda which has got nothing to do with the Lambda architecture, although you can plug it into the Lambda architecture if you want to, just to make things confusing, and we'll talk about that a bit later on. What the Lambda service does is event processing: you give us code and we will fire that code in response to an event, but you don't need to provision any infrastructure to run that code. We'll do all that for you. So we'll have a good look at that later on. The third one is obviously lambda functions; those of you familiar with functional programming will know these as the anonymous functions that you can use. And then there's a fourth one. Who was a teenager in the 1980s? Really? Seriously? That's all right, you can admit it. We're part of the wise generation now. So who remembers Revenge of the Nerds? Anybody remember that movie? Some people do. In Revenge of the Nerds, the guys in there invented a fake college fraternity and they called it Lambda Lambda Lambda. So whenever I hear people talk about Lambda, it always reminds me of these very nerdy guys running around trying to make it in college.
So let's talk about the Lambda architecture. The Lambda architecture was invented by a guy called Nathan Marz, whom some of you might have heard of. He's got a little bit of street cred because he wrote Apache Storm, so he kind of knows what he's talking about. And his whole point was that a lot of batch processing systems, heavy-duty data processing systems, are overly complex. He has this great quote about complexity, and he talks about it in a couple of terms. He says hardware is going to fail; everybody knows that. But he says humans fail as well. You introduce errors when you do a release, you suddenly realize you've got issues, so you've got to build a system that you can easily fix, that you can easily roll back to a known state. And that's why he invented the Lambda architecture.

Basically, at its core, he's talking about large amounts of new data coming in. You take that data and you put it into some kind of master store. Then you have a batch layer that does your processing, your enrichment, your business logic, and you place those views into the serving layer, and the serving layer is what you query. Now one of the keys to this architecture is that in the batch layer you've got what he calls an immutable data store. You never update that data; you only append to it. If a value changes, you don't change it in that store, you simply add another record and you timestamp it. And the reason he says you do that is because it's then much easier to go back and recreate any of your views in the serving layer if you introduce some kind of issue.

Now, that's all well and good, but what if you want faster access to that data? So he introduces something he calls the speed layer. The speed layer is designed to take those events, that data, those transactions, and get them into the serving layer very quickly. It doesn't necessarily apply the same level of enrichment, the same business processing that goes on in the batch layer, but the data is there more quickly for you to query. And so through the serving layer you can query both the historical, richer data and the newer, maybe less rich data, and marry them together. Eventually the newer data from the speed layer gets superseded by the batch-processed, enriched data. And he says the stuff in the speed layer is only there for a very short time, so if you make a mistake in that layer it doesn't really matter.

Now there are obviously a bunch of different ways you can implement that architecture. One of the common ways we see: the batch layer lends itself very much to some kind of Hadoop implementation, Apache Kafka is a nice fit for that speed layer, that ingestion of very large volumes of data, and then some kind of database as your serving layer that you can then query. Now if you wanted to run this on AWS, you could, as I said, just spin up EC2 machines, install Kafka, install Hadoop, install the database, manage that yourself and off you go. But what I wanted to talk about is what happens if you want to use some of the higher-level Amazon services, some of the platform services, to remove some of that management underneath. We've got a couple of services that fit into those different areas. So I want to talk about Amazon EMR, which is Elastic MapReduce. For the speed layer I want to talk about Amazon Kinesis. And then for the database we're going to touch on Amazon Redshift as well. And once we're done with that, we'll very quickly look at some alternate approaches too.
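Before we get to the services, here's a minimal sketch in Python of that immutable, append-only idea. All the names here are hypothetical; it just shows the core discipline of the batch layer: append timestamped facts, never update in place, and rebuild views by replaying the log.

```python
import time
from collections import defaultdict

# Hypothetical append-only master store: a log we only ever append to.
master_log = []

def record_fact(user_id, field, value):
    """Append a timestamped fact; never mutate existing records."""
    master_log.append({
        'user_id': user_id,
        'field': field,
        'value': value,
        'ts': time.time(),
    })

def rebuild_view():
    """Batch layer: replay the whole log to recompute the serving view.
    If a bug corrupts the view, fix the code and replay the log again."""
    view = defaultdict(dict)
    for fact in sorted(master_log, key=lambda f: f['ts']):
        view[fact['user_id']][fact['field']] = fact['value']  # latest fact wins
    return view

record_fact('u1', 'city', 'Pune')
record_fact('u1', 'city', 'Mumbai')   # a change is a new record, not an update
print(rebuild_view()['u1']['city'])   # -> 'Mumbai'
```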
So let's have a look at EMR. What is EMR? EMR stands for Elastic MapReduce, and basically it's a managed Hadoop service. Think about your normal Hadoop stack: you get a bunch of machines, you install Hadoop on them, you get all the machines working and talking to each other, and off you go. Elastic MapReduce doesn't replace that; it manages that process for you. You get onto the EMR console and you say, I want this many machines, I want this distribution, go and do it for me. It'll spin up the machines, install everything for you, make sure the master node is there, the data nodes are there, they're all talking, and then it ties a lovely red bow around it and hands it to you. So you get a much faster time to start using the cluster, rather than running around trying to get it to work. And then of course you can plug in all the other fun stuff on top. At the end of the day it's just Hadoop, so you've got access to the whole ecosystem that goes along with that.

Why do people like EMR? The management is a big piece, that ability to get going quickly, the scale of it, and also the cost of it, and we'll look at that in a bit more detail. In terms of the management, there are a lot of ways you can spin up an EMR cluster: through the console, through the API, through the SDK, through the CLI, depending on what you want to do. If you go through the console, this is part of the console. We ask you questions about what you want to call the cluster, how many machines you want, and so on. But we also ask what you want installed on the cluster. By default we'll put on the ubiquitous Hive and Pig, but then we've got a drop-down to say, do you want Spark on here, Impala, Ganglia, et cetera. So that makes it easy to just say, this is what I want on my cluster. But often customers want more than what's on that list, and so you can use a bootstrap action to install whatever you like. There's obviously a whole bunch of other ecosystem components that aren't on the list, and you can very easily install them as well.

Now here's a picture that you can't possibly read, but the idea is just to show you the different families of instance types that we have. At the end of the day EMR is just running on EC2, EC2 being our virtual instances, and you've got a choice of a lot of different families of instance types. There are ones that are heavy in terms of disk, others that have more memory, some that have more CPU, depending on your workload. And they come in families, so this one here might be the general-purpose family. Down the bottom is basically CPU and this axis here is memory, so you can pick where you want to be on the CPU-versus-memory map and choose your instance types. That gives you an enormous amount of flexibility to choose a different type of instance depending on your workload, depending on what you're trying to do.
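If you'd rather script that than click through the console, here's roughly what it looks like with boto3, the AWS SDK for Python. This is a minimal sketch, not a production template: the cluster name, bucket, bootstrap script path, and sizing are all made up, and the release label is just an example.

```python
import boto3

emr = boto3.client('emr', region_name='us-east-1')

# Ask EMR to build the cluster: it provisions the EC2 instances,
# installs Hadoop plus the listed applications, and wires it all up.
response = emr.run_job_flow(
    Name='analytics-cluster',                    # hypothetical name
    ReleaseLabel='emr-4.2.0',                    # example release
    Applications=[{'Name': 'Hive'}, {'Name': 'Pig'}, {'Name': 'Spark'}],
    Instances={
        'MasterInstanceType': 'm3.xlarge',
        'SlaveInstanceType': 'm3.xlarge',
        'InstanceCount': 5,                      # 1 master + 4 workers
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    # Bootstrap action: install anything not in the built-in list.
    BootstrapActions=[{
        'Name': 'install-extras',
        'ScriptBootstrapAction': {'Path': 's3://my-bucket/bootstrap.sh'},
    }],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])
```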
Now let's look at cost and time. This is really interesting, because if you're renting your infrastructure by the hour, working out how much it's going to cost you is pretty straightforward. So if you look at this kind of equation, you've got, on the... I always forget which is which. I'm assuming the Y axis is the one that goes up and down? Yeah, thanks. It's been a long time since I did that at school, because I was a teenager in the 80s, unlike you guys. So basically your equation is: how many CPUs do I want to throw at it, and how much time is that going to take? To work out how much that's going to cost you is basically just whatever is under this red rectangle, right? The more grunt you throw at it, the more it costs per hour, but you might do it in less time. So let's say you have a job that takes five hours if you throw a moderate amount of CPU at it, but if you throw a lot of CPU at it, it only takes one hour. The area of that rectangle is the same, because you're only paying for your stuff by the hour. So you may as well just throw a lot of CPU power at it and get it done quicker. Why would you not? And you can do this with the cloud, because we've got a lot of infrastructure. Throw as much at it as you can, get your stuff done quicker, and don't necessarily pay any more for it.

Now there are other ways to save money as well with EMR. So let's look at this picture of a Dalmatian. We have another aspect of the EC2 market called the spot market. When you want to rent an EC2 machine, you can do that with what we call on-demand: you just say, give me that machine right now, and there's a price attached to that. But we also take any unused capacity and we allow you to bid on it. So for example, if an instance type was 20 cents on-demand, you might say, I'm prepared to pay 3 cents for that. There's a spot price that fluctuates depending on demand, and if you bid for an instance and you're above the spot price, we'll give you that instance for whatever the spot price is.

So how does that work with EMR? Let's say you spin up a cluster with 4 on-demand nodes and your job takes 14 hours. The cost is those 4 instances, for 14 hours, at 50 cents, 50 cents being the on-demand price, the going rate for that particular instance. So that costs you $28. Now in scenario 2 you keep your 4 on-demand nodes, but you add 5 spot nodes. Now obviously the duration comes down, because you've got more grunt in your cluster, but look what happens to the price. Your 4 instances at 50 cents are only running for 7 hours, only half the time, so they cost you a lot less. And the 5 spot instances, let's say you got them for 25 cents instead of 50 cents. So the total is $22.75. You end up in a situation where the job's taken half as long and it's actually cost you less to run. That seems to break all normal convention around buying things. How can it take less time and also cost less? So you're asking yourself, what's the catch, right? Why don't I just run stuff like this all the time?
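Before we get to the catch, here's that back-of-envelope arithmetic spelled out, using the prices and durations from the two scenarios above:

```python
def cluster_cost(groups):
    """Sum (node_count * hourly_price * hours) across instance groups."""
    return sum(count * price * hours for count, price, hours in groups)

# Scenario 1: 4 on-demand nodes at $0.50/hr, job takes 14 hours.
scenario_1 = cluster_cost([(4, 0.50, 14)])              # -> $28.00

# Scenario 2: same 4 on-demand nodes plus 5 spot nodes at $0.25/hr;
# the extra capacity halves the runtime to 7 hours.
scenario_2 = cluster_cost([(4, 0.50, 7), (5, 0.25, 7)]) # -> $14.00 + $8.75 = $22.75

print(scenario_1, scenario_2)  # 28.0 22.75
```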
So the catch is that if the spot price goes above your bid price, we will take the instance back. We'll say, thanks, we'd like that spot instance back. So now you're thinking, well, that sounds pretty useless. If you're just going to take it off me, how can I use it for long-running compute jobs? So what we did within EMR, and this is looking into the internals of EMR, is we introduced a particular group of nodes specifically to take advantage of spot. You've got your master instance up the top; you don't want to lose that. You've got your core instance group, which has HDFS attached; you don't want to lose those either. But then we've got these task nodes, and the task nodes do not have any disk attached to them. All they do is processing. So if one of those goes away, it doesn't really matter: Hadoop will just take the job it was running and give it to another node. So this is where you can really take advantage of the spot market, by using the task nodes. We have some customers that create massive Hadoop clusters with a lot of task instances on spot, and they crunch through their stuff very quickly and it doesn't cost them a lot to do it.

Just to bring that point home, this is a screenshot from the EMR management console, and there are a couple of things here. This is where you specify what instance types you want for the different nodes. You've got your master, you've got your core nodes, and you can choose the different instance types. I don't know if you can read that, but basically they're grouped into compute-optimised, GPU instances if you want to use those, memory-optimised, and storage-optimised. And then down here you've got your task nodes, but you can create multiple different groups of task nodes, and potentially you can bid on spot for each. So here I've said for this group I'm going to bid 5 cents, this group 10 cents, and this group 20 cents. When you hover over that little information icon, it'll show you the current spot price, so you know roughly where it is now and can work out what you want to bid. The idea is that when I kick it off, the spot price is less than 5 cents, it was about 3.5 cents when I did this, so I have all three groups running. If it rises to 8 cents, that's okay: these nodes get taken away, but I've still got these. The same if it goes up again to 15 cents. So it allows you to take this tiered approach, to ensure that your job keeps going even though the spot price might rise. To give you an example, when I did this, the on-demand price for the m3.xlarge was about 27 cents and the spot price was about 3.5 cents. So a big, big discount if you can use the spot market.
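Scripted rather than through the console, those tiered task groups might look something like this with boto3. Again a sketch, with made-up bids, added to a hypothetical running cluster:

```python
import boto3

emr = boto3.client('emr', region_name='us-east-1')

# Three spot task-instance groups with tiered bids: as the spot price
# rises past each bid, that group is reclaimed, but the others keep going.
# Master and core stay on-demand so HDFS is never lost.
emr.add_instance_groups(
    JobFlowId='j-XXXXXXXXXXXXX',   # hypothetical cluster id
    InstanceGroups=[
        {'Name': 'task-bid-5c',  'InstanceRole': 'TASK', 'InstanceType': 'm3.xlarge',
         'InstanceCount': 10, 'Market': 'SPOT', 'BidPrice': '0.05'},
        {'Name': 'task-bid-10c', 'InstanceRole': 'TASK', 'InstanceType': 'm3.xlarge',
         'InstanceCount': 10, 'Market': 'SPOT', 'BidPrice': '0.10'},
        {'Name': 'task-bid-20c', 'InstanceRole': 'TASK', 'InstanceType': 'm3.xlarge',
         'InstanceCount': 10, 'Market': 'SPOT', 'BidPrice': '0.20'},
    ],
)
```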
The other nice thing about Elastic MapReduce is that we've done a lot of work to optimise EMR talking to S3. S3 is our Simple Storage Service; it's an object store, highly scalable, highly durable, a great place to store stuff. And if you've got content in S3, that obviously makes it easy to share it with other applications, not just other Amazon services but any other application you're running on the cloud as well. Plus there's the added bonus that you can archive off to Glacier. Glacier is our low-cost cold storage, so you can archive stuff off there and again save money on storage. So that's Elastic MapReduce.

Let me talk a little bit about Kinesis. Kinesis is very similar to Kafka, if you're familiar with Kafka, quite similar in its design. The basic idea is that you have a bunch of data sources, whether you're trying to capture device data or clickstream data, whatever it might be. Kinesis sits across three availability zones; an availability zone for us is basically a data center, so you can think of it as a physically separate location. So we write every message to three physically separate locations when we receive it. You create a stream within the Kinesis service, and that gives you an endpoint that those devices can then talk to. You push the data into Kinesis, and then at the back end you have your consuming applications that read from the stream. What we see quite a lot is people just doing logging, for example: reading from the stream, and because they want a log of every event, writing that straight out to S3. You might want to do some real-time metrics too, so you might take KPIs out of the stream, write them to a NoSQL database, and put a dashboard on top of that. Then you might want to do longer-term analysis, so you might push the data into Redshift, for example. Or you might want to push it into a non-Amazon service: there's a spout for Apache Storm, there's an Elasticsearch interface, and others as well. So Kinesis is designed to do that real-time capture and store it durably for you.

Just a quick point about how you scale Kinesis. Your unit of scale is called a shard, and a shard will give you a thousand transactions a second, or one meg of data. So if you need five thousand transactions a second, you go for five shards, and you can scale that up. If you want to get data out, you can do that in a couple of ways. You can have EC2 instances at the back end with worker threads running on them; each of those workers talks to a shard, pulls data off the shard, and then does something with it. We also built a client library around this to make it a bit easier to write these applications, and that's available in Java, Node, Python, .NET and Ruby. Anything we didn't cover there? Any particular language that you wish for? Anything? No? Covered all bases? Okay. Service teams love to hear feedback, so I'll pass that back.

As I said, you can spin up EC2 machines and run your consuming applications on those, but this is where the AWS Lambda service can potentially come in. You can give us code which we will then run in response to a Kinesis event. So when a transaction hits Kinesis, we'll run your code to deal with it. The great thing about Lambda is there's no infrastructure underneath it that you have to manage, and you pay for it in 100 millisecond increments. That's what it's designed for: short, sharp event processing. It's currently available in Node and Java, and we'll add more as we go on.

How is Kinesis different from Kafka? I know you're all asking yourselves. In many ways they're very similar. In terms of latency, both very low; we're talking about real-time processing, so it had better be pretty responsive. They can both scale up to very large volumes. They both respect the ordering of events as they come into the stream. Here's where they differ slightly. The persistence for Kinesis is currently 24 hours, so we hold events and messages in Kinesis for 24 hours before they age out; that's configurable in Kafka. In terms of the payload, that can currently be one meg per message in Kinesis; again, that's configurable in Kafka. And in terms of whether it's a managed service: Kinesis is a managed service, Kafka's not. So again, we don't want to be religious about it. This is what Kinesis does; if your use case doesn't fit that, that's totally fine. So that's Kinesis.
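To give you a feel for the API, here's a minimal boto3 sketch of both ends of a stream: a producer putting records in, and a bare-bones consumer reading one shard directly. The stream name is hypothetical, and a real consumer would typically use the Kinesis Client Library rather than polling shards by hand like this.

```python
import json
import time
import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')
STREAM = 'clickstream'  # hypothetical stream, created beforehand

# Producer side: the partition key determines which shard a record lands on.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({'user': 'u1', 'action': 'click', 'ts': time.time()}).encode('utf-8'),
    PartitionKey='u1',
)

# Consumer side (simplified): iterate one shard from its oldest record.
shard_id = kinesis.describe_stream(StreamName=STREAM)['StreamDescription']['Shards'][0]['ShardId']
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType='TRIM_HORIZON',
)['ShardIterator']

out = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in out['Records']:
    print(record['PartitionKey'], record['Data'])
```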
Let me talk quickly about Amazon Redshift. Amazon Redshift is an MPP SQL analytics database, MPP standing for massively parallel processing, and I like to compare it to the traditional SQL databases that we all know and love, just so you can get a handle on how they fit together.

In terms of the type of database, they're both SQL relational databases. Sometimes people think of MPP databases like Redshift as being not relational: maybe they're NoSQL, can I even do joins between tables? It's fully relational, fully ANSI-standard SQL. The difference is in the architecture, not in the layer you see as a database user. SMP, a bit of a 90s term, stands for symmetric multiprocessing, which basically means that your traditional SQL databases scale really well in a single box or a small cluster, because they share resources really well. So I like to think of the traditional SQL databases like my seven-year-old. Loves to share. Always sharing food, sharing toys, a great sharer. The MPP databases are like my four-year-old. No sharing. Hates to share. And while that's bad with kids, it's fantastic if you're trying to scale a database. The reason the MPP databases scale to very large volumes is that each node is a standalone unit: it doesn't share disk, CPU, or memory with other nodes, so there's no contention as you add more nodes. So in terms of scaling, SMP goes vertical, you need a bigger box; MPP databases scale horizontally.

Now in terms of storage, traditional databases usually store their data in rows, which is great for online transaction processing, but not so great for analytical queries, because you end up reading a lot of redundant data. Redshift is a columnar database, meaning we store the data in columns rather than rows. So when you're asking an analytical query where you only want a few columns from a table but you want a lot of rows, it's fantastic: it goes to the disk, runs straight down the column, and pulls all the data out. Which leads us to the workload. Traditional SQL databases are fantastic for transactional workloads; that's what they're built for. Databases like Redshift are built for analytical workloads.

Now I hate it when... VoltDB? Let me come to that in a sec. So I hate it when vendors put up benchmarks that show you how fantastic their database is. That was not the purpose of putting this up, actually, although Redshift does do quite well in this comparison. This was more a statement about where the industry is going. This is the AMPLab Big Data Benchmark, done about a year ago, and you can see Hive is obviously pretty slow; Tez has helped a bit; but then you see things like Impala bringing that query time right down, and Redshift very much in that interactive query bucket. The other interesting thing about this is that Shark is already deprecated, so if they ran this again today you'd probably see things like Presto up there, and obviously not Shark. And BigQuery? Yeah, similar in some ways, but BigQuery is a bit more of a black box: you don't quite have access to all the ANSI-standard SQL in there, and it's not quite as compatible with front-end tools as Redshift.

The other thing about Redshift that we like is the scalability of it. You can start very small, with a single node of 160 gig, then add nodes as you go, and if you really get into it you can scale up to two petabytes if you need to. So plenty of headroom there, plenty of scalability to grow into later.
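From the client side, Redshift looks like a regular SQL database; it speaks the PostgreSQL wire protocol, so a standard driver like psycopg2 works. Here's a minimal sketch, with a hypothetical cluster endpoint and table, running the kind of few-columns-many-rows aggregate that columnar storage is good at:

```python
import psycopg2

# Hypothetical cluster endpoint and credentials.
conn = psycopg2.connect(
    host='analytics.xxxxxxxx.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='analytics', user='admin', password='...',
)

with conn.cursor() as cur:
    # Touches only two columns of a wide table: a columnar scan
    # reads just those columns instead of every row in full.
    cur.execute("""
        SELECT action, COUNT(*) AS events
        FROM clickstream
        WHERE ts > DATEADD(day, -7, GETDATE())
        GROUP BY action
        ORDER BY events DESC;
    """)
    for action, events in cur.fetchall():
        print(action, events)
```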
So that's basically it. That's the AWS view of the Lambda architecture. Now as I said, we don't want to be prescriptive about this, and there are many different ways to do it. One of the things we've seen recently is people using Spark in here, for example: using Spark Streaming and Spark SQL on Hadoop, talking to Kinesis, so not using Redshift at all. And then just very recently our friends at Qubole open-sourced a Presto connector for Kinesis. Presto is basically an open-source MPP engine that sits on top of Hadoop; it talks to HDFS, talks to S3, and now talks to Kinesis as well. So it's a great way to get a SQL interface and query multiple different data sources. So there are a lot of different ways to skin the cat here, but it's really exciting to be part of it.
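To give you a flavour of that Spark variant, here's roughly what attaching Spark Streaming to a Kinesis stream looks like in PySpark. The stream name and intervals are made up, and this assumes the spark-streaming-kinesis package is available on the classpath:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName='kinesis-speed-layer')
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# One DStream over the hypothetical 'clickstream' Kinesis stream.
events = KinesisUtils.createStream(
    ssc,
    kinesisAppName='speed-layer',          # also names the checkpoint table
    streamName='clickstream',
    endpointUrl='https://kinesis.us-east-1.amazonaws.com',
    regionName='us-east-1',
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
)

# Speed layer: cheap, fast counts per batch; the batch layer
# recomputes the authoritative numbers later from the master store.
events.count().pprint()

ssc.start()
ssc.awaitTermination()
```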
Thanks very much guys. Any questions? Now remember, before we get excited about asking questions, there are no t-shirts. Oh, I have an update: there are t-shirts. Do we need t-shirts? How many t-shirts have we got? Maybe we've thrown the t-shirts away now. Okay, let's do five questions.

So was your question around what level of compliance there is? From an industry perspective? Yeah, so let me see if I've got your question right. Your question was: what level of compliance can I get against various industry regulators and so on, on the cloud, specifically from AWS? So if you go to our website you'll see we've got a whole section on risk and compliance, and you'll see all sorts of different reports. You can get the SOC reports, ISO reports, PCI reports, all sorts of stuff that will give you that level of compliance, and we've got a couple of really good white papers that talk through that as well.

You talked about Hadoop. How can we move already on-premises data into AWS? Because, for example, I have 10 terabytes of data. How can I move it from on-premises to the cloud? Is there any example, any suggestion, for how we can offer a solution to our customers? Yeah, so the question was: if I've got a lot of on-premises data, how do I get it into the cloud? There are a couple of options. If there's a lot of it, you could use something like Tsunami UDP, for example, which gives you basically WAN-optimization software, so it will maximize your bandwidth when you're pushing stuff into AWS. Or you can potentially put it onto a disk and ship it to us, which sounds a bit old-school, but you can fit a lot of data on a disk. If you send us the disk, we'll put the data into S3 for you and send the disk back. So if you've got 10 terabytes, that might be a good way to go.

Yeah, so the question is, what... so there was never an S2, there's always just been an S3. But S3: we write everything in S3 across three data centers, so it's very highly durable. You get 11 nines of durability, and you can scale it up as much as you like in terms of pushing data in. There's a CLI, there's an SDK; you can run those on-premises and push the data into Amazon. Yeah, sorry, let me... we'll take that offline. Yeah, because we can't give you five t-shirts.

Yeah, so within Redshift, on a node you've obviously got the disks attached to the node. Now, we will mirror the data from each disk onto a disk on a separate node. So any data you put in there will have a primary partition and a mirror partition; we take care of that under the covers, so you'll always have your data in two places. We'll also back it up to S3, so you'll have an S3 copy of that data as well.

Do you mean if you're in the middle of a query? Right, okay. So, Redshift is not quite analogous to MapReduce. Basically it will push the SQL query down into the nodes, and the nodes will then run that query against their data set; that's the concept of stages. If you kill the query halfway through, then that query will roll back completely. So it's less of a long-running batch processing engine; it's more of an interactive database. Exactly, yeah.