All right, so let's get started. So this is lecture 24 for Computer Science 162. So today's lecture is a capstone lecture. What does that mean? It means we're going to take a lot of the concepts that we've looked at throughout the semester and look at them in real-world systems that people are using, and also some of the research that people are doing in this area. Specifically, today's capstone is on cloud computing. So we're going to look at distributed systems. We're going to look at the programming paradigms that people have developed for cloud computing. And then we're going to look at a cloud computing operating system. So a little bit of background. The 1990s, that was really when everybody was working on parallel computing. People were building multiprocessors, trying to put as many single-core processors into a box as they could. They typically got maybe 32 to 64 of these. But the amazing thing was that at the same time as they were putting lots of these processors into a single box, they were seeing 52% growth in performance per year. So this was the best time to be a programmer. Because you write a program and you just sit there, and the program gets faster. And the next year, it gets faster. You don't have to change anything. You just watch your programs get faster and faster. And that was great, up until 2002, when we hit the thermal wall. At this point, what we saw is that the frequency at which we could run a processor was limited by the amount of heat that we could dissipate. As we increased the speed of the processor, it generated more and more heat, and we had to dissipate that heat. And people started to look at alternatives. So they looked at what the mainframe community did, which was to use water cooling, at the time for their bipolar ECL chips. So they started thinking, let's build servers with water cooling. And Digital Equipment Corporation even had a server they called Aquarius that was water cooled. It never made it to the market. Why? Water and electronics don't mix very well. So not exactly something you wanna do. And even today, there are some hobbyists who will build water-cooled computers, or even more exotic liquid nitrogen or liquid helium cooled computers. And yes, they can push them up to five, six gigahertz, even with off-the-shelf processors. But for the rest of us who are using air-cooled microprocessors, we hit a thermal wall. Now, Moore's law said you're still gonna have lots of transistors, right? Because the feature size is getting smaller and smaller each and every year. So let's just put lots and lots of processors on a single die. And this is the multi-core revolution. So we reached this point 15 to 20 years later than we thought we would. We thought we were gonna peak out in processor speeds back in the 80s, or even potentially in the 70s, but we were able to keep pushing processor speeds up until about 2002. And so what you see here is a graph that shows the years from 1980 up to about 2011 on the x-axis, and on the y-axis the clock speed, in MIPS per CPU, that you could get. And what you see is everything was going well until, again, we hit this wall in 2002, and now nothing's going faster, right? We can put lots and lots of cores onto a die, but they don't go very fast. And even that has kind of hit a wall. People were predicting that we would have 100-core processors by now.
And yet, commodity processors are typically in the eight to 16 core range at most. It's gonna increase, but slowly. One of the gating issues is IO. You can put a lot of cores onto a chip, but you have to feed them a lot of IO. You have to get a lot of memory bandwidth in and out, and you're limited by the number of pins. So people are actually looking at technologies now where you would stack the memory on top of the processor and actually go through the silicon of the memory die into the processor die. Okay, so that's the hardware side. On the other side is the data. And the data, people are calling it big data; it's really a data tsunami. Data comes from all sorts of different places. Everything we do is happening online, right? Today is Cyber Monday. Everybody's shopping online right now. Hopefully not here in class. But every click that you make on the web gets recorded. Every ad that gets shown to you gets recorded. Every time you have a billing event, say you send a text message when you're not on a flat rate plan, or you make a phone call, right? All of that information is being stored in databases, data stores, data warehouses. If you use Netflix or Amazon video or YouTube, every time you watch a video, that gets recorded. Every time you fast forward through a video, that gets recorded. Every time you pause it, that gets recorded. They create heat maps of the video: where do people back up and watch a video over and over again? You can probably figure out where. Social networking. Every time you make a friend request, every time you add a friend, every time you delete a friend, that gets recorded. Transactions, right? Everybody probably flew home this past weekend. Those airline reservations, those are transactions. You go to the ATM, that's a transaction. You use a debit card at a point of sale terminal, that's a transaction. All of that data is being recorded in databases. American Express, which is a company that offers a credit card directly to its customers, unlike Mastercard and Visa, which have to go through banks, records every single transaction that has ever been made by every single card holder. That's a phenomenal amount of data that you can mine in very interesting ways. They can figure out who's about to get divorced by looking at their credit card records. And then they cut your credit line, because divorce causes financial problems. Network messages and faults, all of this gets recorded. And then people want to mine it for business intelligence purposes or for other purposes: which ad do I show you? There's also a tremendous amount of user-generated content. All those Vine videos, and Instagram, YouTube, Twitter, Yelp, Facebook, and so on; all of this content was originally primarily generated on the web. Now the vast majority of this content is being generated by mobile devices, because you always have them with you. The Internet of Things is something that is relatively new and is really growing. This is M2M, which stands for machine to machine. The idea is, in the past, you might have a temperature sensor or humidity sensor that you put out in your garden, and then you have a little receiver inside your house, and it can tell you when the soil's dry, and then you go out there and you water your plants. In the Internet of Things, your irrigation system is hooked up to that sensor. So when it's dry and the plants need watering, it automatically tells your irrigation system to water the plants. And we're gonna see more of this machine to machine type of communication coming along.
Autonomous vehicles can use a form of machine to machine communication called vehicle to vehicle. So as you're driving down the highway in your automated vehicle, it's communicating with all the vehicles around it about current road conditions, traffic, and other things. And then the last area of really phenomenal growth in data is, and it's a little hard to read this graph, this is a graph that shows the cost of computing, that's this blue line, versus the cost of sequencing DNA. And what you can see is the cost of sequencing DNA is dropping much, much faster than Moore's Law. So the sequencing side, that's the wet lab side of things, doing the PCR and the actual sequencing of the DNA bases. But then after you sequence those bases, you have to reassemble them into a genome, and that takes a lot of computer power. Why is this important? Well, unfortunately, everybody eventually gets sick or gets some disease. When you go to the doctor, they give you just sort of a standard treatment. It turns out that if I know something about your genome, I can craft a much, much better treatment plan for you. And also, if you have a virus or a bacterial infection, understanding the genome of that virus or bacterium can allow you to deliver a much more targeted therapy. And so this form of personalized medicine is just taking off. But it's all gated right now by the cost of computing. Computers are making this very expensive. But look at the cost to sequence someone: a decade ago, it cost a billion dollars to sequence the first person. Now it costs about $2,000 to do a whole genome sequence. In a few years, it'll cost about $150, the same cost as a normal diagnostic test. So you go to the doctor when you're sick, they'll take some blood and instantly give you a readout of your genome. All right, so all of this is generating data. Even more data is coming from all these users that are connected to the network, using their cell phones, using social networking, browsing the web, producing massive amounts of content. The other thing that's really interesting about this massive production of content is that it's not a nice steady progression; it's all exponential. So 80% of the content on Facebook, all those pictures of cats and things like that, was produced in the last year. And next year, it's gonna be the same thing: 80% of the content will have been produced in the previous year. Yes, very good question. So is it because more data is being produced or because of different types of data? Both. Facebook allows you to upload videos, they allow you to upload pictures, and those are much, much larger content than status updates. But more people are joining Facebook, and every person who joins Facebook generates additional content. So it's on both sides: more users generating more interesting content, and more types of content that people are uploading. And they don't put a limit on it, right? Facebook doesn't say you get 20 gigabytes worth of photos. They just say upload your photos. The good news: storage is getting cheaper. So here's a graph that shows disk capacity as a function of time, and it's growing exponentially. However, data is growing faster.
And, oh, this graph is unfortunately still messed up, but here are some graphs that show the amount of data being generated by genetic sequencing, by particle accelerators, and overall the data being generated across the world. This is the Large Hadron Collider. It generates several petabytes worth of data every year. Most of the data, the vast majority of the data that's collected in the experiments, is thrown away immediately. Like 99.999-something percent of the data. Now, this is kind of amazing, because they're trying to find more evidence, they've already found it, but more evidence of the Higgs boson. It's like a one in 10 million event. And they're throwing away most of the data. So they have to be really, really intelligent about the data that they keep and the data that they throw out. Because otherwise they could throw away that one in 10 million event and not have discovered the Higgs boson. Okay, so the challenge here is computers aren't getting faster, and we're generating more data than we can store and process in a single computer. So how are we gonna resolve this dilemma, this mismatch between the amount of data we're generating and wanna process and our capabilities? The solution that's been adopted by web-scale companies is to parallelize and distribute. So here's a picture of a typical internet data center with tens of thousands of machines. Rather than trying to go with how many cores we can fit in a single box, we just go many, many, many, many boxes. So this takes us into the world of distributed systems, which we've looked at already in this class: a loosely coupled set of computers that are communicating through message passing, trying to solve a common goal. And we know distributed computing can be really, really hard. We have to deal with partial failures. What's an example of a partial failure? Well, if I have tens of thousands of machines, machines are gonna be failing all the time. So if I have a computation that spans 10,000 machines, it has to be able to deal with the fact that occasionally one of those machines is gonna fail. And it shouldn't require me to start the computation all over again. That's the traditional model in the HPC world: I have to load from a checkpoint and roll forward again. It's very, very expensive. Then we have to deal with asynchrony. Where does this come from? Well, these machines are running at different speeds. When I'm building a data center that has tens of thousands of machines, and I have hundreds of thousands or millions of machines across the globe, I'm not going to just go out every three years and buy another million machines to replace them. I'm gonna be constantly replacing them. Maybe every six months, I start buying the new version of the computers. So some of my computers are gonna be faster than others, and I need to be able to deal with that mismatch in performance. So failures and asynchrony are a huge problem. So what's the difference between this and traditional parallel computing, like what you'd find up the hill at LBNL in the supercomputer center on a really big parallel machine? Well, it's in dealing with things like partial failures. Supercomputers don't deal well with failures, whereas distributed computing is all about dealing with failures. Now, there's many different ways we can do our distribution of data and computation. One way is by doing message passing.
And this is the traditional approach that's used in supercomputers. They make message passing really, really fast. So when I have a partial result of my computation from an iteration, I use message passing to send it to other nodes, and that's a very, very fast operation. Another approach is to build some form of distributed shared memory, so it looks like all the machines are operating on the same memory. This is very expensive to implement and very hard to make perform well, so not a lot of people use distributed shared memory. Remote procedure call, that's another technique that we've seen in this class, and it's perhaps the most common approach: we build the ability to do a procedure call on top of the ability to send and receive messages between hosts. So three different ways we can build our applications, but it's still really hard. Because I have to design my application to deal with scale, to deal with the fact that things can go wrong, and to deal with the fact that I can have inconsistencies between my machines. Yes. So the question is, what's the difference between a remote procedure call and message passing? In the case of message passing, send and receive are the only primitives I have: send to one or more hosts, and receive a message. Whereas with remote procedure call, I can do "call foo on machine X." It'll automatically bundle up my arguments and send those over using the underlying send and receive. At the other end, it receives that message, unbundles the arguments, does a local procedure call, bundles up the result, and sends it back. I receive the result, unbundle it, and do a local return from that call. So from the point of view of the programmer, an RPC looks like a local procedure call. However, it can fail, so it has slightly different semantics in the failure case than a local procedure call.
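To make that stub machinery concrete, here's a minimal sketch in Python of what an RPC layer hides from you. This is illustrative, not any particular RPC library; the pickle-over-a-socket wire format and the single-recv framing are simplifying assumptions that only work for small messages.

```python
import pickle
import socket
import threading

# Server side: receive a message, unbundle the arguments, do a local
# procedure call, bundle up the result, and send it back.
def serve(sock, functions):
    conn, _ = sock.accept()
    with conn:
        func_name, args = pickle.loads(conn.recv(4096))   # unmarshal request
        result = functions[func_name](*args)              # local procedure call
        conn.sendall(pickle.dumps(result))                # marshal the reply

# Client-side stub: bundles up the arguments, sends them over, waits for
# the reply, and unbundles the result -- the caller just sees "call foo".
def call(addr, func_name, *args):
    with socket.create_connection(addr) as conn:
        conn.sendall(pickle.dumps((func_name, args)))     # marshal request
        return pickle.loads(conn.recv(4096))              # unmarshal reply

if __name__ == "__main__":
    server = socket.socket()
    server.bind(("localhost", 0))                         # any free port
    server.listen(1)
    threading.Thread(target=serve,
                     args=(server, {"add": lambda a, b: a + b})).start()
    print(call(server.getsockname(), "add", 2, 3))        # prints 5
```

The caller just writes `call(addr, "add", 2, 3)`; all the bundling and unbundling is hidden in the stubs, which is exactly the convenience RPC buys you, until the network fails mid-call.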
But the challenge is, I don't want to have to rewrite. So today I do my startup, I have 1,000 customers, and I've written my application. Now I have 100,000 customers when I go public. Do I have to rewrite all my applications? What happens if I end up with a million customers? Do I again have to rewrite all of my applications? I don't want to have to. And again, when I've scaled up to a million customers and I've got thousands of machines, those machines are gonna be failing with some frequency, and I don't want to have to rewrite my application to deal with that either. So this model of warehouse-scale computing was really summarized by two researchers at Google, Luiz André Barroso and Urs Hölzle. Their definition of a program is something like web search, calendaring, email, mapping. And a computer, in their terminology, is thousands of computers that are running these programs, along with the storage and the network. So you have these warehouse-sized workloads running on these warehouse-sized facilities. That's very different from what we've been talking about all semester. We've been talking about a program as a process or a set of processes running on a single computer, or maybe one or two computers. You're doing that right now for project four; you've got a couple of computers that are involved. Here, we're talking about tens of thousands of computers, and we're talking about applications that have tens of millions of simultaneous users. And even better, Google is very frugal. So rather than building these warehouse computers out of ultra high-end server-grade components, they build them out of the bargain bin. They try to build them as inexpensively as possible. But that means they're gonna fail more often. We already had to deal with failures in this model, so failures are the new normal, basically. So we can ask the question: if the data center or the cloud is our new computer, what does it run as an operating system? I'm not talking here about the host operating system, the Linux or the Windows or the OS X that each node runs, but how do I manage the resources of a fleet of computers? Think about that; we're gonna come back to it. In a classic operating system, I have data sharing. We looked at inter-process communication, we looked at remote procedure call; we have files, and we can communicate through the file system; we have pipes, and I can pipe from one program into another program. We have programming abstractions: we have libraries and system calls, and you implemented system calls earlier in this class. We have multiplexing of resources: we have to schedule the CPU, we have to manage the limited amount of physical memory that we wanna share across all of the processes for their virtual memory, and we have the challenge of allocating files and protecting those files from different users. We have many of the same problems in the cloud. In the cloud, we have to deal with sharing, so we have abstractions like the Google File System and key-value stores. You're implementing a key-value store: you did one for project three, and you're gonna do a distributed one now for project four. We have programming abstractions like MapReduce and Pig and Hive and Spark; I'm gonna go through these later on in the lecture. We also have to deal with the multiplexing of resources. We have multiple applications that wanna run on our 10,000-node computer. So how do we divide up the resources between the different jobs and the tasks that those jobs have? There's a number of projects here; I'm gonna talk about Mesos, but we already talked earlier in the semester about ZooKeeper. Every once in a while, we have these fundamental paradigm shifts that happen in computer science. The latest one was really launched by a paper written by folks at Google in 2003, describing the Google File System. And this was followed up in 2004 by a paper that described MapReduce. So the Google File System gave you a distributed, cluster-scale file system with a single namespace. What's unique about that? At first pass, it doesn't seem very unique; lots of people have built distributed file systems that give you a single namespace. The difference is this was a file system designed for massive scale, to run on thousands of machines, not dozens of machines, and to deal with failures as the norm. MapReduce, same thing. There's nothing new about a parallel programming language. What makes it unique is its ease of use, its ease of scalability, its handling of fault tolerance, its handling of asymmetry in the resources that machines may have. All of that, together with the file system, gave us this new programming paradigm that's taken off like wildfire. Now, it hasn't taken off like wildfire because of Google, but rather because the Apache Software Foundation has created open source versions: Hadoop, with the Hadoop Distributed File System, HDFS, and Hadoop MapReduce. Yeah, question? Yeah, so that's a good question.
Is the single namespace a new thing? Was that not possible before? No, there were lots of distributed file systems that gave you a single namespace. What made GFS and HDFS unique? Scale. Google publishes this paper and they talk about manipulating petabytes of data, tens of petabytes of data a day. At that time, most enterprises had maybe tens of terabytes. Hundreds of terabytes was a big, massive number; you had to spend a lot of money to build a file server that could serve up 100 terabytes. And now Google's talking about two orders of magnitude past that, right? Tens of petabytes worth of storage. How did they do it? They use the really inexpensive disks on each machine in their 10,000-node clusters, and they take files and stripe them in 128 megabyte stripes across all of those machines. Now, they picked 128 megabytes, and you should know why: we've got mechanical disk drives, and these big block sizes give us high throughput for doing sequential reads and sequential writes. Remember, that's how we maximize the performance of a disk, by doing large reads or large writes. If we made the stripes smaller, the overhead of seeking would start to dominate. So we take this data and stripe it across hundreds or thousands of servers. If I tried to read 100 terabytes on one node and I could read it at 50 megabytes per second, it would take me 24 days. If I stripe that instead across a thousand-node cluster, even with all my redundancy and the overhead of communication on the network, I can do it in 35 minutes. So now if I wanna scan something like 100 terabytes, I can do that really fast on my cluster, whereas it would take me days or weeks on a single node. Yeah, question? So some more insights. In addition to massive scale, there's a recognition that failures are the new norm. If I have one node, it might fail once every three years; your computer might die. If I have a thousand computers, we're talking about a computer going out of service once a day. So GFS had to replicate the data, because nodes go unavailable as they fail. They're using commodity hardware. So if you're gonna make failure the norm, take advantage of it: buy the cheapest possible hardware out there, and that's what you use. The last thing is a very simple consistency model. Nothing complicated about it. You only get a single writer with GFS and HDFS, and files are append-only, so data's immutable: you can write data, and then you can't rewrite it. Very simple, versus the much more complicated models. Think about AFS or NFS; that consistency comes at a cost. Here, the simple consistency model makes it very simple to implement. MapReduce also brought a number of key insights. A very simple model, just keys and values: take a fine-grained operation, a map or a reduce, and apply it in massive parallelism to keys and values. But this requires that the operations be deterministic, and MapReduce also requires that the operations be idempotent; they can't have any side effects. The only communication that you're allowed to have is through the shuffle. You read from HDFS, compute locally, write locally, shuffle, and then write to HDFS after you're done with the reduce. And these outputs from the maps and the reduces get saved on disks: map outputs get saved locally, reduce outputs get saved into HDFS.
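To make the model concrete, here's a tiny word-count sketch in Python that simulates the map, shuffle, and reduce phases in a single process. The function names are illustrative, not Hadoop's actual API; a real framework would run the map and reduce calls in parallel across machines and route the shuffle over the network.

```python
from collections import defaultdict

# map: deterministic, side-effect free; emits (key, value) pairs
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# reduce: applied to all the values the shuffle grouped under one key
def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(lines):
    shuffle = defaultdict(list)              # the only communication channel
    for line in lines:                       # map phase (parallel in reality)
        for key, value in map_fn(line):
            shuffle[key].append(value)
    return [reduce_fn(k, vs) for k, vs in shuffle.items()]  # reduce phase

print(mapreduce(["the cat sat", "the cat ran"]))
# [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]
```

Because map_fn and reduce_fn are deterministic and side-effect free, a framework is free to rerun any task that fails or runs slowly, which is exactly what the fault tolerance we'll see in a moment relies on.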
So how do people use this? Everywhere. For Google, the index is built using it. Articles for Google News are clustered using it. Statistical machine translation uses it. At Yahoo, the search index is also built using it, and spam detection and fraud detection are done using it. Facebook uses it for data mining and ad optimization, that's how you make all your money, and for fraud detection and spam detection. So, advantages of MapReduce. Why is it that everybody uses MapReduce? When MapReduce first came out, it was the kind of thing that grad students would play around with. Now we teach it in the lower division. It's become a standard toolbox kind of thing that you go to. Distribution is completely transparent. If you look at a MapReduce program, there's not a single line of code that deals with the fact that it's a distributed program. It's completely hidden from the user. This makes things much, much easier, and you don't have to worry about getting the distribution logic correct. It's really hard to write a program using MPI, the Message Passing Interface, and get it to be correct. People spend a lot of time up the hill at the supercomputer center writing their programs and debugging their programs. MapReduce is much, much easier. You get automatic fault tolerance. Because the tasks are deterministic, we can just run them again, and because they're side-effect free, we don't have to worry about repeated executions. We also save the intermediate data from a map, and that allows us to just rerun a failed reduce. Automatic scaling. Because there's nothing in your code that says how many nodes you're running on, we can run on whatever number of nodes the system has. If it has one node, we run on one node. If it has 100 nodes, it runs automatically on 100. If we have 10,000, it runs on 10,000. If we have 100,000, it runs on 100,000. No changes required. It doesn't get any easier or any better. Automatic load balancing. Remember I said you're buying this equipment in waves, so some machines are really fast and some are really slow. Well, if half of my job is running on fast machines and half is running on slow machines, those slow machines are gonna gate the computation; they're gonna make it run slower overall. Speculative execution in MapReduce allows me to launch duplicate tasks on the fast machines, and whichever copy finishes first wins; we kill off the other one. Yes. So the question is, does it automatically tune how many speculative tasks to launch? Yeah, there's been a lot of work on how to detect when you have a straggler and determine when you should launch new tasks speculatively. We had some research a few years ago on a better algorithm for doing that. This is one of those things where there's a lot of tuning that goes into how aggressively you launch new tasks speculatively versus how aggressively you launch new jobs. And there's a trade-off, right? You wanna get jobs through the system as fast as possible, but you also wanna run as many jobs in parallel as possible. Yeah, question? Yes. So that's a very good question, and I'm gonna cover that in just a moment. The question was, does locality matter? And the answer is absolutely. Locality matters: going across the network to get a value from HDFS on another node is very expensive, and you wanna avoid it whenever possible. Some scheduling environments let you exploit locality, some don't, and we'll look at some of the research that my group has done that lets you do that.
So this might lead you to think, with all these great benefits of MapReduce, that we're done. But the world's not that great. There are some disadvantages to using MapReduce. The biggest disadvantage is that it has a very restricted programming model, and that means it's not always easy to express problems in this model. Graph problems can be very hard to express in it. Machine learning algorithms that are iterative, where you're repeatedly doing some computation to try to reach convergence, can be implemented in MapReduce, but they're gonna be very, very expensive. You're also doing low-level coding. You're having to worry about the format of things; these values are formatted objects that you may have to marshal and unmarshal to use them. And as I said, there's not very much support for iterative jobs; you end up doing lots of disk accesses, and we'll see this in just a moment. It's also very expensive to launch a new task, a new job rather, and so this really makes MapReduce better suited for batch types of jobs. So of course, computer scientists love a gap like this, because it launches research projects. And so there are a bunch of research projects that have come along to try to address some of these issues. Pig and Hive, which we're gonna look at in a moment, give you a much higher-level coding model. Underneath, it's using MapReduce, but at the high level you're writing something that's very easy for a programmer to understand. And Spark is a framework developed here at Berkeley that works well for iterative and low-latency types of jobs. So let's start with Pig. Pig is a high-level language. It allows you to express sequences of MapReduce jobs, and it gives you a very SQL-like model, so you can do things like join and group-by and so on. It's very easy to plug in arbitrary Java functions. It was started at Yahoo Research, and it runs about 50% of Yahoo's jobs. So how does this work in practice? Let's look at an example. Here's a question a business intelligence person at Yahoo might wanna know the answer to, because they're gonna try to sell ads or figure out which sites should have the most expensive ads on them. Given user data in a file, so this is information about your users, and given website data, so this is click data, click traffic: find the top five most visited pages by users aged 18 to 25. That's my target demographic. So this is what it would look like if I wanted to write it database style. I load my users table, I load my pages table, I filter my users by age, so 18 to 25, and then I join these two tables, page visits and users, by name. Okay, because I have the name associated with a click; that's what cookies and other things get you. And then I group by URL, so I gather up all the same URLs. Then I count the number of clicks associated with each URL, so the number of times it appeared in pages. Then I sort that by clicks, and then I take the top five. So not a really difficult task to express. Let's write it in MapReduce. What would that look like? It would look like this, okay? I'll give you a moment to read it. Now, the point is you can do it. But this is not what you want your typical business intelligence analyst to have to write, because they're probably gonna get it wrong. What does it look like in the programming language for Pig, which is called Pig Latin? Something like this, very simple; it's exactly the steps that I just talked about, right?
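The slide itself isn't reproduced in the transcript, but the Pig Latin script would look something like this sketch, with illustrative file names and aliases, following the steps just described:

```
users    = LOAD 'users' AS (name, age);
filtered = FILTER users BY age >= 18 AND age <= 25;
pages    = LOAD 'pages' AS (user, url);
joined   = JOIN filtered BY name, pages BY user;
grouped  = GROUP joined BY url;
summed   = FOREACH grouped GENERATE group, COUNT(joined) AS clicks;
sorted   = ORDER summed BY clicks DESC;
top5     = LIMIT sorted 5;
STORE top5 INTO 'top5sites';
```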
You load the users as name and age tuples, you filter the users by age between 18 and 25, you load the pages as user and URL tuples, and you do a join across the filtered users by name and the pages by user. Then you group all the URLs that are identical, and for each of those groups of URLs, you count the size, the cardinality, of the group. Then you sort that in descending order, take the top five, and store the result into top five. And each one of these steps matches up with one of those natural, database-like operations that a business intelligence person could understand; they took a databases course. And this translates nicely into a set of MapReduce jobs. There's a job that does the loading and filtering and joining on the name. There's another job that does the grouping on the URL and counting the clicks. And another job that does the ordering, the sorting of the click data, and then takes the top five from that. Very simple, very straightforward. And this is the kind of interesting opportunity you have: when someone creates a new paradigm, there are always gonna be deficiencies in it, and then you can come along and create something on top of it. This is an open source project. Another project is Hive, developed by Facebook, and it's basically a relational database built on top of Hadoop. It gives you table schemas and a SQL-like query language, which can also work with Hadoop's streaming support. And it supports all the typical things that you'd like to do in a database: table partitioning, complex data types, sampling, and a limited degree of query optimization. This is what's used for many Facebook jobs. I could keep going; there are all sorts of different data models that people have built on top of Hadoop. Cassandra would be another data store that came out of Facebook. So, Spark is a project that was done here at Berkeley, led by Matei Zaharia, a grad student here who just graduated. And he saw an opportunity for research in the following problem. There are lots of complex jobs and interactive kinds of queries, where I'm sitting at a terminal typing in queries that I wanna run against petabyte-scale data. And MapReduce has a problem: it doesn't deal well with data sharing. All the data sharing in MapReduce goes through the file system. So you're constantly reading data in, deserializing it to convert it into an in-memory format, processing it, serializing it, and writing it back out to disk. And that serialization operation, especially in Java, is very, very expensive. But reading from and writing to disks is also very, very expensive. So examples of the computations that he saw as problematic: iterative computations, where I read in from disk and perform a computation whose results feed into another stage, which feeds into another stage, which feeds into another stage, and so on. Interactive data mining, so this is the what-if questions. A business intelligence analyst is sitting in front of the computer posing questions. They want to see the top visited sites for ages 18 to 25. Then they want to see 26 to 30. Then they want to see it by gender. Then by geolocation. They're just exploring the data: lots of queries drawn from the same data set. And stream processing. This one is really appropriate for today.
So Amazon has historical data on how people use their website and what the top products are that people are searching for and buying. Today is Cyber Monday, one of Amazon's biggest sales days of the year. So they'd like to be continuously running queries that, after the initial data set has been loaded, are updated by all the click traffic and sales traffic that's occurring throughout the day, so they can figure out what they should do in their next lightning sale or something like that. These three kinds of computation, iterative, interactive, and streaming, are not a good fit for MapReduce, because the only way to communicate between these related jobs or related queries or related stages is through the file system. And that's gonna be really, really slow. So if we look down a level, what's going on here? Here's our iterative computation. We do our HDFS read of the input, then we do an iteration. At the end of the iteration, at the end of the reduce, we write back out to disk. So we have to serialize, and then send off three copies to store in HDFS. What happens immediately after that? The next iteration starts up, reads the data we just serialized out to disk back into memory, deserializes it, and computes on it. And then it does the same thing, writing the data back out, and we just keep doing this. Same thing for the interactive query. Every time we run the query, we have to pull all the data, that petabyte of data, into memory, deserialize it, compute on it, and then throw it away. It's a lot of wasted work. So what Matei noticed is that back in 2003, memory was really expensive, but now, in 2010, memory is really cheap. So why not use main memory as a cache and store that intermediate computation, that intermediate state that we've loaded from the disk, in memory? So now the problem looks like this. After we finish an iteration, the results are sitting there in memory, so we just let the next iteration read those results. Same thing here: we do one-time processing to read that petabyte of data into memory, and now each of these queries can run locally. Back in 2003, the typical amount of memory that someone might put in a server was two gigabytes, maybe four if you had a lot of money. Today, people build servers with 128 gigabytes, 256 gigabytes, 512 gigabytes. The cost of memory has really plummeted, so why not take advantage of it? We're looking at sizes of main memory equivalent to the sizes of the disk drives that were in those machines back in 2003. If you do all of this, you can make these workloads run not just an order of magnitude faster, but potentially two orders of magnitude, 100 times, faster. So now you really can be sitting at your terminal, browsing through a petabyte of data, running queries on a petabyte of data, and getting real-time results. That's pretty amazing.
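From the programmer's side, that caching idea looks roughly like this minimal PySpark-style sketch of the classic log-mining example; the file path and the filter strings are illustrative assumptions, and a real deployment would point at a data set on the cluster rather than run locally.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "log-mining-sketch")

# One-time processing: read the data set in, then pin it in memory.
logs = sc.textFile("hdfs://namenode/logs/access.log")   # illustrative path
errors = logs.filter(lambda line: "ERROR" in line).cache()

# The first query pays the disk-read and deserialization cost
# and populates the in-memory cache...
print(errors.count())

# ...and subsequent queries run against the cached partitions,
# with no trip back out to HDFS in between.
print(errors.filter(lambda line: "checkout" in line).count())
print(errors.filter(lambda line: "login" in line).count())
```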
Now, how does he actually make this work? He takes the records that we've read in from disk and partitions them across the memory of all the machines in the cluster, and then they're manipulated by doing maps and reduces and filters and joins and so on. Now, the problem is, this is commodity hardware, which means it can fail. If a node fails, we lose the intermediate result that was stored in its memory; it's like losing one of these blocks of memory. One solution would be to copy each block to another node. But copying it to another node is really expensive: you've gotta use the network, you have to serialize it, and we were trying to avoid serializing it in the first place, and at the other side, deserialize it. So instead, because all of these operations are idempotent and deterministic, he just remembers the lineage: how did I get to this particular object? So if I lose this block of memory, all I have to do is remember, well, I read from disk and I performed one iteration, and that created this particular block on a machine. Or if it was a later block, I might have to recover the earlier block first and run the iteration again. Very simple: treat memory as a cache. So this just entered incubation with Apache, so it's open source, you can download it, and a lot of people have actually started downloading it. This is a research project. This is from last year, January 2012: there were 1,000 enthusiasts in the Spark Meetup group. They held a meetup at Twitter, and they actually had to call security because so many people who hadn't signed up for the meetup tried to gate-crash it; it was standing room only. Matei has moved from Berkeley to Boston; he is an assistant professor now at MIT, and there's now a Spark Meetup group there. There's even a Spark Meetup group in Hyderabad, India. There's a bunch of startups there that have recognized the value of Spark, and so now there are 58 enthusiasts there that are using Spark. Question? Yeah, exactly. So the question is, if Spark is designed for these iterative kinds of applications, are there frameworks being developed that are designed for sampling the data for statistical calculations? Yes. There's a project in the AMP Lab called MLbase that is looking at exactly that: sampling techniques that let you compute on a small portion of the data and then determine when your computation has reached some epsilon of the optimal bound. Okay, any other questions? So a takeaway here, for those of you who are interested in going to grad school: it's possible to have pretty phenomenal impact as a grad student, right? To have thousands of people at hundreds of companies going and downloading your software and using your software to create the next big startup. And he's also actually created a startup; there's now a Spark startup. Okay, so some administrative stuff. We have a design doc due today. We shifted the date to give you a little bit more time to focus on the midterm. Code is due next week on the 12th by midnight. And our second midterm is Wednesday. So a couple of very important things. It's at this same time. If your last name is A through L, you'll be in this room. If your last name is M through Z, you'll be in 2060 Valley Life Sciences Building. So that's the same as before. The midterm will actually cover lectures 13 through 24. I had said 14 to 24, but it's actually 13 to 24. And it covers the relevant projects and any of the readings that were assigned during that period. You can bring one handwritten sheet of notes, one page, double-sided. Office hours: tomorrow I will have an extra half hour of office hours, so from 10 to 11:30 in the same location, 449 Soda. And the TAs have also posted their office hours for this week. We're still finalizing office hours for next week, so we'll probably have another one of these big project office hour sessions on Wednesday, probably sometime during the day. Any questions? All right, with that we will take a five minute break.
Okay, so let's get started again. So one of the roles of a cloud operating system is scheduling, and scheduling in a data center is a really big problem. There's lots and lots of innovation going on in data center computing frameworks. For example, we looked at Pig just recently, but there are many other frameworks that people are building. Why? Because there's no framework that's optimal for all applications. There's always gonna be a set of people who wanna run their MPI applications, that's what the supercomputer people use, or wanna use Rails, or wanna use MapReduce. So we need to be able to support these frameworks in a single data center. Why? Because we wanna maximize the utilization of that data center. An internet-scale data center can cost anywhere from half a billion to a billion dollars, very, very expensive. So I want every single machine in that data center to be running and utilized. We also wanna share data between frameworks. So I might have an application that's looking at click traffic to detect fraud, and that might be written in MapReduce. And I might have another analytical application that's written in Pregel. I don't wanna have to keep two copies of my petabytes of data, one for the MapReduce jobs and one for the Pregel jobs. I wanna be able to share that data between all these different frameworks. So where we wanna go is away from this model of static partitioning. This is what they do in supercomputer centers: they divide their machines up into what they call lanes. So they have a lane that runs Hadoop, a lane that runs, say, Pregel, and a lane that's running our MPI jobs. So we have a data center in this case with nine machines, and we've divided it up into three lanes of three machines. Now, the graph next to it is utilization. What do we see? Well, in the Hadoop case, it looks like there was a job that ran here and then finished. Then there was some idle time, and another job ran and finished. In the Pregel case, we just had a single job that ran and then finished: idle time before, idle time after. In the MPI case, we had a job that ran and finished, a little bit of idle time, and then a second job came in. The key thing that you should notice from looking at all these graphs combined is that there's a lot of white space. White space means a machine that's sitting there, not doing anything. I spent a lot of money to build the data center. I spent a lot of money to buy the machines. I'm spending money to cool the machines. I'm spending money to power the machines. In fact, the energy costs for running a machine over a three-year period exceed the cost of buying the machine. So I don't wanna have a machine sitting idle. Where we'd like to go is this picture on the right, where I've got one data center with my nine computers in it and all my jobs can run across all the computers. Now, what are some of the benefits? Well, immediately, notice there's a lot less white space. So I've got higher utilization overall: fewer idle machines, less money that I'm pouring down the drain and wasting. But there are other very important things to notice. For example, look at how long the Pregel job runs for over here, versus how much faster it runs over here. So I get better responsiveness: my users get their answers quicker, lower latency. And that's because over here, Pregel can use at most a third of the machines, while over here it gets closer to 80, 90% of the machines.
Other benefits: if I'm able to coalesce my workloads and run them on the machines together, I can buy fewer machines. So now my data center costs less. I provision it for less power and less cooling, I make it smaller, I buy fewer machines, and I have a smaller power bill at the end of the month. So we get all these benefits. The way we're doing this is a project that we developed here called Apache Mesos. And Mesos is basically a common resource allocation framework that runs underneath these diverse frameworks. Before, we ran Hadoop directly on the nodes: we partitioned off a set of Hadoop nodes, and we partitioned off a set of Pregel nodes. Now we add this indirection layer, which is Mesos. It allows us to run multiple instances of the same framework. So I can have the Hadoop that's running my test jobs or my development cluster workload sitting in the same cluster that's running my production jobs. I don't have to build two or three different clusters. I can even have different versions: I can have Hadoop 0.19 and Hadoop 0.20 running at the same time. So that's a lot of benefits. And that's in fact one of the things that a lot of companies like, this ability to put out a new version of a framework, run some of their workload against it, see how well it does, and then, when they want, just flip the switch, and now all of the cluster is running the latest version of that framework. I can also build specialized frameworks. This gets back to that rapid innovation issue. People want to create a new framework like Spark. It's trivial to build and deploy one when you're doing it on top of Mesos. So the goals we had in the Mesos project: deliver high utilization of the resources, because we didn't want idle nodes; support many different kinds of diverse frameworks, both current frameworks and future frameworks, because we don't know what the next grad student or undergrad is gonna think up as a new framework, so we want Mesos to be as flexible as possible in enabling future frameworks; scalability, because it has to scale to these internet-sized data centers of tens of thousands of machines; and reliability in the face of failures. We actually use ZooKeeper to help us with reliability. So Mesos went into incubation, I think it was in December of 2010, and we just graduated from incubation this year; we're now a full top-level Apache project. So you can go and download the code, as many other people have. The result of all these goals is that we have a very small, microkernel-like design. We have limited functionality: we made it as simple as possible and pushed all the complex logic of figuring out what resources to allocate to a particular framework up to the framework itself. Let it make the decision. So there are two key design elements in Mesos. The first is fine-grained sharing: we do allocation at the level of tasks within a job. And this allows us, as I'll show in just a moment, to improve utilization, latency, and, going back to that question from earlier in the lecture, data locality. The second design element is resource offers, a really simple way of pushing the complex decision that you have to make in scheduling to the individual frameworks. If we can remove that complex logic from the scheduler, the scheduler becomes really simple. All it has to do is decide what fraction of the cluster gets allocated to each of the frameworks, and then the frameworks can decide which of the nodes they want. All right, so let's talk about fine-grained sharing.
So on the left, we have that model we talked about, the HPC model, where we have coarse-grained allocation of resources: a lane for framework one, a lane for framework two, a lane for framework three. One of the things to remember is that we're using commodity machines with local storage, so we have HDFS running on them. The challenge we have is: what happens if the data that this task from framework one needs is located on a machine in another lane? It has to do a remote procedure call down to that machine to get the data and have the data sent back. That puts load on the network for the request, and in particular for the response, and it interrupts whatever that machine was doing. So it's very, very expensive to do a remote read. Over here, in the Mesos world, we break things up into tasks. And the tasks are like little grains of sand that we fill in on all the machines, and that gives us the ability to place a task right where its data is. So now we can do a local read instead of having to do a remote read from HDFS. The second part of this is that as tasks finish, we can schedule other little tasks to take over, and we can just keep doing this, right? Substituting in as each one finishes. If you do all of this, you get better utilization, better responsiveness, so lower latency, and you get data locality. We get all three benefits. The second design element is resource offers. One solution would be to have a global scheduler and have applications express their preferences for nodes. So they could say, oh, I wanna be on this node, on this rack, but I also wanna be on this other rack, and I wanna make sure no other tasks get scheduled on the same rack, and I want four cores, and I want a machine that has two SSDs, and I want a machine that also has a GPU so I can do GPGPU computation, and so on. You can create these really complicated, arbitrary specifications for resources. That makes it very complex. The plus is that you can make an optimal decision, because you can see all the requests that come in and pick how to place them. But the complexity makes it very difficult to scale that scheduler, and very difficult to get it correct. If it's not correct, it's gonna crash a lot, and if it crashes a lot, that impacts your 10,000-node cluster. So that's a problem. Also, say a new framework comes along and decides it wants a new particular kind of resource. Maybe it wants a certain number of cores per socket, and you never thought to have that in your specification language before. Now you have to add it to your specification language, and maybe you have to go back and change the requests from the other frameworks to deal with this new specification. So for these reasons, we decided this was not the way to go. Instead, what we do in Mesos is very simple. We make a resource offer. We look at all the resources that are available in the cluster, bundle them up into a list, and offer it to a framework: here are the resources, pick the ones that you want. This makes Mesos really simple, and we can support new types of resources; they just become a new thing that we add to our list. We don't have to know anything about the semantics of the underlying resources. The downside is that decentralized decisions aren't always the best.
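Here's a toy sketch in Python of the shape of that interaction; this is illustrative, not the real Mesos API. The master only decides which framework gets offered the free resources, and the framework applies whatever logic it likes, in this case data locality, to pick from the offer.

```python
# Toy resource-offer sketch (illustrative, not the actual Mesos API).
# The master knows only what's free; each framework brings its own logic.

free = [{"node": "n1", "cpus": 4, "mem_gb": 8},
        {"node": "n2", "cpus": 2, "mem_gb": 32}]

class HadoopFramework:
    def __init__(self, preferred_nodes):
        self.preferred = preferred_nodes      # e.g. where its HDFS blocks live

    def accept(self, offers):
        # Framework-side scheduling logic: take only offers local to our data.
        return [o for o in offers if o["node"] in self.preferred]

# Master side: offer all free resources to one framework; it picks.
framework = HadoopFramework(preferred_nodes={"n2"})
accepted = framework.accept(free)
for offer in accepted:
    free.remove(offer)                        # master launches tasks there
print("launch tasks on:", [o["node"] for o in accepted])
print("left to offer to the next framework:", [o["node"] for o in free])
```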
Now, there's a very real-world analogy that makes this easy to understand. For those of you who flew this past weekend, you got on a plane, right? And when you booked your ticket, the airline gave you a seat map, and it showed you the empty seats. And you decided, you know, I want to sit up front, I'm gonna pay the extra money, or I want to sit in the far back because it's quieter, or I want an aisle seat, or a window seat, or I don't want a bulkhead because that's where all the babies are, or, you know, whatever. The airline doesn't care. They know nothing about your preferences for where you want to sit; that's all in your head. You get to apply whatever complicated logic you want to where you're gonna sit on the plane. That's exactly what we're doing in Mesos. So really quickly, because I'm running out of time, here's what happens in Mesos. We have a Mesos master that keeps track of all of the resources, we have Mesos slaves that run on each of the nodes in the cluster, and then we have frameworks, like an MPI framework and a Hadoop framework. When nodes have available resources, they communicate to the master that they have free resources. The master picks a framework to give those resources to, constructs a resource offer that contains a list of the available resources, and sends that up to the framework. The framework applies its complicated scheduling logic, picks the nodes it wants, and generates tasks that it sends back to the master. The master then creates an executor on each particular slave that will run the task and isolate it from all the other tasks that might be running on that node. And if we have some free space left over, maybe we create a new resource offer and pass it up to a different framework. So this is actually in production. Its biggest deployment is at Twitter, where it's running on many, many thousands of nodes, running dozens of production services, servicing over 100 million people every day. It's also used at UCSF, at Yahoo, at startups, and many other places, including locally at Berkeley. Yes, question? Ah, so that's a very good question. Do we have a notion of data locality in this? We don't have an explicit notion, but you can always look in HDFS, find out where the data is stored, and then request nodes that are local to it. Okay, so in summary: we've moved from our computer being a single node to our computer being an entire data center. And for that, we need a cloud operating system. We're starting to see bits and pieces of that, but there's still a lot of work to be done. Spark was a project done by grad students. Mesos was a project done by grad students. So this is a plug for going to grad school: you can have massive impact. You can have your code used by tens of millions of people, even through just a research project. Many, many components go into a data center operating system. High-throughput file systems that can tolerate failures. Frameworks, which are an area of just tremendous innovation, because people are always coming up with new ways to process data; MapReduce really catalyzed this area and lit a flame under it. High-level query languages that people build on top of MapReduce, and there will undoubtedly be more. And then finally, scheduling the resources in a cluster. So there's Apache Mesos, and there's also Apache YARN, which is the next generation of Hadoop MapReduce. So with that, any questions? Okay, well, thank you very much. I hope you guys have enjoyed this semester. And good luck on your project and good luck on the second midterm.