Hey everybody, can you hear me? Raise your hand in the back if you can hear me. Okay? Yeah, all right, great. We have a full house, but I think there may be some seats left. Raise your hand if you're sitting next to an empty seat. Raise it high. There's one. There's one. Okay. So here's the bad news, good news, bad news. This course is wildly oversubscribed, which is great because it's a good topic, but bad because you can't all get in. What we did is we convinced the campus to let us over-enroll this course. The rules are you can only have ten percent more people enrolled than seats in the room. We got them to blow that away. So we admitted another, I don't know, 30 people I think, so we're up to like 340 or 350 people admitted to the class, and there are 297 seats in this room. The only way I convinced the campus to do this is that I agreed, I signed a form that said I will kick out people who do not have a seat, because otherwise the fire marshal will shut us down. And part of that deal is that the class will be webcast. So until such time as I bore you guys enough that like 60 of you don't show up to class, it's going to be kind of first come, first served. That said, I am going to have to ask those of you who do not have a seat to get up and go. Have a Coke or something on your way out. Also, if you're not enrolled, if you're auditing, please give up your seat to someone who's enrolled. Final point before you leave: if you are enrolled and you're taking the class, homework is being passed out today. It is due Thursday; that's in two days. To do that homework,
you need a piece of paper from the TAs at the door to get a course account. So please, on your way out, pick one up, and information on the homework will be on Piazza. Lectures should be online shortly, my apologies. All right everybody, let's get going. So I apologize once again, but we should try to jump in. Thank you all for being early and eager and pushing your way into the room and getting seats. That's great. All right, and for those of you who are watching on TV, we love you too. Let's get started. So today: in any class that you take, the professor really should answer these questions at some point. Why take this class? What will we be learning about? Who are we? How is the class going to work, logistically? And then I'll give you just a little bit of a taste of the kinds of things we're going to be learning about this semester, a little flavor, just enough to get you in trouble with your homework. And that's the lecture for today. So it's going to be rather light today; the rest of the semester will obviously be more technical. Okay. Why take this class? Well, first of all, a little note of history: the first lecture ever delivered in this room was this class, and it was me, and that was three years ago. It's kind of cool. So I love this room. Three years ago I could feel this happening, and now it has happened. When I started teaching databases at Berkeley in the mid-90s (I'm pretty old), there was nothing interesting about this topic. I'd be like, "Data! It's really cool! There's all this really interesting computer science!" and students would be like, "Hey, can we go, like,
you know, program video games and figure out what that web thing's all about? We're not really interested in data; we want to do some computing, we want to compute on some stuff." And three years ago, when I taught in this room for the first time, it was pretty clear to me. In fact, if you look at the slides from three years ago, it says: five years from now, data is going to be huge and you're all going to be employed working on data. We're well into that. So I don't think I need to convince you that data is going to be at the center of many things. In fact, frankly, data is pretty much at the center of everything. It's almost hard to think about our world now without thinking about the ways in which we can quantify it and then take advantage of that quantification. There is a mobile mic if you want to come on up. All right, can you guys keep it down, please? Thanks. All right. So, you know, things like Moneyball, if you saw the movie or read the book: that's obviously all about data. This was an interesting issue of Wired some years ago. It's kind of over the top, but still, they called it "The End of Science," and what they said was: the quest for knowledge used to begin with grand theories; now it begins with massive amounts of data. Welcome to the petabyte age. Which already kind of dates it, right? It's like, petabyte? Kind of big, not that big. So that's like four years ago. And then The Economist, which is really mainstream media, started talking about this "data deluge" and "big data" in the early twenty-teens. And by now everybody gets it. You can talk to your grandma about big data, and she at least has heard of it. So this is a thing. We all know it's a thing. Some numbers. I tried to get up-to-date numbers, but actually the best numbers I could get for "how much data is there in the world?"
were from this IDC study a few years back, 2011. So at that point they were saying data was doubling every two years, and 1.8 zettabytes were going to be created in 2011. To give you a sense of that (infographics were really cool in 2011), 1.8 zettabytes would be like every person in the United States tweeting three tweets per minute. That's 4,320 tweets per day per person, for 26,976 years. It's kind of silly, but it's a way to get your hands around it. Or, you know, two hundred billion HD movies; that's kind of the equivalent of how much data was going to be generated in 2011. And if you stacked it up in 32-gigabyte Apple iPads, it would build a mountain 25 times higher than Mount Fuji. So that's pretty cool. There's a lot of data, that gives you kind of a feel for it, and it's only growing. All right, so, credit where credit is due, here's the most recent one they published this year. Not as much entertainment, but we're up to 4.4 zettabytes in 2013, and they're predicting 44 zettabytes in 2020. So there's going to be a whole lot of data. Now they're stacking up tablets two-thirds of the way to the moon. I don't know how they figure out these analogies, but whatever, it's a lot of data. Okay, there's a lot of data out there. So what the heck is going on? Where's all this data coming from? Well, from the internet, I guess. Okay, but really, what do we mean? If I set all the people in the world typing at keyboards as fast as they can, we wouldn't generate that much data. Twitter actually isn't that much data. You get everybody typing really fast, there's four billion people, and it's not going to generate that many bytes. Decent amounts, but not that much. There's other stuff going on. One thing that's going on is that there are lots of copies of the same stuff,
basically movies. When you measure internet bandwidth usage, it's largely a small number of things being sent many times to different people, right? So that's all about staging that data out and caching it and putting it in different places to reduce bandwidth usage. But storage-wise, a lot of the world's magnetic storage is being taken up with copies of movies. Okay, so that's some of it, and that's actually not that interesting, at least not today; it's not that much unique data, and we don't do that much with it. But that's part of it. What's really kind of interesting is what I call the industrial revolution of data. This is happening under our feet, where basically data is being stamped out by machines. Data used to be crafted by people typing at keyboards, or people making movies or making music, creating content. But now data is increasingly generated by machines. Primarily, these days, software logs: all our software is spitting out logs about what's happening with it, and that software is at the center of many of the systems of the world, commerce and government and so on, so that just generates a lot of data about human activity. And then there's physical stuff. RFID tags started happening; you know, that's how you get into Soda Hall with your swipey card. GPS started happening, and we started talking about the Internet of Things, which is something people at Berkeley were working on like 15 years ago at least; we called it sensor networks. But it's really happening out there now, more and more. Quantified self is a piece of that: people are starting to measure their own bodies. And then high-bandwidth stuff like microphones and cameras is just generating lots of data based on what's going on in the physical world. So this data is being stamped out and formatted by machines.
It's not like, you know, Shakespeare writing it all down and generating this data. And that changes the kind of data we have, it changes the coverage of things we have data about, and it changes what we can do with that data. And so, today is probably the only day we'll take the luxury of asking some of these questions, but you might ask: how does this make you feel, if everything gets measured and stored and indexed and machine-learned and predicted? Well, I don't know. Let me give you some food for thought on that front. So, start with this: information is knowledge. This is good, clearly; mankind has always wanted more information. This is a good thing. Albert Einstein said that, actually, which is pretty cool. Knowledge, obviously, is power; Francis Bacon said that, so this is good. And then of course, as you know, with great power comes great responsibility, which was said by Spider-Man's Uncle Ben, right? So I think there's a subtext to this whole class (there's a political science class right before us in this room), a subtext about what we as engineers are enabling and what technology is going to bring us, that I think merits at least a little discussion on the first day of class. Okay, so some years ago, back in the mid-2000s, I was working with a startup in the city, and they were hosting public data on the internet, auto-generating charts, and creating sort of blog-style discussions about the data. And one of the statisticians at the Organisation for Economic Co-operation and Development, which is a European multinational
organization, got really excited about this, and this is what he said about the idea of having data online. This is deep in the Bush era, by the way, so a little political context. "With the collaborative spirit, with the collaborative platform where people can upload data, explore data, compare solutions, discuss the results, build consensus, we can engage passionate people, local communities, media, and this will raise incredibly the amount of people who could understand what is going on, and this would have fantastic outcomes: the engagement of people, especially new generations; it would increase knowledge, unlock statistics, improve transparency and accountability of public policies, change culture, increase numeracy, and in the end improve democracy and welfare." Awesome. Okay, so certainly you can do lots of good stuff with data. And I think one of the other things he's hinting at (he's a statistician, so not surprising) is that the ability for large numbers of people to reason about data is kind of key to a good society in the 21st century. So I think for all of us, as engineers, computer scientists, potentially educators and mentors of people as you grow, there's a lot of hope and positivity in what you can do with the ability to get data out. You look at people who are trying to do sunlighting of public data, or trying to organize data.gov, or things like that: there's a lot of public good that can be done by basically increasing transparency of what's going on in the world, through data and tools around data. Okay, so definitely some big positives. And I put that first because, you know, there are lots of negatives, right?
They're coming. But there is a lot of positive to be done here, and I think you've got to keep your eye on that and hopefully help with it. So, Stephen Colbert has this awesome "word of the day" thing. I don't know if he still does it, but some years back he got all excited about Wikipedia. Did anybody see this? He had Jimmy Wales on Colbert; it was awesome. So he defined this word "wikiality": like reality, but it's wikiality. And he said: together we can all create a reality that we can all agree on, the reality we just agreed on. And so to illustrate this, he went into Wikipedia and edited the article on elephants to say that there was an overabundance of elephants in Africa and that people should start hunting them. And then he said: Nation, Colbert Nation, you have to keep this page up, because people could just change it back. And so TV watchers were continually updating the elephant page to say there were too many elephants in Africa. So this created kind of a thing, and then Jimmy Wales, who's the founder of Wikipedia, came on the show, and that was interesting. So Colbert asked Jimmy Wales, "I hear you have Wikipedia in lots of different languages," and Jimmy Wales said, "Yeah, we have Spanish and this and that," and Colbert said, "Shouldn't that Spanish Wikipedia be in English?" And Jimmy Wales said, "Man, now we're going to have to shut down the entire Spanish Wikipedia and lock it down, because you're going to change it." So, you know, "definitions will welcome us as liberators." So what does it mean when you let anybody update the official knowledge of the world?
It's kind of fascinating. Maybe it was a little more interesting back when Wikipedia was just getting going, but still, there's some crazy stuff on Wikipedia, and it gets actually kind of interesting when you try to scrape data out of Wikipedia: not just the text, but the data in the infoboxes. So here's one I ran across in a conference talk. This was a talk on a knowledge base being built in Germany called YAGO. It's used in a lot of AI projects, and they're scraping Wikipedia to get data about people, places, and things, and using that as a knowledge base for various machine learning tasks, mostly around natural language processing: trying to extract people and places and other common constructs out of text. And they had this example about John Coltrane. He's a famous saxophonist and composer, right? And they had the Wikipedia box of all the categories he fits in. When you go on Wikipedia and look for stuff, they have these infoboxes on the right, and it said all sorts of stuff that I knew about John Coltrane: you know, avant-garde jazz. And you could argue, say, that that genre doesn't make sense; he was not a hard bop guy, he was more of a modal guy or something. But you could argue about that, and that's not what's interesting, although it's up for debate.
What's interesting is this: he is listed as a 20th-century Christian saint. Which is kind of weird. Although you may happen to know that in San Francisco there's the Church of John Coltrane, where you can take your saxophone and go and worship, actually. And in fact, he was granted sainthood by a small African-American church, which, it says, now has about 5,000 members. Okay, so there are at least 5,000 people who think he's a Christian saint. But when you use this as a table in a database or a knowledge base, and you start looking up 20th-century saints, John Coltrane just gets tossed in the bucket with Mother Teresa and everybody else. And most people would be a little surprised; maybe even John Coltrane. All right, so weird stuff happens when we all get to say what the data is, and maybe it's good and maybe it's not. All right, how does this make you feel? How much data does the NSA look at daily? This might make you feel great; I don't know how it makes me feel. The NSA looks at 1.6 percent of total internet traffic, which is about 29 petabytes a day. So this is traffic on the wire, traffic in motion, in fairness, but they're capturing it. All right, so for context: Google in 2010 said it had indexed 0.004 percent of the data on the net. So by inference from the percentages, the daily NSA data collection is 400 Googles. I don't know if that's true, frankly, but it's daunting. Or 126 Facebooks a day being gathered by the NSA. So I don't know how that makes you feel. It's interesting; there are certainly some really cool big data challenges to work on at the NSA, no doubt. All right, here's another one.
I had a graduate student who got really interested in data research having to do with health care in Africa, and he spent a lot of time in Tanzania and Uganda. One of the things that motivated his work was this statistic: 76% of children born in sub-Saharan Africa are unregistered. Not only do they not worry about there being too much data, or about people spying on them; they don't even know who's been born, who might need to get a shot one day when they're passing out the shots. In a lot of these countries there's no data at all, zip. They'd love to have data, because then they could help people with it. Okay, interesting thing. So this student, Kuang Chen, who after graduate school here at Berkeley started a company called Captricity, now in Oakland: he went down there, and this was the kind of data they had at the health care clinics he visited in Tanzania. One of the doctors realized that this sort of statistics could help them figure out a little bit about how to deliver health care, so the clinic would put these things on the wall: hand-drawn charts. And this is like a database by the standards of that village. So, yeah, data in that context. And Kuang went ahead and worked with these people and got them using cell phones to take pictures of data and upload it to the cloud, where it could be processed in the big city. All kinds of good stuff happened from that research; they've done a whole bunch of disease and AIDS studies using that technology. It's pretty cool, and this company got started as well. But data there: there's just not nearly enough of it. Not nearly enough. And a lot of it has to do with sensing and data gathering. How do you get data in these contexts? A lot of times the answer is paper, because people will write stuff down on paper. It's a great technology.
It goes everywhere. All right, so that's interesting. Then there's, of course, the classic pie chart: "percentage that looks like Pac-Man, percentage that doesn't." So that's data that just makes you feel silly. (Though if you actually want to generate that chart, you have to come up with some numbers.) Anyway, data will certainly be at the center of major issues and events in our lives. I think it's worthwhile, as we think about stuff in this class, to think about what we're enabling, and to keep this in mind as you think about your next job and so on. What are they using the data for at your company, or your nonprofit organization, or your university, or wherever you happen to work? So that's a little bit of the why for this class. It is big stuff. It's really big stuff. It's society-changing stuff. Okay, so what are we going to learn about? Well, what is a database? It's pretty easy to see the database in this picture: it's an IBM data processing system; it has a label on it. So back in the day, databases were pretty easy to identify. They were also really, really boring, and they weren't even considered worthy of academic study. This was mostly work being done in places like IBM back in the 50s and 60s, although they actually ended up doing a lot of really interesting computer science. Most people would agree that your bank has a database that holds your money, and the classic example of transaction processing is banking. So that's not terribly surprising. This is a website on the census; you'd probably believe that the census is a database, and the data behind this website is a database. That's not terribly surprising either. All right, so: is Google a database? How many people think Google is a database? How many people think Google is not a database? Fair enough. Yeah, I know, most people didn't vote.
I won't ask why. LinkedIn: I think there are a lot of databases here, but let's look at some. "Say happy work anniversary to your friends who have a work anniversary." That's pretty much: look up your friends whose work-anniversary date equals today. So that's pretty clearly just a database query. That's pretty boring. Okay, what about this: "Pulse recommends this news for you." Well, I don't know; news is kind of text. I don't know if that's a database or not. Maybe it is, and then recommendations: is that databases? I don't know. Maybe. Yeah, probably. "People you may know": this one's pretty interesting. There's a graph here, a social graph, and wandering around that graph and recommending who you may know, that's an interesting special kind of database, maybe. And then I get ads for Hadoop, because that's the kind of geeky guy I am, so the ad they choose to target at me is probably coming from some database. There are a lot of databases floating around on this one web page. And actually, most services you use (Facebook, LinkedIn, and so on) issue hundreds of requests to dozens of databases before they pull together a page for you. Oh, yeah, and then I have all sorts of communications that have been stored in various ways; maybe that's a database too, depending on how you define it. All right. This is a terminal from my laptop. This is, you know, the syslog in /var. Is that a database? How many people think that's a database? How many people think it's not? It's a really crappy database. I can, by the way, tell you the history of why system logs look like that; it's all based on Berkeley, and it's kind of sad. As a guy whose company helps people clean up these logs, though, I'm kind of happy about it. But it's really pretty bad engineering. It's messy. All right, and then these guys: are they databases? Hmm.
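That "work anniversary" lookup really is a one-line relational query. Here's a minimal sketch in Python using the standard-library sqlite3 module; the table and column names are made up for illustration, not LinkedIn's actual schema:

```python
# Hypothetical schema: a table of connections with the month/day they started
# their job. "Who has a work anniversary today?" is then a simple SELECT.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE connections (name TEXT, hire_month INTEGER, hire_day INTEGER)"
)
conn.executemany(
    "INSERT INTO connections VALUES (?, ?, ?)",
    [("Ada", 9, 2), ("Grace", 1, 15), ("Edgar", 9, 2)],
)

# Pretend "today" is September 2: find everyone whose start date matches.
rows = conn.execute(
    "SELECT name FROM connections WHERE hire_month = 9 AND hire_day = 2 ORDER BY name"
).fetchall()
print([r[0] for r in rows])  # ['Ada', 'Edgar']
```

The point is that a feature that looks social and friendly on the page is, underneath, just a filter over a table.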
I don't know. Your phone's got some database software on it, probably. So does your Fitbit. They're generating data, they're storing data, they're communicating data. I don't know. GoPro? So a GoPro is not only taking pictures; there's all sorts of information it's getting about how you use the device. So it's not just the primary function of these devices, but also where you go in space, and when you use it, and all kinds of things they can gather. Right? If you always wear your GoPro on your motorcycle helmet, then they know where you go. So, I don't know, is that a database? Maybe. So, you know, let's not split hairs. Let's just say a database is a large collection of structured data. So why don't we rule out, like, prose? Except that for our homework, I'm going to make you look at a whole lot of prose, so it's kind of going to be a database project. The bottom line is that structure really is everywhere: anything you look at, you can extract structure from it and do analysis. So pretty much any data could be database data, and then the question is only really whether you're organizing it or not. These days, most of the time, you organize it when you need it. So you decide you're going to take, say, all the works of Shakespeare and do something with them. Well, once you carve it up and get the statistics out of how Shakespeare used different words, you have a database. So pretty much any collection of data is a potential database these days. Now, a separate question is: what's a database management system? What's the software that supports your database? Well, it's a chunk of software that stores, manages, and/or facilitates access to data, to databases. Okay, so we're going to be pretty catholic, with a small c, meaning open-minded, about what we call a database.
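The Shakespeare example can be made concrete: tokenize the prose, tabulate it, and you suddenly have structured data you can query, sort, and join like any other table. A minimal sketch in Python (the snippet of text is just a stand-in for the full corpus):

```python
# Carve unstructured prose into tokens, then tabulate word frequencies.
# The resulting mapping is a tiny "database" extracted from plain text.
from collections import Counter
import re

text = "To be, or not to be: that is the question."

words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer
counts = Counter(words)                       # word -> frequency table

print(counts["to"], counts["be"])  # 2 2
```

Swap in the complete works and you can start asking real questions: which words does Shakespeare favor, how does vocabulary shift between plays, and so on. The structure wasn't in the file; you imposed it when you needed it.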
I'm not going to get all hung up on it. And actually, what we're going to learn in this class should be useful across all these different kinds of uses of what I might mean by "database." That said, I don't want you to go out into the world talking differently from everybody else. When you go out into the world: most people, when they talked about databases until reasonably recently, were talking about relational databases with transactions, a la Oracle, Microsoft SQL Server, IBM DB2, and the like. So that is the common usage. It's also a really mature technology with a lot of interesting stuff in it, so we'll make reference to it a fair bit in this class. Your textbook is very focused on relational databases. The lectures, based on the order of material and the way we break it apart, are less focused on the relational database per se. Today, the market and the terms are actually in rapid transition. But the tech behind the relational database and the tech behind the databases being built at places like Facebook and LinkedIn and Google aren't so different, really. So the techniques you're going to learn in this class, many of which were invented in the 1970s, 1980s, and 1990s: a lot of that technology is what people are either using or reinventing in large-scale big data systems. But there are various pressures to remix this technology, to revise it, to put it together in different ways, and to change some of the key assumptions, and those pressures are coming from a bunch of directions. We're hopefully going to at least touch on a little of this in class. But because things are changing very rapidly right now, and we are in a time of transition, I will ground you in the more textbook stuff, which will give you the tools and abilities to invent the next generation of stuff, frankly. Okay, but hardware is changing quite a lot. When your textbook was written, all large databases of
note were running on magnetic spinning disks. And that is just not true today. There's lots of data being either staged or permanently stored in RAM or on flash, which doesn't have the same performance characteristics as magnetic disks. So after 50-odd years of building systems focused on magnetic disks, there's a lot of rethinking going on. The basic ideas are the same, but suddenly you're like, oh hey, I don't have to worry about disk seeks, and I want to make sure I don't have memory contention, because everything's faster. So hardware, as is always true in computer systems, is changing some of the rules. Data volume is changing some of the rules as well, and it actually works against the hardware point. You can get a phenomenal amount of data in your laptop, in memory, especially if you compress the data; you can run a pretty big compressed database, and a lot of the analysis you could ever want to do, you could do on your laptop. On the flip side, you walk into a place like Facebook, and they'll tell you they're running Hadoop on, I don't know, 5,000 machines for this, and a hundred machines for that, and 20,000 machines for this, and you sort of can't do the arithmetic; it almost doesn't pan out. And they're storing it on magnetic drives, to a large degree. So when you're dealing with things at very big volume, some companies, some engineers, will sacrifice efficiency for scalability, ease of management, and ease of writing software. They'll say: you know what, disk drives are fine; we're going to spread this out over lots and lots of machines.
We're not going to worry about optimizing each individual machine. But because we're at this huge scale, we're also going to give up on some of the things we wanted in relational databases, because to make something work on tens of thousands of machines, or across the globe, where communication between machines is limited by the speed of light, you can't implement some of the classic algorithms efficiently; they assumed you could access memory super fast. So there's been a lot of work in the last 15 years on relaxing the constraints of traditional databases, giving up on some of what they promised, and then in the last five years or so on trying to get some of that back again, because it wasn't such a good idea to throw out the baby with the bathwater. So that's an interesting space right now: dealing with data at just gigantic volume. And then, last but not least, because data is now clearly at the center of computing, it's hard to imagine what a computer is for if it doesn't have at least a decent amount of data, right? It used to be that computers were for computing, right? Like calculations and stuff.
That's not really true anymore. Computers mostly revolve around the data that they store and access. And so, basically, all the things you might want to do with computers are things you want to do on large amounts of data. So there's a very wide variety of usage of big data now: machine learning algorithms, graph processing, a whole bunch of workloads that really didn't exist even 10 years ago at the level of the database, and now they need to be pushed down to storage for efficiency. So that widening variety of usage is opening up a whole bunch of challenges as well. These are all changes, and frankly, the first lecture of a research class on databases would start by talking about those three things. I want to flag them today so that three weeks from now, when we're talking about spinning disks and disk arms and disk seeks, and maybe assuming we're on a single node, you won't be like, "Hellerstein, man, you're living in the past." I'm not living in the past. It's good for you to learn how this stuff works on the traditional technology. But be aware that when you go out into the field, the tools and techniques we're going to cobble together in this class will have to be rethought to some degree as you apply them in other contexts, and I'll try to flag that as we go. The stuff you'll learn is useful today, it's useful tomorrow, and it gives you the basis for what will be happening in five or ten years. All right, so it's a really good time, bottom line, to focus on the fundamentals. I will not teach you the internals of exactly how Oracle works, because it kind of doesn't matter, but I will teach you the building blocks that allow you to build systems like Oracle, or Hadoop, or a NoSQL database, or whatever comes next. All right. So what is a database, really? Or a database management system,
I should say. Is an operating system a database management system? This is a classic: no. And for many years, the OS and database communities distinguished themselves based on this, in both the software engineering communities and the research communities. Clearly you can put data in RAM, so maybe that's a database. Every programming language does this, really, so maybe Python is a database system? Probably not. RAM is great: it's super fast, it's random access, so you can look at any part of RAM you want, and that sounds awesome, right? So what's wrong with that? But it gets better: every operating system comes with a file system, right? It manages these things called files on a disk, usually a persistent disk like a flash drive or a magnetic disk. It allows you to do things like open files, read files, jump around in them with seeks, and then close the files. And it lets you set protection on the files: this file is unreadable, this file is only readable by me, and so on. So that's kind of like a database. What are some drawbacks of a file system relative to RAM? Why don't we just put all of your Python state in the file system? What's the point of RAM? Yeah? All right: disks are still pretty slow. Even flash drives are pretty slow, actually; they're at least an order of magnitude, probably multiple orders of magnitude, slower than RAM access, depending on various configuration issues. And I neglected to put into this slide deck the analogy of how long it takes to get to memory
this is how long it takes to get to disk. It's sort of like things that are in this room versus things that are somewhere on campus, you know — that kind of thing. It's like an order of magnitude or more time to get things that are on disk. So it's slow. Anything else about file systems that maybe isn't so good, relative to RAM? Yeah? They're not really random access. You can seek to an offset in a file, but if it's a magnetic disk you have to kind of spin the disk and move the disk arm — it makes that horrible chunkety-chunkety noise, right, to move around. And it doesn't have memory cells the way that RAM does, and you don't have a language layer on top of it that lets you follow pointers around quite as nicely. So it's not really a random access device; RAM is random access. Yeah? Ah — interesting. So what I heard you say was, it might be getting used by other programs. So the API to the file system is shared across multiple processes in an operating system. And if you remember from your operating systems class, one of the first things they do is virtual memory, which prevents multiple processes from accessing each other's RAM at the same time. So there's something in the file system that might worry you with multiple processes going on. Now, multi-threaded programs have the same problem in RAM. Okay, so concurrency — which is another word for this — is a problem that actually comes up in multi-threaded RAM-based programming as well, but it came up earlier historically in places like file systems and databases. So it's an excellent point. Okay, so here's a thought experiment related to that, actually. You and your project partner are working on homework one, which will be passed out today. Let's say you were doing it on the instructional machines — which you're not, which is good — but supposing you were, and you were both, you know, running vi or your favorite editor on the same file, and you both save at the same time.
Okay, so we're gonna do a little poll. You both saved your changes to the file at the same time — whose changes survive? A: yours. B: your partner's. C: both. D: either/or. E: question mark. Could I get a vote? Raise your hand if you believe A is always the answer. How about B — is B always the answer? Is C always the answer? D? E? Yeah — pretty clearly, right? Nondeterministic. Who the heck knows what's gonna happen — including things like the file is destroyed. You know, all kinds of crazy things can happen. It's very, very bad. File systems do not like you to do this, and they don't help you with it that much. All right, here's another thought experiment. You're working on your file and the power goes out. Which changes survive? Well, nowadays you have a battery in your laptop, so it's not so bad, but — all your changes survive? None? All since the last time you pressed the little picture of a floppy disk at the top of Microsoft Word? Can you believe they still have that icon? Which changes survive? I don't know. I mean, the floppy disk icon is awesome, right? Because you click on this thing that looks like a storage device from 1988, and then, um, some pixels make it look like you pushed it in, as if it were a real button. And then the power goes out. So was it saved? I don't know. What about when the button comes back up — is that better? Like, nobody knows, right? And I don't know if you've noticed this, but Microsoft Windows is a very sophisticated operating system that they've been working on for, like, two or three decades, with a very fancy file system underneath it called NTFS. It's got all kinds of logging and recovery built into it. But what happens when you start up your machine after you crash in the middle of Word? You start Word again — what do you see? Recovered file number one, right? So the folks in the Microsoft Office division have implemented their own recovery algorithms. Why? I guess because the Windows file system guys weren't getting it right, right?
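As an aside, that "you both save at the same time" poll can be sketched concretely. This is a toy demonstration (hypothetical file name, sequential rather than truly simultaneous writes) of the classic lost update: both editors read the same starting version, edit their own in-memory copies, and whoever saves last silently wipes out the other's work.

```python
# Toy sketch of the two-editors lost-update problem.
# The file name and contents are invented for illustration.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hw1.txt")
with open(path, "w") as f:
    f.write("original homework\n")

# Both partners read the same starting version into memory.
yours = open(path).read() + "your answer\n"
partners = open(path).read() + "partner's answer\n"

# You save first, then your partner saves: the file system just overwrites.
with open(path, "w") as f:
    f.write(yours)
with open(path, "w") as f:
    f.write(partners)

final = open(path).read()
print("partner's answer" in final)  # True: last writer wins
print("your answer" in final)       # False: your edit silently vanished
```

The file system never complains — it happily applied both writes; your changes are just gone. With genuinely simultaneous writes you can get even stranger interleavings, which is the "E: question mark" answer from the poll.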
That should make you nervous. Okay — it's weird when the same vendor is doing recovery at the application level and at the system level. It suggests that they're not using shared components that are reliable. A fair, though I think somewhat flawed, point was made in the front here, which is that, well, you know, Word does run on the Mac, and so maybe they're just protecting themselves against Apple's bad operating system. But the truth is, if you know anyone who's worked in the Mac division of Microsoft — that is a forlorn, sad little corner of Microsoft. They actually don't make decisions based on what's going on in the Mac division. Okay, so here's the thing. You're a developer — like, you're writing Microsoft Word — and you go to the file system guys and you're like, "What's going to happen in this scenario?" and they go, "Hmm." So how do you write code for that? And the answer is: you don't. You just, like, have to worry about all the possibilities and code against that. All right. And when you have nondeterminism in your compute infrastructure — suppose I told you that some of the bits in your RAM might flip sometimes. Sorry — sunspots or whatever, but it happens, like, you know, once a month, maybe once a week. How about if you're running, like, a thousand of these machines?
It happens every two or three seconds. How do you write software then, right? So you need to have abstractions that are reliable as you go up the stack. Okay, and a big thing that we're going to talk about in this class is that a database management system is a piece of software that makes programmers' lives easier, because it's going to give us some abstractions that will allow us to stop worrying about some stuff. Okay? That's the answer to that question. All right, so what more could we want than a file system? We could want these API contracts regarding data. I want to be guaranteed that I don't have to think, when I write a program, about other people running instances of the program at the same time. I don't want to think about concurrency control at the application level — please handle that for me. I don't want to think about replication of my data in case my media fails — please handle that for me. Wouldn't it be nice if, like, your family photos had that property? I guess if you put them up on Flickr or Picasa they do, but until they get there, they don't, right? iPhoto has lost some of my cherished memories. So replication, recovery, things like that — I want guarantees from the API so that I, as a developer, don't re-implement that stuff up at the application level. And because I have a lot of data, I want to have a high-level language — sort of a domain-specific language for data. I don't want to be writing my data access in Python, because frankly Python's not a very nice language for dealing with large amounts of data.
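That "please handle concurrency and recovery for me" contract is exactly what a transaction gives you. Here's a minimal sketch using Python's built-in sqlite3 module — the table and amounts are made up, and the "crash" is simulated with an exception — showing the guarantee we want: a multi-step update either fully happens or fully doesn't, and the application never sees the half-finished state.

```python
# Sketch of the transactional guarantee, with invented toy data.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
db.executemany("INSERT INTO accounts VALUES (?, ?)",
               [("alice", 100), ("bob", 100)])
db.commit()

try:
    with db:  # one atomic transaction: commit on success, rollback on error
        db.execute(
            "UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        # Simulate the power going out before the matching credit runs:
        raise RuntimeError("power goes out mid-transfer")
except RuntimeError:
    pass  # the partial debit was rolled back automatically

balances = dict(db.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 100} -- no half-done transfer
```

The application code never had to write its own recovery logic; the system's abstraction absorbed the failure. That's the kind of API contract this class is about.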
I want a simple, efficient, well-defined language that's appropriate to the domain of working with data — something like a query language. And we could talk lots more about that. Some people believe that what I just said is complete horseshit. Some people believe that what I said is horseshit in the other direction — which is that all programming should be done this way, and Python should be thrown out the window even for, like, you know, writing mail clients. So I don't know; it's an interesting question, what's really a domain-specific language. But certainly there's a long tradition that says there are good domain-specific languages for querying data. Because there's lots of data, I want a system that will do the standard stuff you need to do with data — implemented once, efficiently and scalably. It'd be fine if it was a library; it doesn't have to be Oracle. Okay, but I don't want to have to rewrite sorting over and over every time I have a terabyte that I need to sort. A little anecdote along these lines. So this is like bread and butter — everybody knows this; nobody would implement sort every time they need it for lots and lots of data. But as the machine learning stack was developing at companies like Google, what would happen is, they didn't have abstractions yet — they still kind of don't, in the machine learning space — for what are the core algorithms and system components, library components, that you need to build a good machine learning stack. And so what would happen, apparently, is, like, everybody would come to Google back in the early 2000s and write naive Bayes classifiers, because they just got out of college and they knew how to do that. So it's cool.
And apparently if you did a search in the Google code base for, like, naive Bayes classifiers, you'd get thousands of implementations, right? Because it's such an easy algorithm — you can write it, like, in a few lines in the right language. So here in the database domain, we're gonna be able to actually boil this down into a very small handful of design patterns and algorithms, which is really quite nice, and they're very widely usable. Okay. And then data modeling is actually an interesting topic — a surprisingly interesting topic these days. I think conventional wisdom, which I agree with, is that most data you're gonna get, by volume, is gonna get spooled off of machines. It's gonna come from software logs, it's gonna come from devices, it's gonna have whatever crazy log format it had when it was born. And that's fine — you'll dump it somewhere, and then one day you'll want to analyze it. You'll take that horribly structured data and you'll say, well, what do I need to input into my analysis package to answer the question I want to answer, or predict the thing I want to predict? And to do that, what you're gonna do is take the data from one format — structure, or model — and map it into another format, structure, or model. And you're gonna do that custom, because you have a new analysis you want to do. But what's gonna happen in your company is, you're gonna start to productionalize that analysis. It's gonna eventually turn into, like, the recommender system for your website. Okay, and as part of that it's gonna kind of mature, and you're gonna get some software engineering around that process you went through to go from raw data to cooked data. Okay, and you're gonna want to integrate that cooked data with other data. Like LinkedIn, for example: they don't just look at what you type in; they also get resume data from third parties, and they get demographic data from the government, and so on. And you need to start to be able to put this data together and understand how
one thing relates to another. And what you do is, you end up evolving a data model. Okay. Now, back in the textbook days — if you read the textbook — the way they say you build a database is: you turn on the database, you define your data model, you type in all your table names and column names and data types and all this stuff, and then you populate it with data. That's not actually how things work that often anymore. But the basic theme — that at some point, for software engineering purposes and for data management purposes, you want to model your data — that remains very, very true. All right, and that's something that a database system should provide, and we'll learn about it in this class. Now, there's a persistent, through-the-ages belief that all this is just a simple matter of programming. You know — why have a system for this? Why have a library for this? Just build this stuff; it's not that hard. And I would say that is true for all of computer science. Okay, there's nothing false about that.
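That raw-data-to-model step — taking data in whatever crazy log format it was born with and mapping it into a structure your analysis can use — might look like this minimal sketch. The log format and field names here are invented for illustration.

```python
# Sketch: mapping raw log lines (invented format) into structured records.
import re

raw = [
    "2015-01-20 09:12:03 user=alice action=click page=/home",
    "2015-01-20 09:12:07 user=bob action=view page=/about",
]

# One pass from "whatever format it was born in" to rows in a data model.
pattern = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) user=(?P<user>\S+) "
    r"action=(?P<action>\S+) page=(?P<page>\S+)")

records = [pattern.match(line).groupdict() for line in raw]
print(records[0]["user"], records[0]["action"])  # alice click
```

The first time, you write this custom for one analysis. The modeling question is what happens next: once these records need to join against resume data, demographic data, and so on, you want deliberate table and column definitions, not a pile of one-off scripts.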
It's just, you know, these things are tricky, and there's really no reason to do them over and over. And when you do them over and over, you get weird artifacts — like, you get recovery happening in multiple layers of your stack, and bad things happen. And this is a persistent lesson that the community has relearned. So you heard about the Microsoft example I gave you, but the same stuff happened, for instance, at Amazon, which was basically where the NoSQL database was invented and popularized. What ended up happening is, this very scalable NoSQL database called Dynamo, that they got a lot of attention for, had a lot of copycats in open source. And it gave very few guarantees on, like, whether two replicas of the same data item would really be the same. So what ended up happening as they built applications is, all the applications had to deal with the fact that you might have divergence of replicas, and that code got pretty hard to manage and pretty expensive to keep live. And so eventually they built a better infrastructure than Dynamo, which actually has more guarantees than the NoSQL stuff that a lot of the copycats now ship. Okay. So organizations have a tendency to start building things from scratch, and then to abstract out reusable components and guarantees with APIs. And so you'll learn in this class some nice cut points — some nice libraries and guarantees — where traditionally it's been good to layer things, and hopefully that'll save you some pain along the way. Or maybe when you see this pain somewhere, you'll say, you know, I think we could alleviate some of that pain with some shared services.
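To make that replica-divergence problem concrete, here's a toy sketch — two plain Python dicts standing in for two replicas of a Dynamo-style store, with invented keys and values. Each replica accepts a write for the same key during a partition, and nothing in the store reconciles them; the application inherits that job.

```python
# Toy sketch of replica divergence; the "replicas" are just dicts.
replica_a = {"cart:42": ["book"]}
replica_b = {"cart:42": ["book"]}

# During a network partition, each replica accepts a different update.
replica_a["cart:42"] = ["book", "lamp"]   # client 1 happened to talk to A
replica_b["cart:42"] = ["book", "shoes"]  # client 2 happened to talk to B

diverged = replica_a["cart:42"] != replica_b["cart:42"]
print(diverged)  # True: the store gives no guarantee they agree

# One possible app-level fix: merge the carts as a set union.
# Every application touching this data has to carry logic like this.
merged = sorted(set(replica_a["cart:42"]) | set(replica_b["cart:42"]))
print(merged)  # ['book', 'lamp', 'shoes']
```

The merge line is exactly the code that, per the anecdote, got scattered across every Amazon application and became expensive to keep live — which is why they eventually pushed the guarantee back down into the infrastructure.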
Okay. All right, so the current market. For the record, Berkeley has, since the early-to-mid 70s, had a tradition of impact from our research systems on the industry. And I also have done research at Berkeley that we've transitioned to startups and stuff like that, which is where I've been for the last three years. And I think it's good to know what's going on in the software market — both the internet software market, so think understanding what's going on at Google and Facebook and LinkedIn and the like, but also the enterprise software market: people who sell software to other people. What does that software look like? One thing to keep in mind on that — and this is really important; I don't know if you get enough of this message here at Berkeley, right in your Silicon Valley — there are like five companies that have, you know, Twitter-sized-and-up problems, right? There's Google, there's Facebook, and everything else is smaller. There are only like five to seven companies that have these really, really big big-data problems, and they make certain trade-offs that optimize for their scale. Those trade-offs often involve making the software really, really simple so it scales out really far. Those designs are often very poor designs for almost everybody else on the planet. Okay — so Hadoop, for example, was not really well engineered for anybody but Yahoo, and it's taken eight or nine years for it to be something that you can kind of deploy at a bank now. Just because something's been built to scale up to Google doesn't mean it's a good design for almost everybody else on the planet. And often it doesn't mean it's an interesting design either, because a lot of times they just simplify things to the bare minimum so they can scale it. Okay, so one thing we'll keep in mind as we talk through this semester: you should be raising your hand, or at least thinking in the back of your head — will this scale up to Google-sized things? And if the answer is no, will it scale up to, like, things
that are a tenth the size of Google? Because that's almost everybody, right? So keep these things in mind as you talk with companies. In general, lots of technology is almost more interesting sometimes — and certainly more widely beneficial — when it's done at a scale that most people can use. Okay. All right, when we talk about the database market, though: the relational database vendors still dominate — certainly in sales of databases, not bytes, because the bytes are in the internet companies, but sales for sure. And they've been around for a long time — very mature technology. All right. And actually, when you peel back the cover of something like Oracle or IBM's, they've implemented pretty much every trick you can think of; there's just 30 or 40 years of stuff in there. Now, they don't innovate a lot as a result, because the software stack has gotten so thick and crufty — it's not very malleable anymore — but there's a lot of goodness in there, okay? And in open source, you know, MySQL and PostgreSQL and SQLite and all these things are used very, very widely. And Postgres — which is something we'll play with, probably, when we do our SQL homeworks —
It's actually a very full-featured relational database, built here at Berkeley originally. And you can get pretty good databases out in open source. There are variants of the relational database that you'll hear thrown around when you get out in the field — things like main-memory databases, or in-memory databases, which have come up because memory is getting so big but you still want that database abstraction layer. And then column-oriented databases, where you store things by columns instead of by rows — which doesn't sound like a very big deal, and at some level kind of isn't, but you hear about it a lot in the market. These are variants of relational databases. At the same time, the sort of open-source NoSQL world is growing very quickly. On the analytics side, this is things like Hadoop MapReduce — from Yahoo, essentially, and the open-source community — and then Spark, which comes from Berkeley, has been growing very quickly recently. And then key-value stores like Cassandra, from Facebook, and MongoDB and CouchDB, which are independent companies — those are getting widely used as well. We'll talk about those systems in this class. Obviously, search is an important special case of a database. If you're dealing with large text corpora — which is the plural of corpus, which means body — then you obviously need to deal with text search. So you know about Google and Bing; on the open-source side, Solr and Lucene, which share the same history, are open-source packages designed for text database search. Interestingly, databases in the cloud are expanding very quickly.
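Back to that column-oriented idea for a second — a toy sketch (invented data) shows why storing by columns matters for analytics: a query that touches one attribute has to walk past every field of every record in a row layout, but can scan one tight array in a column layout.

```python
# Toy illustration of row-oriented vs. column-oriented layout.
# Same three records, two physical organizations.
rows = [
    {"user": "a", "age": 21, "city": "SF"},
    {"user": "b", "age": 35, "city": "LA"},
    {"user": "c", "age": 28, "city": "SF"},
]
cols = {
    "user": ["a", "b", "c"],
    "age":  [21, 35, 28],
    "city": ["SF", "LA", "SF"],
}

# Average age, both ways. Same answer -- but the column version touches
# only the "age" array, which is the whole point at terabyte scale.
avg_row = sum(r["age"] for r in rows) / len(rows)
avg_col = sum(cols["age"]) / len(cols["age"])
print(avg_row == avg_col == 28.0)  # True
```

In a real column store the "age" array would also compress far better than interleaved rows, which is the other half of why analytics vendors care so much about this layout.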
So Amazon EC2 has a bunch of services for data management: Elastic MapReduce, which is a Hadoop deployment; Elastic Search, which is a Solr deployment; and then Redshift, which is basically Postgres — a relational database thing that you can use in the cloud — and it's their fastest-growing product ever, actually, at EC2. So lots of people are using databases in the cloud. You wouldn't think people would put their corporate data up at Amazon, but a lot of people are. And Microsoft has its own stuff. And there are smaller players as well, like Heroku, where you can go get a database and put things in the cloud, and a lot of those are built on relational databases or other technologies like that. Okay, so we're about 50 minutes into this thing. So what are we gonna learn about in this class? All right. Well, first of all, these design patterns for computing with data. My feeling has always been that a database doesn't have a purpose in and of itself — database systems don't have purposes in and of themselves. I should be teaching you things that you can use to build interesting computing systems, right? And that's what this is going to be about. You're gonna be learning design patterns for computing with large amounts of data. All right. And if in your career you need to build a system that deals with large amounts of data, you will probably go back to some material from this class. Okay, so that's primarily what I want you to walk away with. Also things about, like, structure in data: when, why, and how should you structure data, in that process of going from raw data to a particular analytics output to a data product — like a recommender system, or a corporate database. Like, when and how do you deploy sort of data modeling ideas along the way? I do want you to get, you know, the basics of, like, how does a relational database like Oracle work?
How does a search engine like Google work? At least at a kind of nuts-and-bolts level, if not with all the refinements — because God knows there are a million refinements on both of these. You'll learn SQL. Very useful — everybody uses SQL still. You will learn about NoSQL systems and what they do, which is actually really easy, but we'll cover that along the way. You'll learn how to manage concurrency — and, you know, the techniques that were developed in the databases community for managing concurrency and transactions have been applied all over computing, including to hardware. Okay, so this is a very basic topic. It's not really a database topic; it's a topic about how you think about concurrent processing, which is a thing that computers just do, especially when you have more than one of them. Okay. We'll learn about fault tolerance and recovery — we'll learn some very particular, specific techniques for fault tolerance and recovery, and I will talk a little bit about alternatives as well. And then we'll talk — I think I'm gonna try to weave this in from day one — we'll talk about scale-out: how do you parallelize things by getting them on multiple machines instead of just one, and kind of farming out compute into clusters? And we'll also talk about replication to some degree, when we talk about NoSQL. And then — if you go to my personal web page and dig around, there's a poem by Herman Melville — you know, the guy who wrote Moby Dick — about art; it's called "Art." And there's this line in it that I really like. He's talking about what is art and how do you make art, and he talks about there being audacity and reverence: that combination of being willing to just be out there and say that things are wrong and just reinvent stuff, and at the same time the reverence — the idea that there are things you should respect and admire and learn from. So we're gonna try to do both of that — I'm gonna try to encourage both of that in this class.
There's a lot of classical material in the database field that's worth knowing, and it's also probably good to think about what would happen if you threw it out the window. Okay, so we're gonna try to ride that balance as we go through the semester, to some degree. Okay, so, summing up this part of the lecture: data is, as we said, kind of at the center of everything. In particular, though, before I go on — I really feel like data is at the center of computer science. And since that is the major a lot of you guys are in, I just want to take a minute and really think about how computer science has changed, even in the last five years, but certainly in the last 20 — from a field where, you know, data was sort of an add-on, an afterthought, to a field where data is kind of at the core. Okay, so we're gonna talk about this in a couple different ways. But fundamentally — you might think that in this class we're gonna learn to apply computer science to big data. It's actually the other way around: we're gonna learn about stuff in this class that's gonna allow you to do computer science. All right? Because computer science going forward is gonna be about large volumes of data. It already is today, frankly. Okay, so this class should really apply very broadly. All right, and before I go on, just a little more rah-rah, and then we'll get into some of the details. This is a slight adaptation of what I said three years ago. This was the stuff where I said, oh, in five years — I left it on the slide, actually; I should have changed it. So three years ago I said that in five years these professions would become very big professions. So we're three out of five years into it. People who program cloud systems?
Yeah, that's a big thing, right? A lot of the jobs you might be looking at coming out of this school will be building systems that scale up to Amazon, Google, you know, Microsoft Azure, or Facebook types of sizes. Data scientist — like, people weren't even willing to say that was a term three years ago, and now it's very clearly a growing field. We have a data science class at Berkeley; there's a data science major at Berkeley in the iSchool; and there are lots of people hiring data scientists. Okay. Data engineer is kind of one that still hasn't caught on as a word, but lots of people who work in IT around data systems are essentially engineering pipelines to build data — a lot of those cloud programming tasks are data engineering tasks. Machine learning architect: what does it look like to build a stack for machine learning? These are things that are emerging today, you know — not just what's a clever algorithm for better clustering, for example, like you might learn in a machine learning course, but how do you build a pipeline so that you can build, say, a recommender system or a fraud detection system? And what do the pieces of that pipeline look like — the engineering side of machine learning? It's really just emerging. But I said three years ago that in five years this would be a large fraction of the computing workforce, and I still think that you guys have a chance to be leaders in this space, because it's not too late — this is still emerging. So now is definitely the time to jump into these data-centric pieces of the field. Okay. A little administration. So, who are we? Briefly, I want to introduce four out of the five TAs — and I'm Joe Hellerstein. So, just background: I joined Berkeley in 1995, out of graduate school, and I've been here since. I've had various meanderings through industry: I ran a research lab for Intel; I've been involved in a couple of startups, including one that's still ongoing — mostly in the database and internet kind of area. And what else can I tell you?
I went to Berkeley for years — I'm a Berkeley alum, so that's good. That's probably it, and then I'll let these guys introduce themselves. Want to go first? Actually, I'll give you the microphone real quickly. Hi, I'm Derek Leung — I'm a third-year computer science and math major here at Berkeley. Hi guys, my name is Vikram — this is my last year at Berkeley, and I taught 186 last spring. I'm Michelle, and this is my last year at Berkeley as well. I'm Jay, and I'm a third year. And I'm Anthony, and this is my last year at Berkeley as well. All these guys have taken CS186, and Vikram has taught it, and they're gonna be very close to home with you guys because they have very recent experience with the class. So — I have always actually enjoyed working with undergrad TAs more than grad TAs, and I'm real happy to have these guys aboard. It's gonna be good. The other thing I'll say is, when the class gets bigger, we're able to have more TAs, and more TAs means a better-structured course, quite frankly. So in some ways it's good to have this many people. All right, so let's get into the nuts and bolts of the class workload. There's gonna be homework. We're gonna try to keep a real-world focus to the homework, so we're gonna do things like wrangle up some messy data to extract structure from it. All right — in fact, that's the homework going out today; I'd better hurry to teach you how to do it. We're gonna code up some scalable algorithms: I'm gonna have you implement an interesting analytics algorithm in a language that can scale up and be parallel — probably SQL. We're gonna modify the internals of a big data engine — namely Spark, which is the one that's being built here at Berkeley — so you get your hands inside sort of a hot new system and modify it to add function to it. And then towards the end of the semester we'll work with data visualization technologies, to build applications on top of the database that provide data visualization.
So those are all things that are very useful to know out in the real world, and we'll exercise ideas from class. In order to just kind of keep things ticking along, we're gonna have what we call vitamins, which is a weekly quiz. It's good for you — you should take it. I used to not do this before I had lots of TAs; the first time I taught here, three years ago, we started doing this. Before that we didn't have quizzes, because there were only a hundred students and two TAs, and it was like, who's got time to write all these quizzes and grade them? The thing was, I used to teach the students a good class and all that, but, like, you know, the midterm would come, and there'd be all this stuff that wasn't in the homeworks, and it would just hit them like a truck: "We had no idea — we were supposed to actually learn that stuff from lecture?" So that's bad. So this is gonna just kind of keep things ticking along — make sure you're paying attention and remembering what happened in class. They're not gonna be hard; they're gonna require you to keep up, okay? And they're really there for your health — like a vitamin. All right, and they will count towards your grade a little bit, because if you're like me — or like I was in college — you know, nobody really wants to do extra work; you've got too much going on. So we have to give a little incentive for you to do them. So it will affect your grade. I would expect that anybody who's diligent will get perfect scores, or near-perfect scores, on these quizzes through the semester. They're not gonna be trick quizzes. There'll be two midterms — the dates have not been fixed yet, because we have to get an overflow room to be able to administer the midterm to all the other people, so I have to get that set up. But there'll be two midterms, and a final, of course; the final date and time is posted on the web. We have a website — you can get to it at cs186berkeley.net. The office hours, sections, all that kind of stuff are up on the website.
Everything should be linked from the website. There is a textbook — it's Database Management Systems, third edition — which is getting a little crusty by now. But there is going to be no fourth edition, because both those guys work at Microsoft now. You know, they used to work at Wisconsin and Cornell, and now they work at Microsoft. It's a sad thing — so you guys should go get PhDs and take their jobs, right? It's an okay textbook — it's pretty good. But like I say, you know, we're gonna have to augment it with some updates for modern times, and we're gonna jump around in it a fair bit, because I don't like its very traditional sort of organization. It's not the right way to think about this stuff — it's very relational-database-centric, and it really starts thinking about data modeling way too early. I would not buy any of the alternative textbooks, but you might want to look at them at the library sometime. If you don't like the way that Ramakrishnan and Gehrke explain something and you're confused, a lot of times just reading the way someone else explains it helps a lot. So I've recommended one secondary textbook, which is the Silberschatz — Korth and Silberschatz and others — textbook. It's pretty good, too. It's the one I used in college a long time ago, and it's fine, but I wouldn't buy it. These things cost way too much — which is why no one wants to write one anymore, because you don't make any money and no one likes you. It's a bad deal. The website has links to programming resources — things like, you know, learning Python and SQL and stuff like that. You could Google them yourself, but we've provided some handy links. Grading and hand-in policies and all that are on the web page. I want to take a minute to talk about cheating. Don't cheat. My god — you're at Berkeley. This is perhaps the greatest institution of learning in the world. You're here to learn. You'll all be fine. You don't need to cheat.
It's crazy. Okay, and beyond that, you're cheating yourself — which is really true. And we'll catch you — which is also true, a lot of the time. So, you know, we have software that will crawl over your software, and, you know, we have eyes in the sky and big data and all that. So don't cheat. Okay. And if you know someone's cheating and you really want to tell me, you can — but we don't have an honor code like that at Berkeley, I don't think. All right. Piazza — really important: all class communication is via Piazza. There are like 350 or 400 of you right now. I don't want to get email from you guys — that's too many people. Okay? I also don't want, like, 25 of you to come up to me before class; I can't talk to 25 people when I'm setting up the microphone, I'm sorry. If it was a class of 20 we would totally do it — it'd be very personal — but we've got 350 to 400 people, so we should use Piazza. Post questions. The good thing about that is, when you ask a question, the answer is probably relevant to, like, 30 other people anyway, right? So ask the question out loud on Piazza and we'll answer it in a timely fashion. That is definitely the way to get hold of us. The same applies to your TAs — although since they're dealing with you in batches of, like, 35 — usually 70, actually — each, they might be able to take a little email. But still, it's, like, not fair; 70 people emailing them at once is a disaster area, right? So read Piazza regularly — you are responsible for knowing what's going on there on a more or less daily basis. That's it. There will be homeworks. The homeworks will be either solo or in teams of two, and you will have to stick with your team of two through the semester — so please, you know, do your speed dating this week, because next Tuesday we will pass out our first two-person homework assignment. Question? So — right now, discussion sections.
We're gonna try this this week: you can go to any discussion section you like. All right, a lot of them are doubled; there are two at the same time. Sometimes there's one in, like, Wheeler and one in Etcheverry, so if you, like, want more attention, I would walk over to Wheeler, because, like, a lot of people will be too lazy to do so. We're gonna try that this week. If we get too much skew, you know, some sections are too big and others are too small just by happenstance, then maybe we'll randomly allocate you guys. But for now we're gonna just say go to whatever section you want, and we'll see if that works. Okay, we're gonna have to take a little time to learn some computer science now. Yeah, so here's the thing: two homeworks are being passed out right now. The first one is homework zero. It's worth zero points, which is cool, but if you don't do it, you get kicked out of the class, which is bad. Okay, so this is just to make sure you register with GitHub and you get a course account and all that stuff, so that we can deal with homework one. If you don't have, sort of, the wherewithal and the moxie to sign up for GitHub and fill out your course account and log into your instructional account in the next two days, well, then you probably don't want to take the class, right? And there are a hundred other folks out in the hallway who do. So please do this by Thursday, or, you know the consequences: we will probably boot you from the class. Okay, it's gonna take almost none of your time. Do it. You have two days. It's 48 whole hours. I know you don't sleep anyway. Go for it. Homework one is due in a week. Homework one is a real homework. It's not hard, but it does take some time. I did it over the weekend; it took me some hours.
So, you know, it'll take you some more hours, but it's not hard. In fact, it's so not hard that there are almost no instructions other than how to set it up, because you're basically just gonna figure out how to do it yourself. But in the next 15 minutes, I will teach you what you need to know to do homework one. And details on the GitHub repos and all that stuff are accessible from the course website. The interesting information, like the how-tos for homework zero and homework one, is all on GitHub. You can read it on the web first if you don't know how to use git, and it'll step you through all the details. Okay, so, a little for-instance. Let's learn a little computer science, just a little bit, at a high level, enough to kind of get you dirty with your homework. This is what's called a von Neumann machine. How many of you guys saw the new Alan Turing movie? It was okay. It was a little boring. It was okay; I would go see it. It's, like, history and stuff. Anyway, von Neumann was another math guy. He was Hungarian, though; he wasn't British. And he came to the States, and bad things didn't happen to him, which is good. And he defined the kind of architecture that we've built for our computers. It's called the von Neumann architecture, and all our computers and programming languages basically use this abstraction for how a computer works. You have a CPU. That CPU has stored instructions in it. Those instructions are executed via a program counter, in order, like: one, read something; two, do something; three, write something; four, go to one. Like, that's a program, okay? And then there's a memory bank where you can do the little reads and writes, and you can put little numbers in that memory bank. Okay, that's a von Neumann computer. All right, it's different from a Turing machine, abstractly.
Okay, but this is the model of computing we use. And when you think about Python or Java or Scala or almost any language you would use today, it's kind of like that. There's kind of this linear order of instructions, and then these things you can put values in and take values out of, and you can kind of "go to" in one way or another with function calls and recursion, right? They're still von Neumann computers, single-threaded if you like, until you decide you want two of them. Okay, and then you have two threads, but they're two von Neumann machines, right? You're still thinking von Neumann. Okay. And then, by the way, there's data. Boy, where does that go? I don't know. They didn't have data back when von Neumann was doing computers. All right, they really barely had storage, and it's not part of the computational model. It's kind of over on the side. There's some special command that can, like, push things out of the memory bank into the database. All right, that's what Unix is too. That's what every programming language you're likely to work with, except maybe SQL, is like. Well, maybe not MapReduce either. But most programming languages are like this. This is terrible, all right? Because this is not the way the world is. The world is the database and, like, a whole lot of computers. It's not one little single computer with one ordered list of instructions, right? But this is crazy; it's deep in your brain. When somebody asks you what an algorithm is and they're not a computer scientist, they're, like, an English major, they say an algorithm is kind of like a recipe. You know, like when you're gonna bake a cake: you crack an egg, and then you mix it, and then you add flour, and then you mix it some more, and then you add sugar. Like, that's nice. That's how you bake a cake, but that is not how you, like, create the Hostess company, which creates delicious cakes for mankind, right? We're doing things at scale here, right?
And so it's crazy that we still think this way, and in your lifetime, the way we talk about algorithms, I believe, will also change substantially. We will not be talking in this sequential manner 20 years from now. And when we're dealing with big data, we can't talk this way today. Okay. All right, we scaled up. We've got lots of those things now. We've still got this big database on the side. That's how Hadoop works, okay, just for the record. There are, like, all these processing nodes, and then there's the Hadoop file system, which is, like, some other software abstraction. It's what Google built. All right, why? Because they didn't have time to think about anything else. They were building Google. It was, like, 2004. But shame on the Hadoop community for doing exactly what they did, and still doing it today, right? It's all crazy. Google doesn't do that anymore; they do all sorts of other things. But that's kind of the idea: you take a big storage abstraction and you stick it in front of your big compute abstraction, which is not really the way it looks. The way it looks is that all of that together is the computer you're supposed to program, right? It's a bunch of components which have storage, and they have memory, and they have the ability to do processing, and you want to program them en masse, the same way that, like, a commercial bakery is going to make sure to produce lots of cakes, all en masse. There might be some recipes deep down inside it, right, that get executed sequentially somewhere, but you need to orchestrate all that. That's the hard part. So this is really the key to distributed computing and parallelism: dealing with lots of data that's been spread around on lots of computers. All right, so today, in the last 15 minutes, we're going to learn some basic patterns for dealing with big data that will scale in this fashion, and that will work also for data on a single node that doesn't fit in memory. Right, because everything you do that doesn't fit in memory
you might as well do on two computers. Basically, everything I'm going to teach you now, you can just do on multiple computers at the same time, instead of kind of doing it a little bit at a time on one computer, and I'll show you what that looks like. So the two basic patterns you're going to use in your homework, and in a lot of things in this class, are the streaming computation model and divide and conquer, basically, okay, in one form or another. So here's what I mean. First simplifying assumption: I am taking away your arrays and your lists and all those data structures that you always like that have order in them, all right? And I'm giving you back collection types that do not have order, like sets and relations and, you know, collections of records. Okay, so there's no order anymore intrinsic in your data structure, which means that there's no order intrinsic in your program. Okay, so you can do things in any order you want, which is totally the opposite of von Neumann, with its "first do this, then do this, then do that." Okay, so things can happen in any order you like: unordered handling of unordered data. This will set you free. This is really good, and all scalable systems embrace this, essentially. So disorder is a friend of scaling. When you can order things to your liking, you can do things like reorder stuff for cache locality. You can reorder things to make sure that two items that need to show up at the same time do, even though maybe they're going to arrive at different times from different places. So you can postpone some stuff. You can work on things in arbitrary batch sizes. So if your memory on this machine is smallish, but your memory on that machine is biggish, you can do smallish amounts of work here and biggish amounts of work there. So you can pick your batch sizes to fit. Also, remember, memory is a hierarchy: you've got your L1 cache, your L2 cache, your RAM, your flash, your disk. And you may want to move data among those levels of the memory hierarchy, and you'd
like to do it in a way that's efficient, which means that you want control over the batch size. And you can do that if we don't care what order you do things in. It's okay, for instance, if some data comes at you and you're like, "Later for you. I'm going to put you over there for now; I'll get back to you later." That's fine, because it doesn't matter whether you handle things in order. Okay, and most importantly, if the ordering doesn't matter to the semantics of your program, then you can tolerate nondeterminism in ordering, i.e., parallelism. So if this machine races faster than that machine, that's fine. They're working on different items. This item gets handled first? No big deal, because we didn't care what order they got handled in anyway. So the key to parallelism without coordination is the ability to have disorder in your program and to tolerate nondeterminism of ordering. Okay, so this is going to be a great thing. I'm taking away your lists and your arrays and giving you sets. All right. Here's the thing, though: data tends to arrive, often, in streams. It may not come in any particular order, but it often comes, at least at one node, in an order. And what we want to do is take all that data, and it might be big. It might be a petabyte of data or a terabyte of data, and maybe all you have is your laptop. All right, let's say you have 100 gigabytes of data and you want to stream it through your MacBook, okay, and work on it.
Well, here's a simple case. All you want to do is, for every item that's in that collection of data, apply a function to it, f, right? This is like the map operation from MapReduce, okay? So the goal is to compute f of x for every record and write the results out to another big disk drive. All right, and the challenge is to do this in a small amount of RAM, and to not call that read/write interface too often, because every time you invoke read or write, the operating system does all kinds of crazy things, and devices get involved, and it takes time. All right, so you want to amortize. "Amortize" means get the most out of, or take a lot of activities and pay for them only once. Amortize the cost of those reads and those writes. So here's a basic pattern for streaming. It's the simplest thing ever, but you'd be surprised how often people don't do this. So the naive thing you'd do is you'd say, "Well, oh my gosh, I've got this file. I'm going to have to work on it. So the first thing I'll do is I'll bring it into memory, and then I'll iterate over the items in the file." It's like, well, that's not going to work, all right? Because the minute you try to bring that file into memory, you're dead. But that's what you do when you open a file in your editor, right? You're like, "Oh cool, I got a new data file. I'm going to look at it in Sublime Text." You, like, open it up and it grinds to a halt, right? Because it's loading the whole damn thing into memory, which is insanity. So we're not going to do that. We're going to stream. We're never going to have the whole thing resident in memory at the same time, but we'd like to use memory efficiently to amortize reads and writes.
So watch this. The approach is, we're going to read a sizable chunk, whatever seems to be appropriate to amortize those reads and writes, into an input buffer in memory. So we'll read some stuff, we'll copy it into this input buffer, and then we'll start picking things out in RAM. So this box is RAM. We'll start picking things one at a time out of the input buffer, applying f of x to them, and putting them back in RAM in an output buffer. Okay, and then there are two rules to keep in mind. The output buffer is going to be a fixed size; when we fill it up with stuff, the minute you put something in and it says "oh, I'm full now," then write it to the end of a file, and then erase the buffer. All right. The other rule is: when the input buffer's got nothing in it, get some more stuff, all right? A buffer-load. And that's all. And the one thing to keep in mind is that, based on f of x, those things may not be synchronous. So, for example, imagine that f is compress. So you get big objects in, little objects out. So you might read a hundred big objects, each one of them gets compressed by 10x, and you fit a thousand compressed objects here before you have to write them out. So you will do 10 reads before you do a write, right? And that's okay. That's fine. Similarly, if it's decompress, you'll do 10 writes before you do a read, and that's fine too. So these aren't in lockstep. The streaming is not in lockstep. These buffers allow you to have different rates for reading and writing, because your function may make things bigger or smaller. Very simple, okay? But this is, like, the most basic pattern for dealing with a big file in a small amount of memory, and it's at the core of a ton of tricks. And there are lots of interesting algorithmic issues about what you can compute this way exactly, and what you can compute this way only approximately.
There's all kinds of work on streaming models of computation, but the basic system architecture of it is just that little design pattern, and you're going to need this for your homework. Unix pipes do this. Oh yeah, you want to parallelize this? Well, that's no problem. Remember, this is one machine in our rack of machines. It's got an input disk and an output disk and a memory. Oh, our rack has more machines in it, with more data. That's cool. They just do the same thing. So this parallelizes forever, right? It's just like every one of these machines does its thing all by itself, and then you're done. Okay, so this parallelizes trivially. Now, if you had an ordered data structure, and you said, "Well, all the items across all the disks have to be processed in order, because I wrote a sequential program to do it," this wouldn't work, right? This all works because we don't care about the order in which these things are done. Okay, Unix pipes do the same thing. So these streams are basic to Unix, and there are a lot of utilities in Unix for working with data. They deal with the buffering and the operating system for you, and you just connect things together with pipes. So you can kind of build queries over files using Unix utilities and pipes. Here's a query: find students who got 100 on one assignment and got zero on no assignments. And you can use a combination of, you know, sed to get the rows out of the grades file but not the header, and you pipe it into grep, and then you pipe it into grep -v, which throws away things, and then you do a cut to get the right fields out, and you're done. Okay, and this all happens in a pipeline fashion. So that file, grades.csv, could be really, really big, but Unix is going to make sure that it doesn't run each one of those commands on the whole file in memory. It takes it row by row; in Unix, it takes it text line by text line.
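As a sketch, that pipeline might look like the following. The file layout here is an assumption: one row per student, with all of a student's grades on that row, so a per-row filter can answer the query. The tiny sample file just stands in for a potentially huge one.

```shell
# Hypothetical grades file, one row per student: name,hw1,hw2
printf 'name,hw1,hw2\nalice,100,90\nbob,100,0\ncarol,50,60\n' > grades.csv

# Drop the header (sed), keep rows containing a 100 (grep), throw away
# rows containing a 0 (grep -v), and cut out just the name field.
# Every stage streams text line by text line; nothing loads the whole file.
sed 1d grades.csv | grep -E '(^|,)100(,|$)' | grep -vE '(^|,)0(,|$)' | cut -d, -f1
```

On the sample data this prints `alice`: bob is discarded by the `grep -v` stage because he got a zero, and carol never matches the first `grep`.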
That's all you're given, unfortunately: text line delimiters. But that's what it does: it streams them through your memory. Okay, you're going to want to use this for your homework, right? This design pattern of pipes, with a single I/O at the front and the back, is exactly the picture we just saw. A bunch of the Unix utilities you may or may not have heard of are in this format. You probably will want to read the manual pages for some of these for your homework. They're very, very useful, very handy things to know. Not exciting to learn about. Very useful. Here's another thing you need to do all the time; here's another design pattern for data. And frankly, this is also, like, a core computational artifact. It's in almost every computing task you need to do, but it's rarely talked about in this fashion: rendezvous. I need to make sure that two items are in memory at the same time. This is pretty much what computation is, actually, and I'll save you the philosophical discussion of that; that's for another class. But, um, rendezvous is key to database systems, certainly, because the join algorithm, for example, which is core to databases, is about making tuples that are in one table and tuples that are in another table connect up. And we'll spend time talking about joins a lot. But rendezvous also happens in messaging: I want to make sure that a sender and a receiver find each other in space and time, and I'm going to have to make sure that somehow the sender's and the receiver's data is in the same place at the same time in order for that handoff to happen. All right, so streaming was easy; we just did one chunk at a time. Rendezvous algorithms are a little trickier. They need to make sure that two items are co-resident in memory at the same time, so that we can compute on both of them, and they may be coming from different streams. So how will we make sure that they don't miss each other?
And if you've ever tried to teach a small child how to catch a baseball or hit a baseball, it's like, oh my god. How do we make sure that the bat and the ball are in the same point in space at the same time, so the ball gets hit? Like, try to teach that. It's incredible that people can do that. We have to do that with data, all right? And we do it all the time in computers, but it's a time-space rendezvous, all right? And there may be many of these that you have to do to do a computation. So rendezvous: trickier than streaming, and usually we do it with divide and conquer. What we'll do is we'll divide up the data set into groups of things that could only possibly rendezvous with each other, all right? So I put all the apples over here and all the oranges over there, because we're only going to compare apples to apples and oranges to oranges. All right, so you do tricks like that to divide up the problem. These are often called out-of-core algorithms, because "core" is a very old word for memory, for RAM. Right, so algorithms that deal with data that's bigger than RAM are often called out-of-core algorithms, and they typically involve more interesting things than streaming. They typically involve some kind of rendezvous orchestration. So the typical way you do this (and this is just the design pattern; we'll see examples of this on Thursday, and you'll implement them next week), the typical way you do this is you'll allocate a chunk of RAM. Let's say B buffers of RAM, B chunks of RAM, capital B.
We'll use one of them for reading, as a read buffer, just like we did with streaming. Okay, so we'll be able to read things in decent-sized chunks. We'll use one of them for a write buffer, just like we did with streaming. But we're going to have B minus two buffers left over to hold on to stuff so that we can do rendezvous, you know, so that something can stay there in memory long enough in time that something else that it's supposed to match with will show up. Right, so that the bat and the ball will both be in RAM at the same time, essentially, if you like. Or the sender and the receiver will know that there's a place where that message will be, that they will both be able to get access to over time. So that's what the B minus two chunks are for. And the basic idea here, a typical pattern (not the only pattern, because sometimes it's reversed), but a typical pattern, is you stream-wise divide the data into B-minus-two-sized mega-chunks. Okay, so we're going to take the data, we're going to stream it in, and then we're going to take B-minus-two-sized mega-chunks of it, and we're going to write them to disk. So you'll, like, bring in B minus two chunks through the input buffer, and you'll massage them and do whatever you need to do to them to get them all ready for the future. Maybe you'll sort them. Maybe you'll build an index over them. Maybe you'll, I don't know, compute a statistical signature of them. I don't know what you're going to do.
So you give them a good massage. Then you're going to write them off to the disk, and then you're going to take B minus two more, and you're going to massage them and write them off to the disk. And now, on the disk, you have partitions of your data, and those have all been conquered individually. And then in phase two, the conquer phase, we need a streaming algorithm over those conquered mega-chunks. So now that we know things about each of these chunks, we can start bringing chunks together in memory and doing stuff to them, right? And this streaming is going to ensure some kind of rendezvous. All right, I know this was super vague, but we're going to see exactly this pattern on Thursday, so I wanted to get it out there. And this is divide and conquer: you divide the file into partitions, and you either conquer them on the way in, or you conquer them in that second phase. That's going to be a typical pattern we're going to see over and over, okay? That parallelizes also, but this is where parallelism gets a little interesting. In systems like MapReduce or databases or whatever, the data starts out all partitioned, but it might not be partitioned in the way you like, okay? It might be partitioned so that you can't form the mega-chunks you want to form. So what you might need to do is repartition the data. So the first thing you might do is send little bits of your data to other machines. So in this picture, those are three computers.
These are the same three computers on the right-hand side, okay? Or maybe they're not, actually; maybe they're three spare computers. But they could be the same three computers, and what you're doing is repartitioning the data. And then these guys, once it's been repartitioned, can act locally, because the things that should go together are together: all the apples are on the top one, all the oranges are on the middle one, and all the pears are on the bottom one. And that's the first phase: to partition up the data across machines. All right, I think we got through everything I wanted to get through today. Um, there's a wide variety of Unix utilities you'll need, but that is about it for today. So I'll see you on Thursday, when we're going to talk about out-of-core sorting and hashing. Homework... oh wait, stop. Don't go anywhere. Stop right there. In order for you to do homework zero, you need a course account form. The TAs have the course account forms. Don't move. Don't move. They will exit the room first. They will catch you at the door, and rendezvous will happen. All right, so guys, go to the door. Do not leave without a course account form. No, no, no, no, at the door. Let them get to the door. Go to the door.