Welcome to Introduction to Data Science. My name is Bill Howe and I'm the Director of Research for Scalable Data Analytics at the University of Washington eScience Institute, and an Affiliate Assistant Professor in Computer Science and Engineering, also at the University of Washington. In this first segment, what I want to do is go through some examples of data science activities and projects from the recent past that I found interesting, and use them to whet your appetite for the concepts that we're going to learn in this course.

Okay. So the first one I want to mention here is the presidential election from 2012, and I know you're probably sick of hearing about this if you live in the United States, and even if you don't, you may be sick of hearing about it, but bear with me. This is a map of the Electoral College: each state is colored by the candidate that took the electoral votes, and the numbers represent how many electoral votes each state has. What was interesting about this map at the time was that it led to a pretty significant discussion in the media about data science, because Nate Silver of the 538 blog was able to predict this map perfectly before the election. The discussion in the media talked a lot about what a genius Nate Silver was, mentioned the sophisticated mathematics he was using, how he's a whiz with these things. But what I thought was interesting was that Nate Silver would be the first one to tell you that the methods he was employing to make this prediction were actually pretty simple. He says here, in a series of quotes from blog posts around that time (this first one from October 26), that the intuition behind this ought to be very simple: Mr. Obama is maintaining leads in the polls in Ohio and other states that are sufficient for him to win 270 electoral votes. Then, what was funny, a few days later he becomes more blunt: the argument we're making is exceedingly simple. Here it is: Obama's ahead in Ohio. It's not a magic trick. And then after the election, on November 10th, when he was shown to be right with this flawless prediction, the blog post that this last quote is taken from was describing why he started the 538 blog in the first place. He says, look, the bar set by the competition was invitingly low. Someone could look like a genius simply by doing some fairly basic research into what really has predictive power in a political campaign.

And what really had predictive power in this case was the state polls themselves, aggregated. Historically, the state polls, aggregated, did a pretty good job of predicting the outcome of the general election. And so that's what he did. Now, there was some sophisticated work in quantifying the uncertainty, and certainly in presenting these results; there are a lot of beautiful interactive visualizations that he created in order to convey these ideas to the public. And that's one of the points I want to make about this: getting the answer is, in some cases, the easy part. It's interpreting the results, and convincing others of the results by presenting them, usually through visualizations, that can be the hard part. That's one of the themes we'll come back to throughout this course. So just to summarize, in case I didn't say it explicitly: simple methods plus enough good data win.
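Just to make that concrete, here is a minimal sketch of the kind of state-poll aggregation at the heart of the prediction. The poll numbers below are invented for illustration, and Silver's actual model did considerably more, weighting polls by recency, sample size, and pollster track record, but the core of it is an average, a comparison, and a sum:

```python
# A toy version of "aggregate the state polls, call each state for
# the leader, add up the electoral votes". All numbers are made up.
polls = {  # state -> (electoral votes, Obama's share in recent polls)
    "OH": (18, [50.1, 49.0, 51.2]),
    "FL": (29, [50.5, 49.9, 50.8]),
    "TX": (38, [41.0, 42.5, 40.8]),
}

def average(shares):
    return sum(shares) / len(shares)

# In this two-candidate simplification, a state is "called" for Obama
# if his polling average exceeds 50%.
obama_ev = sum(ev for ev, shares in polls.values() if average(shares) > 50.0)
print(obama_ev)  # 47 electoral votes from these three toy states
```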
Simple methods plus enough good data trump more sophisticated methods in many cases. That's another theme that we'll come back to.

All right, so something else related to the campaign, before we move on to other topics, was the system that the Obama campaign used for their data-driven ground game, so to speak: the ability to target specific categories of voters. What they did was build and maintain a massive voter database, and use it to design highly tailored messages for very, very specific groups. So the mother of two in a small town in Ohio who tweeted about the environment, mentioned organic vegetables on her Facebook page, had voted in 2008, and had registered on Obama's website but had never donated: she would get a message from Michelle Obama that highlighted Barack Obama's environmental policies, okay. In order to design these messages, what you had to do was ad hoc hypothesis testing about what might work and what didn't. You had to slice and dice this data at interactive speeds. And this is another theme that we'll return to: the need for this kind of ad hoc, interactive analysis.

The systems they used for this are pretty interesting too. This was a SQL database, a very fast one called Vertica. We'll talk a little bit about what makes Vertica special, I hope, toward the end of the course. But it is a SQL database, and SQL sometimes gets a bad name in data science contexts, as the old guard that can't possibly be used for analytics and doesn't really make sense in today's era. Don't believe it; it has a role to play in many cases. Here they did use Hadoop for the aggregate generation of anything that wasn't real-time, he says here, but for the speed-of-thought queries about the data, they used this Vertica database. Okay, and we'll come back to systems several segments from now.

Okay, so moving on. This was around the same time: when Hurricane Sandy made landfall, one of the things that struck me was that visualizations of available data were starting to emerge in real time in response to the storm. There are some very nice examples of people who used Twitter data to produce a map of where the power was going out. And in this even simpler case, Joseph Ruel got public data from just two different local weather stations and simply plotted it. This is the barometric pressure over the course of a few days (two days, I guess) in two locations, Atlantic City and Philadelphia. You can see this enormous dip is the storm passing through. You can also see the time lag between Atlantic City and Philadelphia, and you can see the intensity is probably a little bit higher in Atlantic City, given that the barometric drop is more significant there.

Okay, so a couple of things here. One is pulling data down from the web and repurposing it in real time, or at least in short order, maybe not quite real time, to produce visualizations, and then publishing those back on the web. I think this is very much the character of data science activities. In this particular example there's not necessarily a large data set involved, but repurposing data that was collected for a different purpose is a theme that we'll come back to. And again, we see the ad hoc nature of this as well. Fine. So another plot here is wind speeds, and they peak out at around 40 in Atlantic City, which is still the green.
And you can see that Atlantic City is indeed more intense here, and the gray here is error bars. So again, another variant on the same data.

Okay, so changing gears a little bit. This was a study titled "The Expression of Emotions in 20th Century Books." What they were interested in is: have the words that we choose to use in our collective literature changed over time? And does that tell us something about culture or civilization? I find the scientific inquiry compelling. But what I think is most striking about this, and why I wanted to include it as an example, is that the methodology they used is pretty straightforward. You could do this yourself without a significant background in technology or statistics or even linguistics.

So this is what they did. The first step is kind of a doozy: take all the books written in the 20th century and digitize them. Well, that would be a non-starter; none of us could do that. But that's okay: Google's already done it for us, and they've made the data available at this URL, so you can go check that out. What they've done is digitize the books, run character recognition on them, and produce these n-gram data sets. These are tables of data where each row has an n-gram, followed by the year and the count of the number of times that n-gram occurred. So the data has already been broken down and processed into a form that's digestible. Okay, and what's an n-gram? It's pretty simple. A 1-gram is just a single word, like "yesterday." A 5-gram, as an example here, is the phrase "analysis is often described as." In this study, they just ignored everything but the 1-grams.

Then they took some subset of those 1-grams and assigned them a mood score. How did they do this? Well, you can imagine that certain words are charged with a particular mood, associated with joy or sadness or fear and so on, and you can also imagine that synonyms of those words might be associated with the same mood. This analysis sounds non-trivial, and it is, but once again it's already been done for you: there's a resource on the web called WordNet where this kind of affect analysis has been done. So the authors of this paper were able to take the digitized books from Google, already broken down into n-grams, and the affect scores from WordNet, and then do this calculation, which may look intimidating if you're not used to staring at mathematical expressions, but it's actually pretty simple. This is the count of a particular word in the set, the set being the set of WordNet mood words, which is not as big as all the words; only some words can be scored for mood. And then you normalize by the count of the number of occurrences of the word "the." Why do they do that? Well, you need to normalize by something in order to account for the fact that perhaps we just write more books in 2005 than we did in 1937, or we've been able to digitize more of them, so we need to normalize by some total. Well, why not just normalize by the total number of words? The reason is that the word "the" is a better indicator of prose than the total number of words. Apparently we've also started to produce more captions, figures, technical language, and formulas, more non-prose utterances, in these books, and those can skew the results.
So they really want to capture how often these words are being used when we actually write prose in complete sentences, okay? Then you add those normalized counts up and divide by the total number of words in the mood set. And then there's one more transformation here that should look familiar if you recall your high school statistics: you subtract the mean and divide by the standard deviation, okay? So this is the standard z-score normalization. But that's about it. There's a count, and there's a division, and there are two data sets that you can pull from the web. They're big, but they're not exceedingly big; they fit in memory on most of your laptops nowadays. So it's a significant computational task, but nothing that requires Hadoop. You could do this in a weekend if you thought about it, okay? So I find that pretty compelling. Fine.

So these are the results. This is joy words minus sadness words: the z-score for joy minus the z-score for sadness. You can see there's a big dip after World War II, and that's one of the points they make in the paper. And then you can see this thing start to increase in the late 90s, okay? I won't try to analyze this for its scientific value; I'll just present the results. What I think is maybe more interesting is this one. This is now total emotion words minus total random words, and there's a prominent downward slope over time. So what's going on here? Well, apparently you can make the argument that we're using fewer emotion words over time, okay? That said, there's a bit of an uptick in this red line, and what does that represent? Well, that's fear words. And you can imagine some of the reasons why there might be an increase in fear words since the 1980s, okay? So this is pretty fun. This is a significant analysis that can be done just by taking data sets that they didn't have to prepare themselves, all right?

And then the other point I'll make about this: this is just a copy and paste of some of the papers that this paper cites, and I was struck by the titles here. Quantitative analysis of culture using millions of digitized books; quantifying the evolutionary dynamics of language; frequency of word use predicts rates of lexical evolution through Indo-European history; song lyrics and linguistic markers. What strikes me about this is that linguistics, anthropology, history, culture: these studies are becoming hard sciences by virtue of data-driven methods, right? All science is becoming data science, and therefore data scientists have a lot of power in this regime. It's a great time to be a data geek.

Okay. There's data journalism as well. I probably should have put a slide in here about this, but when the WikiLeaks material came out, you weren't going to pour yourself a pot of coffee and pore over those materials, print them all out and go through them one by one. You were going to write algorithms to do that kind of analysis: word-use analysis, looking for email chains and dialogues, computational methods to analyze the material. So now journalism itself is a computational enterprise. It's a data science problem, or at least it's amenable to data science techniques. So as a data scientist, the world is your oyster, all right?
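Before we move on, here's a back-of-the-envelope version of that mood-score calculation from the books study. The word lists and counts below are invented for illustration; the real inputs would be the Google Books 1-gram tables and a WordNet-derived mood lexicon:

```python
# Toy version of the mood-score calculation described above.
# All counts are made up; real inputs are the Google 1-gram tables.
from statistics import mean, stdev

# counts[year][word] -> number of occurrences of that 1-gram
counts = {
    1950: {"the": 1_000_000, "joy": 120, "delight": 80},
    1960: {"the": 1_200_000, "joy": 100, "delight": 90},
    1970: {"the": 1_500_000, "joy": 110, "delight": 70},
}
joy_words = {"joy", "delight"}  # stand-in for the WordNet joy set

def raw_mood(year_counts, mood_words):
    # Average, over the mood set, of each word's count normalized
    # by the count of "the" (a proxy for the amount of prose).
    return mean(year_counts.get(w, 0) / year_counts["the"] for w in mood_words)

years = sorted(counts)
raw = [raw_mood(counts[y], joy_words) for y in years]
mu, sigma = mean(raw), stdev(raw)
z_scores = {y: (m - mu) / sigma for y, m in zip(years, raw)}  # z-score per year
print(z_scores)
```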
So let me pause there, and we'll pick up with a couple more examples before moving on in the next segment.

So let's go through a few more examples. If you were asked to decide how important a particular scientific paper was relative to other papers, how would you go about doing that? Well, one way to decide between, say, these two papers here (I'll mark one with a blue circle and one with a red circle) is to wait until other papers start to cite them and count up the number of citations. So here, the paper marked with the blue circle has four other papers whose authors bothered to read it and cite it, so therefore it must have had more impact in the scientific community than the one marked with the red circle. But if we wait even longer, and that's indicated here by this darker blue color, you might get even more people citing this intermediate paper. And so maybe we can conclude that, over time, this one ultimately had more impact, because this paper was influenced by it and all of these papers were influenced by those. So perhaps we change our answer and say this one is more important. How do we decide between these two interpretations?

Well, this problem looks a lot like the problem of judging the relative importance of pages on the web. One thing you can do is say: a particular website is important if a lot of other important websites point to it. And here, you can say a scientific paper is important if a lot of other important papers point to it. The method that Google proposed and implemented, which ultimately became a pretty significant part of their success, was the PageRank algorithm. PageRank does exactly this: you add up the weights of the pages that point to you, then pass your weight on to everybody you link to, and you keep going until you reach some convergence condition. You end up with the relative importance of all the pages. This is a method that comes up quite a bit: whenever you have a graph, it often makes sense to run the PageRank algorithm on it, even when the problem has nothing to do with ranking things on the web, which is what it was originally designed for.

Using the same data set, Carl Bergstrom and Martin Rosvall created this visualization. So set aside the importance question; just think about the graph of the citation network and doing analytics on it. One kind of analysis you can do is judging importance. Another kind is this: over time, run a clustering algorithm (I haven't described what that is yet, but it groups similar documents together) and then map those clusters onto fields. You can determine with some confidence that this cluster represents medicine, because it has The Lancet and other medical journals in it, and you can conclude that this cluster represents molecular and cell biology. What's pretty striking is that if you lay this out in a timeline like this, you can see that some fraction of the molecular and cell biology community and the neurology community started to combine to form a brand new field of science called neuroscience. So just by doing this kind of analytics on this graph, by doing data science on this graph, you can uncover the emergence of new fields of science. I found that pretty striking.
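To make the PageRank idea a bit more concrete before we go on, here's a minimal sketch on a toy citation graph. The graph and the damping factor are illustrative only; real implementations also handle dangling nodes properly and run at web scale:

```python
# A minimal PageRank sketch on a toy citation graph.
def pagerank(links, damping=0.85, iters=50):
    """links maps each paper to the list of papers it cites."""
    nodes = set(links) | {v for vs in links.values() for v in vs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, cited in links.items():
            if cited:  # pass this node's rank on to the papers it cites
                share = damping * rank[n] / len(cited)
                for v in cited:
                    new[v] += share
            # (a node that cites nothing simply drops its rank here;
            # a real implementation would redistribute it)
        rank = new
    return rank

citations = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C"]}
print(pagerank(citations))  # "C" ends up with the highest rank
```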
Bergstrom and Rosvall have gone on to do many other kinds of analysis on this same data set. The overall meta-field of studying the scientific literature in order to draw inferences is called bibliometrics. This pen is a little wonky. Bibliometrics. Sorry for the bad view there. OK.

So what is not amenable to data science? Well, you might think food. But this is a paper, in a fairly respectable journal, that applied data-driven techniques to analyzing food pairing. What they did here is pretty interesting. They induce a graph on the ingredients (a graph meaning vertices and edges; if you're not familiar with graphs, you will be by the end of the course, but bear with me for now) by saying that if two ingredients appear together in a recipe, you draw an edge between them. So connect two ingredients if they appear together in some recipe, and build that big graph. Now you can analyze it in ways similar to what we just talked about: you can look at the community structure, you can find the clusters within this graph, and you can see whether those clusters correspond to well-known principles of food pairing. In some cases they do and in some cases they don't, and the authors show that they've uncovered things that were not necessarily known but appear to be there in the data. This data-driven approach, enabled by the fact that there are websites full of recipes online, has allowed us to put things on a more quantitative basis that were previously, essentially, old wives' tales. So I thought that was a fun example of an unusual use of data science in a respectable journal. (I'll sketch that graph construction in code in a moment.)

All right, another example is from the Last.fm blog. They looked at the tags associated with songs on Last.fm and used them to do some simple analysis of the emergence of genres over time, based on the popularity of the tags for songs from each period. You see things like post-punk, in red here, coming after punk, in purple, which is what you'd hope the graph would show. We also see a rise in rock and roll over time and then maybe a bit of a dip more recently. So this is kind of interesting. But the theme here, one we mentioned in the previous segment, is repurposing data: this data was collected simply to help with search (find similar music), and it's now being reused to draw inferences about the emergence of entire genres.

OK, so another example here that you may be familiar with. Google was able to show that by analyzing the search logs, the frequency of search terms, it could do a better job predicting the severity and the scope of flu outbreaks than the Centers for Disease Control. And by better here we mean essentially earlier: it was able to give more of a head start to the health community, and it was also somewhat more accurate. How did they do this? Well, it turns out that when you're getting the flu, you search for terms associated with flu symptoms more often, and by watching that uptick, you can predict that a flu outbreak is coming, OK? So that's great. This worked, they published a paper about it, and then they put up an interactive visualization that allowed people to track flu outbreaks going forward. But just this year, some folks showed that it didn't do a very good job this past season.
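Here's that ingredient-network construction, as promised: a minimal sketch, with made-up recipes standing in for the real recipe websites the authors crawled:

```python
# Build the food-pairing graph described above: vertices are
# ingredients; an edge connects two ingredients that appear together
# in at least one recipe. Recipes here are invented.
from itertools import combinations
from collections import Counter

recipes = [
    {"tomato", "basil", "garlic"},
    {"tomato", "garlic", "olive oil"},
    {"chocolate", "vanilla", "cream"},
]

edges = Counter()  # (ingredient, ingredient) -> number of shared recipes
for recipe in recipes:
    for a, b in combinations(sorted(recipe), 2):
        edges[(a, b)] += 1

# The heaviest edges are the most common pairings; community-finding
# algorithms would then look for clusters in this graph.
print(edges.most_common(3))
```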
Coming back to the flu: hindsight shows that Google Flu Trends far overstated this year's flu season. The explanation, they think, is that there was a lot of media attention on this year's flu season, because there was a bit of an uptick, and so it got amplified. That caused people to search for flu-related terms more often: perhaaps because they were worried about experiencing the symptoms, perhaps because they were trying to understand more about the outbreak, perhaps because they were searching for articles, perhaps because they were worried about their kids. But it was all a second-order effect of the media attention on the problem, which led to skewed results and ultimately a wrong answer. The point here is that it's great that they're repurposing data from the search engine to try to make predictions about something else, but it is biased data, and you have to be careful about what you conclude from it, OK? So there are limits here.

All right, another example, also analyzing web search traffic, but done with perhaps a little more scientific rigor, comes from some folks at Microsoft Research. Here what you're looking for is side effects associated with particular drugs, and these results are pretty striking. What this graph shows is, for a set of around a million users who had given permission for their web search traffic to be monitored: when you searched for this drug in green, what percentage of the time did you also search for terms associated with hyperglycemia symptoms? The answer is somewhere around 5%. For this other drug, it was somewhere around 4%. In the background, the average case, it was pretty close to 0%. And if you searched for both of these drugs, the odds that you also searched for hyperglycemia symptoms went up to 10%. OK. What's striking is that hyperglycemia is not a known side effect of either of these drugs, but it seems impossible to ignore in the web search data. There's just no reason to believe this could be explained by coincidence, and they develop that argument more carefully in the paper than I have here. So fine: repurposing data again, and this is also an example where a large data set, the web search logs, had to be used. Those are probably the two points I want to make about that, but it's a pretty fun one.

OK, so the last example I'll give is a different take; it's more about prediction than data. From last October, if you recall: six Italian seismologists were convicted of manslaughter for failing to predict a magnitude 6.3 earthquake in April 2009. While the locals were concerned about the seismic activity, the researchers were deemed to have been too reassuring about the risk. And so the point I want to make here is that there's liability. The scientific community was completely aghast that this happened, and I'm completely aghast, and pretty much everybody is, at the idea of holding researchers responsible for failing to predict something that is demonstrably, and known to be, impossible to predict. There's no seismologist on the planet who would argue that earthquakes are even remotely predictable. And yet the court decided that because they got the wrong answer, they were liable.
But it does bring up the issue that when you make a prediction, there's a certain amount of weight you're going to put behind it, whether intentionally or not. Understanding how confident you are about that prediction is an important part of the game here.

OK, so that was the last example. The themes we saw come out here: we gave a couple of examples of graph analytics. We showed that databases are useful, in the Obama ground game case. We saw a lot of examples of visualization, of communicating and interpreting the results. We saw some examples using very large data sets and others using very small data sets, so not everything's about big data. A couple of bullets that aren't on here: we talked about how ad hoc interactive analysis is not just faster but different, and so supporting it is important. And we talked about repurposing data: data collected, perhaps by someone else, for some other purpose, being reused to draw inferences about something else. That's a pretty common theme here. OK. In the next couple of segments, we'll talk about how we organized this course and some of the design decisions we made in creating the material.

So I want to talk about in this segment what this term data science actually means. You'll see these quotes around the web. In Fortune Magazine, they talk about data science being a hot new gig in tech. Hal Varian, Google's chief economist, said in the New York Times in 2009 (which is a while ago now) that statistician would be the next sexy job, and described it as the ability to take data, to understand it, process it, extract value from it, and communicate it; that's going to be hugely important. Another person who's prolific in thinking and writing in this space is Mike Driscoll, the CEO of a company called MetaMarkets, and he describes data science as it's practiced, this colloquial view of it, as a blend of Red Bull-fueled hacking and espresso-inspired statistics. Another quote of his: data science is the civil engineering of data, whose acolytes possess practical knowledge as well as theoretical understanding. This balance between pragmatism and theory is something we'll come back to.

Another perspective on this that you should be familiar with is the Venn diagram that made the rounds several years ago, by Drew Conway. His point was that data science is perhaps the mix of three different areas. One is hacking skills, programming expertise. Another is the academic view, the math and statistics knowledge. And the third that he added is this notion of substantive expertise. What he meant by this is a kind of deep investment in the data. If you think about your typical IT shop, they're typically building tools for other people to use to actually analyze the data, but they don't necessarily do the analysis themselves. A data scientist, in contrast, may participate in tool building, but they're also going to dive deeply into the data and do the analysis themselves. That's one of the ways I like to interpret Drew's bubble here of substantive expertise. Okay.
He also fills in the gaps between these. I'm not sure I'd agree that applying the statistical knowledge to a particular domain is necessarily traditional research, but perhaps it is; and some deep theory plus some pragmatic programming makes you a machine learning expert. And then he refers to the remaining intersection as the danger zone: you know enough about the domain to be dangerous, and you know enough hacking to be dangerous, but you don't know how to ground your analysis in proper theory. Some of my colleagues like to joke that computer scientists don't understand error bars, and that's what they're referring to here. Okay, fine.

So what do data scientists actually do? Well, there are some more quotes here from EMC, a company that acquired Greenplum and recently launched a pretty significant initiative in data science, related both to the Greenplum product and to other products. So: they need to find the nuggets of truth in data and then explain it to business leaders. And I like this "and then explain it to business leaders." Something we'll come back to, which I mentioned previously, is that communicating the results is a critical part of this process. It's not just getting the result. Okay.

Another view of this, which is interesting, comes from DJ Patil, who says data scientists tend to be hard scientists, maybe coming from a physics background, who have a strong mathematical background and computing skills, and who come from a discipline in which survival depends on getting the most from the data. They're used to torturing the data to extract every last ounce of value out of it. DJ has an applied math background himself, so maybe he's coming from that perspective.

Mike Driscoll also talks about the three sexy skills of data geeks: statistics; this thing called data munging, a funny word we'll return to (you'll see a lot of this colloquial language, and I'll give my perspective on what I think it tells us about the state of the world in data science); and visualization. What he means by munging, as you can imagine, is parsing data, scraping data from the web, converting between different file formats efficiently, and not getting hung up on the kind of friction you deal with when you're working with large and heterogeneous data sets. A data scientist is someone who's very comfortable in that environment and able to work nimbly even when things aren't very clean. And then, finally, visualization is the other sexy skill: the ability to communicate the results visually.

All right, another quote, from Jeffrey Stanton, who teaches a course in data science at Syracuse and was involved in one of the earlier programs in data science. He talks about the emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information. One thing that's interesting about his perspective is that it includes the word preservation. While preparation, analytics, and visualization are the three tenets you see quite often, and that we like in this course as well, Jeffrey goes one step further and talks about preservation.
So even after you're done communicating the results, what do you do with the data long term? Part of the reason he includes this is that he's got a background in library science and information studies, where they're very concerned with the curation of data. That's something we're probably not going to emphasize too much in this course, but it's definitely part of the overall data life cycle, if you will. Okay.

Another quote from a thinker in this space: Hilary Mason, the chief scientist at Bitly. She says a data scientist is someone who can obtain, scrub, explore, model, and interpret data, blending hacking, statistics, and machine learning. So we saw that blending before, in Drew Conway's diagram. And a data scientist not only has depth at working with data, but can appreciate data itself as a first-class product. What I think she's talking about there is being able to organize the data and actually produce something that's usable by other people, right? Having a quality data resource, a data asset that others can use to answer questions, is one of the outputs of data science. And notice she says scrubbing, where we saw munging on the previous slide. You'll see a lot of these terms. People will say data jiu-jitsu: the data scientist I want to hire is really good at data jiu-jitsu. In the next segment, I think, I'll talk about what I believe this kind of language means. Okay.

So to summarize what we talked about: there are perhaps three overarching tasks involved in data science, and those are preparing to run some sort of statistical analysis, actually running that statistical analysis, and then interpreting and communicating the results. And this first phase, preparing to run the model, is where we see all that data munging, cleaning, manipulating, integrating, and so on.

All right. Another view that I like to take is that data science is really about data products: producing data products that you may use yourself or that others may use. Okay, so what do I mean by data products? Well, one kind is a data-driven application. Think about a spell checker: this is not just a piece of software that does something; it's only enabled by a dictionary of words, and actually a dictionary of common misspellings too. Similarly, machine translation, where you can type a sentence in French and have it automatically translated into Arabic, relies not just on clever algorithms, but on an enormous corpus of French text and Arabic text. Okay. The second kind of data product is interactive visualizations. We saw this example with the Google Flu application. And there's a suite of visualizations on the web, which I'm not going to show right now, associated with the Global Burden of Disease, produced by the Institute for Health Metrics and Evaluation here at the University of Washington. What I like about it is that they've done a lot of research, but the output was not just a research paper (although they wrote plenty of those as well); it was these interactive visualizations that allow you to explore the data yourself. So this, I think, captures the notion of producing not just the answer, or not just a paper, but a data product that is usable by others. Okay.
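Going back to the spell checker for a second, here's a minimal sketch, in the spirit of Peter Norvig's well-known spelling corrector, of how the dictionary does most of the work. The tiny word list is invented; a real checker's value comes from a large dictionary plus word and misspelling frequencies:

```python
# Toy spell checker: the data (the dictionary) does the heavy lifting.
dictionary = {"science", "data", "statistics", "analysis"}

def edits1(word):
    # All strings one edit away: deletions, substitutions, insertions.
    # (Transpositions are omitted to keep the sketch short.)
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | substitutes | inserts

def correct(word):
    if word in dictionary:
        return word
    candidates = edits1(word) & dictionary
    return min(candidates) if candidates else word  # arbitrary tie-break

print(correct("scence"))  # -> "science"
```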
And then finally, another kind of data product might be an online database of some kind that others can actually use to query and answer their own questions. Here there's not necessarily a visualization component, but the work that goes into producing these things, I would argue, is part of data science. This captures a lot of the enterprise data warehouse work, software, and effort, of which there's plenty, including business intelligence work; in a couple of segments I'll try to differentiate those two. Another example of an online database, from science as opposed to business, is the Sloan Digital Sky Survey, which I'll talk about in more detail in a couple of segments.

So again, to summarize what's in red here: data science is about building data products, not just answering questions once. Data products are digital assets that empower others to use the data in new ways. They may help communicate results, for example Nate Silver's maps, or they may empower others to do their own kind of analysis, say with a data warehouse or a visualization.

All right. So let's talk a little bit about what distinguishes the term data science from other related fields. One related field is business intelligence. Business intelligence systems are associated with a couple of concepts: one is a data warehouse, and the other is a set of dashboards or reports that consume data from the data warehouse and are used to answer particular questions. Both of these components require a lot of upfront effort to design and build, and are therefore not too adaptable when requirements change. OK. So a software stack designed for business intelligence may or may not be appropriate for any particular data science problem, where changing requirements are considered the norm. Part of what warrants a new term is that business intelligence became associated with a particular approach to a particular set of problems, and data science is in some sense broader. OK. The other point I like to make about business intelligence is that BI engineers are not typically expected to consume their own data products, perform their own analysis, and make the business decisions themselves; usually they're building tools for others to make decisions with. As a data scientist, you'll be doing both.

So what about statistics? Well, statistical methods are at the heart of what a data scientist does day to day. But a statistician will typically be comfortable assuming that any data set they encounter fits in main memory on a single machine. And this makes sense, because the whole field was born out of the need to extract the most information possible from a very sparse, very expensive-to-collect, and therefore typically very small data set. If you only have 20 patients in the world with a particular disease, you can't just go find 20 more cheaply, so you need to come up with new mathematics to squeeze as much information as you can out of the 20 you already have. But that's not always the problem anymore. As we shift from a data-poor regime to a data-rich regime, the set of challenges moves from needing new mathematics to squeeze information out of a data set, to needing new engineering just to handle and process very, very large data sets. However, some of the methods, some of the models that you'll build, are the same in both cases.
Database experts, database programmers and administrators, bring a lot of skills to the table that make them appropriate for data science tasks. But there's a focus on a particular data model, usually the relational data model: rows and columns. So if you have data coming from sources such as video or audio, or even text, or to some extent graphs (nodes and edges, which we'll talk about), a relational database may or may not be the right tool. And even the concepts that transcend any particular database system may or may not be appropriate; we'll explore when and where they aren't as we get into the course.

Visualization experts also bring a lot of skills to the table, but, like statisticians, they have historically been less concerned with massive scale: data that spans many hundreds of machines. And finally, machine learning is perhaps the closest to data science. But here (and we'll make more of a point about this later), as a proportion of the time you'll spend on a data science problem, actually choosing the right model or machine learning technique, applying it, and running it is a fairly small fraction. What you'll spend much more time on is the preparation of the data, the manipulation of the data, the cleaning of the data, the wrangling of the data, as some have been saying. And for that, machine learning techniques are not particularly relevant; it falls back more toward the database experts and database programmers.

OK. So there are a lot of courses that could be considered data science courses. Some of them, the newer ones, use data science in the name; others have been around for a long time but are obviously in the same space. I want to spend a little bit of time describing the dimensions by which you could describe these courses, and then choose a particular point in this design space to motivate this course.

As a preface, let me show you this quote from Aaron Kimball, the CTO at WibiData. He said to me that he worries the data scientist role is perhaps like the mythical webmaster role of the 90s: someone expected to do everything. Companies knew they needed to get on the internet in the mid-90s, but they didn't know how, and so they said, well, we'll hire a webmaster, problem solved. The webmaster would write all the content for the website; they'd do the design and manage the user experience; they'd write the code to wire the website to the order fulfillment system in the back end; they'd structure the pages and do the navigation; they'd do the logging required to make sure the site stays up with reasonably high availability; they'd design the schema to hold the data that would be served out through the website; and so on. It wasn't really feasible that you'd get all of this in a single person, and so instead the internet strategy became the job of a broader team. Similarly, that might be what we see happening with data science. But here's what it means to me. The term data science tells me that if you're a database administrator, and your skills are solely about relational databases, the current trend is that you will need to learn more about unstructured data and statistical modeling. If you're a statistician, you will need to learn to deal with data that does not fit in memory.
If you're a software engineer who's used to building systems and working with files directly, you'll need to learn some statistical modeling and how to communicate your results to your managers; you'll need to work with these data sets and actually use them to make decisions. And if you're a business analyst who is trained to make decisions based on data, you're going to need to start understanding a little bit more about the algorithms and their trade-offs, especially at scale. And for a couple of reasons. One is that the costs change dramatically based on the technology you pick: with what's happening in cloud computing, which we'll talk a little bit about, and with these algorithms, you might be able to get an answer, but it may cost more or less than it did five years ago. The other reason is that as we do more fly-by-wire business, meaning we trust algorithms more and more to make decisions for us, those algorithms become opaque black boxes, and if you don't understand what's going on inside the black box, you're bound to misinterpret the results. It's no longer safe to just throw your trust over the wall to some algorithm, or to the staff running those algorithms; you may need to internalize the trade-offs in choosing one model versus another yourself.

OK. So here are the dimensions by which I like to describe these different courses. The first one is breadth, and I divide breadth into tools versus abstractions. Every sophisticated course would prefer to cheat towards abstractions: you want concepts that transcend any particular implementation. However, what students are interested in is hands-on experience using tools they can use tomorrow at a job, so you always have this tension between the two. An example here is Hadoop, which we'll talk about: Hadoop is an implementation of an abstraction called MapReduce, and the MapReduce abstraction certainly transcends its particular implementation in Hadoop. Here, as I'll mention in the next segment, I want to cheat towards abstractions whenever possible, but make sure there are assignments that give you the hands-on skills people are interested in.

The next dimension is depth. By depth, I mean the distinction between structural manipulation of data and statistical manipulation of data. You can think of the relational algebra as a structural formalism, a formalism for manipulating data structurally, while linear algebra is perhaps a formalism for manipulating data statistically. Here I try to strike a balance, but I actually lean more towards structure, and I'm going to defend that position in the next segment.

The next dimension is scale. At one end: yes, it fits in main memory on a single machine. At the other end is what I'll call cloud, meaning you might require hundreds of machines to work on it. Here I cheat towards cloud, for the reasons I've already described: it's no longer safe to assume the data fits in main memory, or to train people to work only with data of that size. The whole world changes when you start moving to even two machines, let alone 100, and not giving you some exposure to that change would not equip you to be an effective data scientist. And then the final dimension is the target audience: is this for hackers or more for analysts?
By hacker, I mean you already have significant programming experience and you're looking to round out your skills with some of the mathematics. Or are you more of a technology decision-maker who's trying to pick up a little technical depth? Here I like to strike a balance. I don't want this course to assume that you're a seasoned developer, but nor can we ignore programming altogether. So we're going to try to strike a balance between the two.

All right, so here are the choices we made in this course. One: we cheat towards abstractions. We cheat towards structs. We definitely like large scale. And then I say we'll strike a balance, but we actually cheat towards the analyst side: we favor the fact that there are going to be analysts in the room who don't necessarily have significant programming experience. I've already gotten a lot of questions from folks over email who say, hey, look, I haven't been doing programming day to day; am I going to be able to take something away from this course? And I think the answer is yes. Although there will be some programming, so be ready.

Welcome back to Introduction to Data Science. In this segment I want to talk about these four dimensions that I introduced last time. I want to justify the first three of them, and we'll talk about the last one next time. By justify, I mean I want to explain why I've positioned the needles where I have for this course.

So the first point is this dimension of tools versus abstractions. It may seem obvious that we want to focus on fundamental concepts as opposed to specific tools, but I can appreciate that people taking this course, and many other courses, really want hands-on experience, and we're definitely going to try to strike a balance. Let me try to motivate why I think it's important to focus on this angle, by telling one particular story that you see happen time and time again. In this case, we're talking about databases and what is currently going on in NoSQL systems.

Before 2004, you had the big three relational database vendors, plus some open source solutions like MySQL and PostgreSQL. Then, arguably, a big event in 2004 was when Jeff Dean and his colleagues published the paper on MapReduce at Google. If you haven't heard of MapReduce, we'll talk about it at length; if you have, bear with me. So this was great: what it allowed you to do was process very, very large data sets. And it sort of rebooted the database feature set: it focused on scale-out parallelism, and that's it, none of the other features of databases. This was exciting to a lot of people, because they didn't have to deal with the extra database features they didn't need, nor pay the exorbitant license fees associated with databases. This seemed like, boy, the right solution. It took a few years, but a few years later you had an open source implementation of the ideas in this paper, called Hadoop, led by some folks at Yahoo. Now, in that same year, one of the earliest and most successful projects within the Hadoop ecosystem was this system called Pig. And what Pig essentially was, was a relational algebra programming environment for Hadoop. If you haven't heard of relational algebra, don't worry; we'll talk about it. But notice the word relational there.
Relational algebra is the secret sauce within relational databases. So a very early project that was deemed necessary in the Hadoop community, and was wildly successful, was to have relational-style programming on top of this non-relational system. Moreover, you had other competing projects like DryadLINQ (well, Dryad, and then DryadLINQ, which is an interface to Dryad) which also provided a relational-algebra-oriented programming environment for large, scale-out, parallel data processing applications. Then you had people literally put the language SQL on top of Hadoop, so instead of just the underlying formalism, you literally had the query language itself. A bit later, you had indexing for Hadoop, which is another feature databases have that we'll talk about. You had people talking about schemas and more sophisticated types of indexing, two more things databases have. And now you start to see transaction processing becoming a very hot, very important topic in NoSQL systems: how to support concurrent access at very large scale, transparently. This slide is perhaps a little bit old; it's now 2013, of course, at the time of this recording, and the Spanner system from Google is an important system to look at that we'll talk a little bit about later.

OK, so now this isn't to say that MapReduce was useless. We're going to talk about it at length, and for very good reason: it has made some pretty important, permanent contributions, three of which I mention here. One is that it was the first system to really emphasize fault tolerance. The idea, in a nutshell, is that when you're working with 1,000 computers at a time, for any length of time at all (a few minutes, a few hours), the odds of one of them failing in some way are extremely high. Databases didn't typically have to worry about this: first of all, they weren't running on thousands of computers at once, and second, they operated under the assumption that your queries would typically be pretty fast. So fault tolerance during query processing, not losing all the work you've started when something dies mid-query, was something the MapReduce paper and Hadoop really emphasized, and it has now been accepted by the larger community.

The other notion, which is a little more subtle, is this idea of schema-on-read. What I mean by that is this: the way databases worked in the past, and largely still work, is that you design this thing called a schema, a particular structure for your data, and then your job is to fit your data into that schema. Until you do so, the database doesn't really want to talk to you; it has nothing to offer you until you fit your data into some sort of schema. And the observation was: look, a lot of data doesn't come pre-equipped with a schema. We don't have a schema just lying around, we'd have to do something to produce one, and the data is huge, many hundreds of terabytes or something. So what do we do? Well, one answer is you can use these MapReduce-style systems, which let you apply structure at read time. Having to load your data into a database schema before you're even allowed to touch it was kind of a non-starter for a lot of applications.

And then finally, this idea of user-defined functions is something that most databases support. It's the idea that you might want to do things beyond what you can express in a normal SQL query: you might want to write your own code and push it into the database.
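Since we keep talking about MapReduce without having defined it, here's the canonical word-count example to fix ideas. This is a toy, single-machine simulation of the programming model, not Hadoop itself: you supply the map and reduce functions, and the framework handles grouping by key (and, in the real systems, distribution and fault tolerance):

```python
# A toy, single-machine simulation of the MapReduce programming model.
from collections import defaultdict

def map_fn(document):
    # Emit a (key, value) pair for each word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Fold all the values that share a key into a single result.
    yield (word, sum(counts))

def run_mapreduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in docs:                        # "map" phase
        for key, value in map_fn(doc):
            groups[key].append(value)
    results = {}                            # "shuffle" then "reduce" phase
    for key, values in groups.items():
        for k, v in reduce_fn(key, values):
            results[k] = v
    return results

docs = ["the cat sat", "the dog sat"]
print(run_mapreduce(docs, map_fn, reduce_fn))  # {'the': 2, 'cat': 1, ...}
```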
Now, back to user-defined functions: the experience of having to write, maintain, manage, and use them inside a traditional database is not great. That's why a lot of people put their logic inside the application, as opposed to pushing it down into the database, where arguably it could do more good, for reasons we'll talk about. And I think MapReduce argued that, look, you can give the Java programmers what they want, a Java programming environment, and let them write scalable systems without forcing them to use this crazy user-defined function interface that databases offer.

So what's my point with all this? Well, if we focused too much on tools, what you would get is a snapshot in time of which tools are important, as opposed to seeing that these database features ebb and flow in popularity, but have a kind of permanent value in your reasoning about large-scale systems. Similarly, you might lose track of what's actually novel and actually new in the midst of the conversation about relational databases versus NoSQL systems. So I want to focus on these abstractions throughout the course whenever we can.

Great. So if we're going to focus on the abstractions: what are the abstractions of data science? Well, it's not clear that people really know yet, and I'll make my case for this on the next slide. The reason I don't think we really know yet is that you see these words being used, like data jujitsu and data wrangling and data munging: "the real skill of data scientists is that they have to be able to wrangle data." What does that mean? My translation is: we don't really know what we're talking about yet. That said, there are probably a few candidates we can consider. Maybe everything's a matrix, and everything we want to do with data can be expressed in linear algebra. If you're a database person, maybe everything's a relation, and everything you want to do can be expressed in relational algebra. If you're more of an object-oriented programmer, everything's an object, and we communicate between objects by sending messages back and forth, by calling methods. If you're more of a sysadmin type, then everything's a file, and we write bash scripts to process them. If you're an R programmer, then maybe everything's a data frame, and we call functions on it from this library. And similarly in MATLAB, everything's an array (or a matrix or a vector, I guess, in their parlance), and everything's a function applied to one.

Of all these possibilities, I think two stand out as likely candidates for the fundamental abstractions of data science, and those are the first two here. The reason is that we see these abstractions appear over and over again, independent of particular tools. Now, relations and relational algebra are closely associated with databases, but as I argued a few slides ago, and as we'll see throughout the course, they come up time and time again; you even see them in, say, object-oriented languages, and in R, and so on. So these are the two that we'll mainly focus on in this course; I'll show a small side-by-side sketch of them in a moment.

So now I want to motivate desktop scale versus cloud scale. The argument for desktop scale is: well, data science is really about the functions and the statistics and the manipulation and the techniques, so we can push large-scale data into a separate course, a separate category, and really just focus on the math and the functions.
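Here's that side-by-side sketch: the same tiny computation (average rating per item, on made-up data) expressed once against the relational abstraction, a group-by with an aggregate, which is exactly what SQL's GROUP BY does, and once against the linear-algebra abstraction, an indicator matrix times a vector:

```python
# The same computation under the two candidate abstractions.
# Data is invented for illustration.
from collections import defaultdict
import numpy as np

rows = [("apple", 4.0), ("apple", 2.0), ("pear", 5.0)]  # a tiny relation

# Relational style: group by key, then aggregate. In SQL:
#   SELECT item, AVG(rating) FROM ratings GROUP BY item;
groups = defaultdict(list)
for item, rating in rows:
    groups[item].append(rating)
rel_result = {item: sum(rs) / len(rs) for item, rs in groups.items()}

# Linear-algebra style: a 0/1 indicator matrix (items x rows) times
# the rating vector, divided elementwise by each item's row count.
items = sorted(groups)
indicator = np.array([[1.0 if item == i else 0.0 for item, _ in rows]
                      for i in items])
ratings = np.array([r for _, r in rows])
mat_result = (indicator @ ratings) / indicator.sum(axis=1)

print(rel_result)                    # {'apple': 3.0, 'pear': 5.0}
print(dict(zip(items, mat_result)))  # same answer, via matrices
```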
Now, on the scale question: I think pushing large-scale data off to the side like that is a bit of a mistake for a data science course. The reason is that main memory is a fundamental limitation of a whole category of technologies, and R itself is included in that, although there's a lot of great work on how to scale R up. In its basic usage, what you do with R is read a file, load the whole thing into main memory on one machine, and then call functions on it. So if your data doesn't fit in main memory on one machine, you're kind of out of luck. Now, you can be clever: you can start to use indices to limit the data you need to access, and you can try to go parallel, to take advantage of the fact that the computers you buy nowadays have four and six and eight and twelve cores. But trying to do all that cleverness yourself overlooks the fact that a lot of these techniques are pretty well understood and already implemented in other systems. Being cognizant of what other systems can do, and flexibly writing your application in terms of systems that already scale out, is a critical skill in data science.

The point being made on this slide, which is somewhat out of date, although you get the idea, is about simple tools available on every machine, such as grep. If you haven't heard of grep (say you're a Windows user and don't use it much), it essentially searches a file for a particular pattern. But it searches linearly: it looks at every single line of the file and checks for the pattern. You can do a linear scan of a megabyte in maybe a second, a gigabyte in a minute, and so on. At very large scales, you can't do that linear scan anymore; you have to search in a smarter way. He has some cost figures over here that are probably horribly out of date by now. Fine. So the point is that large-scale data is not just bigger, it's different: it requires a different way of thinking about techniques, and a different stack of technologies, and it's a mistake to ignore that in a data science course.

And then this final dimension of hackers versus analysts. Again, what I mean here is: am I going to require deep programming proficiency in order to participate in this data science class, and in data science activities generally? The answer is: I don't think so. I think we need at least two types of people, and really a broad spectrum of people. And this isn't really my idea. There's an often-quoted report from the McKinsey Global Institute; you'll see this quote time and again. The people who use it tend to focus on the first part, which talks about 140,000 to 190,000 people with deep analytical skills. But the second part of the quote says you also need 1.5 million managers and analysts who know how to use the analysis of big data to make effective decisions. So it won't just be programmers working in this space, and I wanted to think about how to design a course that could appeal to, and inform, both categories of people.

And this is my last slide of this segment. The other reason I frame it as hackers versus analysts is that the line between them is blurring nowadays, and technology can actually help here. It doesn't require a PhD in computer science, or even a bachelor's degree in computer science, in some cases, to manipulate large data sets.
And to back up this claim, I'll give you an example from some of my own work, where we have tried to make databases easier to use for, say, biologists. Here's a really nasty looking SQL query; if you squint closely, you can see that it's actually doing interval arithmetic over genetic sequences. This is a pretty tough query to understand even for experts. But it was written by somebody who doesn't do any programming whatsoever. She doesn't write a line of Python. She doesn't write a line of Perl. She doesn't write a line of R. And yet she's able to write these SQL queries that process very large data sets. So the point is that if you understand what's going on, if you can think in terms of some of these abstractions and you understand your problem well enough, you can participate in the activity of manipulating large data sets and doing data science, even without a deep background in software engineering. And that's why I want to push this needle somewhere over this way; I'd probably put it in the middle, I suppose. I'm not trying to focus only on the analysts. I just want to make sure that they're included. OK. Next time we'll pick up with the last dimension.

Welcome back. Last time we talked about three out of four of these dimensions describing how we designed this course in data science. So in this segment, I want to talk about the last dimension, what I call structs versus stats. This is the relative importance of data manipulation versus deeper mathematics, and you can see that I've put the dial here a little bit to the left; I'll try to motivate that in the next few minutes. All right. So we already saw one example of this in the first segment, where I gave some examples of data science from recent history. One of these was Nate Silver's prediction of the electoral college votes for the 2012 US presidential election. And if you recall, this prediction was accomplished by essentially taking the average of the state polls for each state. So it didn't really require a sophisticated statistical model, and yet it had massive impact. A quote that I think sums this up comes from Aaron Kimball at a company called WibiData. He says 80% of analytics is really just sums and averages. And what he means by this is that if you can get these sums and averages right, if you can do it at any scale on any data that you might see, then you can always build up more advanced techniques; everything sort of boils down to sums and averages. So I think this is a motivation for focusing on data manipulation, which typically is associated with being able to express sums and averages, for example in a database query, which we'll talk about in the next couple of lectures. That gets you a pretty long way; it gets you 80% of the problem. Another way of looking at this is that there are three main tasks involved in a data science project: preparing to run the model, running the actual statistical model, and then interpreting the results and communicating them. I got the animation out of order here, so you can ignore that red. But the point here, again from a conversation with Aaron Kimball, is that 80% of the work is really in this first step: gathering data, cleaning it, integrating it, restructuring it, transforming it, loading it, and so on. Really, all these verbs you see here, this is the hard part.
And so actually running the model, or even choosing the model and then running it, doesn't tend to keep people up at night in practice. And then the joke here is that the other 80% of the work, implying that a data science task is sort of 160% of a normal task, is in interpreting the results: the visualization and the communication and the explanation of the results, okay. So this is another reason why I wanna focus in this course on the data manipulation tasks associated with that first task, okay. Another way of looking at this is a quote that's now really old, right? It's 12 years old or so at the time of this recording, from Doug Laney. This is the document that first coined this notion of big data being about the three V's of volume, velocity and variety, and we'll talk about that in a couple of segments. But he has this quote: no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures and inconsistent data semantics. This is what the database community, my community, calls the data integration problem. This is the hard part. And so he was saying this back in 2001, and I would argue that it's still true today; this is the greatest barrier. And in this context he was arguing that variety is harder than volume or velocity, and I'll explain more about what those V's mean in a couple of segments. All right, so another vignette here is something that we like to ask the scientists we work with. These are, you know, astronomers, oceanographers and biologists. We ask them informally how much time they spend quote handling data as opposed to quote doing science. Now, we let them interpret these phrases however they want. But what we mean by doing science is, say, choosing a statistical method or designing a statistical model, which they absolutely consider part of their science. And what we mean by handling data is all the other crap, you know, the format conversions and so on. And so what do you think the most common answer is here? You can guess to yourself for a second, but they don't even blink. They say things like 90%. And so this number should give pause, right? This is taxpayer money that goes to federal funding agencies and comes back to pay some postdoctoral fellow to spend 90% of her time doing something that she doesn't even consider science. And so this is why I think it's really important as a data scientist to focus on this problem. Now, you might say, well, that's just science. What about business? But as I will try to make the point throughout the course, there's an increasing alignment between what's going on in business and what's going on in science. Okay, and we'll talk about that more in a couple of segments. All right, so if 90% of the problem is handling data, you know, boy, we ought to spend a lot of attention on that. All right, so another argument that follows on from the first slide that I gave is that structs, that is, the data manipulation platforms and databases in particular, actually go a pretty long way toward being able to express more advanced things. And this isn't just a matter of, oh, well, you can express anything if you have sums and averages. It applies even to fairly advanced techniques: there's an increasing amount of interest in figuring out how to get this stuff into the database.
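As an aside, to make that sums-and-averages point a little more concrete, here is the flavor of query I have in mind; this is just an illustrative sketch, and the orders table and its columns are hypothetical, not something from the slides:

    -- totals and averages per group over a hypothetical orders table
    SELECT region, COUNT(*) AS num_orders, SUM(sales) AS total_sales, AVG(sales) AS avg_sales
    FROM orders
    GROUP BY region;

If you can express this kind of grouped sum and average over data of any size, you have the building block that the 80% claim is about.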
Okay, so this is a slide I'm taking from Christian Grant, where he argues that, look, if you consider databases versus statistical packages such as SAS or MATLAB or R or SPSS, this is what people are doing now: they're downloading data to use in their favorite statistical package, frequently under the assumption that, well, of course I have to, right? Of course that's the only thing that could possibly express this. Well, look, in most of these stat packages, the first thing you'll do is read the data off of disk, load it into memory, and then start calling functions on it. But increasingly, data sets simply don't fit in memory on a single machine, certainly not on your laptop. And so you have a couple of choices here. Either you shift to some kind of fancy cluster version of the tools, which do exist for things like SPSS and MATLAB, although they're quite expensive. Or you sample the data so that you only work with a subset that actually does fit in memory. And you'll see this to be very, very common; it's just par for the course to take a sample of the data in order to be able to work with it efficiently. Right. But the point here is that this isn't really required if you use different packages. The argument here is that if you can use databases, if you can figure out how to perform your task in the database, you'll get the scalability for free. Moreover, these toolkits don't necessarily have any notion of parallelism, right? So even if the data does fit in memory, every machine you buy nowadays has at least four cores, and probably more like eight, and soon 12 and 16. Taking advantage of all those cores on your problem is something you're gonna be looking for in a package, and it's something that databases can do automatically. Most databases, not all; in fact, the ones you may be familiar with, MySQL and Postgres, typically do not, but other databases will, and we'll talk more about this. Okay, so you get parallelism for free if you can use a database, and you get scalability beyond the size of main memory for free if you can use a database. And that's perhaps a big if, and we'll talk about it. Okay, so let me give you an example here, and you'll actually do this as part of a homework assignment. Can you express matrix multiplication in SQL? If you can, then I'd argue, well, hey, now any formula that you can express using matrix multiplication, you can perhaps express in SQL by doing this over and over again. Okay, and the answer is yes. And in fact, the simplest version of this is pretty straightforward. So if you haven't ever seen SQL before, don't worry, we'll talk a little bit more about this; if you have, bear with me. Imagine you have two matrices A and B, and the representation of each matrix here is as a row ID, a column ID and a value. All right, so that's your relation. Now, this is a very inefficient representation if your matrix is dense. I'll tell you why, and you can think about it a little bit more as well: an implicit representation only needs the values themselves. Say you have five rows and six columns; then you only need the 30 values, five times six. But here you have to store 30 row IDs plus 30 column IDs plus 30 values, so you roughly triple the size of your data relative to an efficient dense main-memory representation.
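To give you the flavor, here is roughly what such a query looks like; a minimal sketch, assuming each matrix is stored as a table of (row_num, col_num, value) triples, with the table and column names being my own choices rather than the ones in the assignment:

    -- sparse matrix multiply C = A x B over (row_num, col_num, value) tables
    SELECT A.row_num, B.col_num, SUM(A.value * B.value)
    FROM A, B
    WHERE A.col_num = B.row_num
    GROUP BY A.row_num, B.col_num;

The join matches each entry of A with the compatible entries of B, and the SUM adds up the contributions to each output cell.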
So why would you do that? Well, it turns out that a lot of matrices in practice are sparse, and I put that word right up here at the top. In a sparse matrix, not all the cells actually have a value, and so you don't need to store them all. And so this representation, explicitly having a row ID, a column ID and a value, turns out to be pretty efficient. In fact, sparse matrix solvers use exactly this kind of representation internally. All right, so if you have a sparse matrix and you represent it in a database this way, then expressing matrix multiply is not too bad. What you wanna do is, for each column number in matrix A, find the corresponding row number in matrix B, and then add up all the contributions to the new value; that's exactly what the join and the SUM in the sketch above are doing. Let me skip going into too much more detail about this right now, because I'm gonna talk about it in detail in preparation for the homework, where you'll do this. So right now, what I want you to take away is that representing matrices inside of a database sounds very unusual, but it's actually not the world's worst idea. And in the readings, in this MAD Skills paper, you'll see why. So for now, just take away that it can be done and it's not necessarily a terrible idea.

Welcome back. I wanna talk a little bit about how the term data science relates to other fields of science. In particular, I wanna introduce the term eScience, which to a first approximation you can think of as equivalent to data science. While the term eScience is associated with astronomy and oceanography and biology, data science has been adopted more in business, but they involve a lot of the same concepts. So let me tell you about what's going on in science. For thousands of years, scientific inquiry has been empirical, right? You observe the natural world, or in some cases maybe replicate the natural world in a controlled environment in the laboratory, and make observations about that. In the last few hundred years, science has accepted theoretical models as a valid method of inquiry, one that reinforces the empirical methods: new theories suggest new experiments, and the theories help explain the observed data you get from the experiments. In the last 50 years or so, high speed computation has enabled an entirely new method of scientific inquiry. You can simulate in the computer phenomena that you otherwise couldn't study, that you can't observe directly and can't reproduce in the lab, and for which even the theoretical models become too complex to solve analytically using essentially paper and pencil. But you can actually start from initial conditions and run the simulation to get a result. So this is maybe what goes on in the interiors of stars, or the shifting of tectonic plates, or the evolution of the universe, or the effects on an ecology of some species dying out, and so on. So that's three methods of inquiry. But in the last 10 years or so, there's been arguably a fourth method of scientific inquiry, which is to acquire massive data sets from instruments or from simulations, and then explore those data sets using new algorithms and new infrastructure.
And so eScience is really about massive and complex data, data large enough to require automated or semi-automated analysis; you can't inspect it directly. The relevant tools here are the same as those for data science: databases, visualization, scale-out computing, maybe the NoSQL systems, machine learning techniques, web services and so on, okay? This idea is called the fourth paradigm; there's a book by that name in the reading list that you can refer to, along with some other articles. The story's been told lots of ways. The way I like to tell it is that science has always been about asking questions, but conventionally it was really about querying the world. You would have data acquisition activities, experiments or field studies, that were coupled to very specific hypotheses, right? You had the question in mind first, and then you went out and collected data. But eScience has shifted this a bit: now you're kind of downloading the world first, putting it into some sort of representation in the computer, and then querying that database to test your hypotheses. So the data can be acquired independent of any specific hypotheses, in some cases, okay? And this is due in part to the cost of data acquisition dropping precipitously thanks to advances in technology, right? The telescopes you can build now, which we'll talk about in the next couple of slides, can acquire enormous amounts of data at very high resolution. In the life sciences, you have laboratory automation and high throughput sequencing. In oceanography, the sensors are getting cheaper. And thanks to Moore's law and advances in computing, the simulations you can run are getting bigger and higher resolution, and therefore producing larger and larger amounts of data, and so on. And so the rate at which data can be produced has far outpaced the rate at which we can analyze it, or even come up with the questions we need to ask about it, okay? And this suggested a new approach to science. So let me give you some examples. We've said that eScience is driven by data more than by computation, right? So here are some examples of the size of the data that's coming out. The Apache Point telescope, the primary instrument for the Sloan Digital Sky Survey that we'll refer to multiple times in this course, produced 80 terabytes of raw image data over a seven-year period. At the time this was a pretty significant data size, and by many standards it still is today. The next generation project in the same spirit as the Sloan Digital Sky Survey is the Large Synoptic Survey Telescope. This guy can produce 40 terabytes per day, and will do so over a 10-year period, so 100-plus petabytes in total; it produces the same amount of data that the Sloan Digital Sky Survey produced over its entire lifetime every two days, okay? And so this is a pretty staggering amount of data, and it requires a pretty different approach. One thing I wanted to mention about the Sloan Digital Sky Survey: what they actually did was to take the images and cook them, right? Extract the relevant objects from them, put all those objects into an off-the-shelf relational database, in fact Microsoft SQL Server, and, critically, host this database online and serve it out over the web.
This required a pretty significant investment in infrastructure, but as a result of making all the data public and queryable, it became the most productive astronomy facility in history, right? The number of papers that have been produced on this data is on the order of thousands, and the original PIs of the project, the principal investigators, had maybe on the order of 100 papers in mind for the data; the other 4,900 papers that have been written all came from external parties writing queries against this database. So just a wild, wild success. Now the problem is that the same technology stack, and to some extent even the same approach, is difficult to apply in the case of the Large Synoptic Survey Telescope. The reason this guy is producing so much more data is not just that it's much higher resolution and can perceive a much deeper field in the sky, but also that it's returning to the same point in the sky frequently, every three days. And so this allows you to look at things that change over time: asteroids, comets, you might catch a supernova, and so forth. By comparing these images in a time series, there's all sorts of new questions you can ask, okay? So because of the science they're gonna do, because of the sheer scale, and because of some of the complexity of how the data is acquired, the previous solution won't work. And so this is motivating a whole new area of research to study data management techniques and data analysis techniques to support this project, all right? Then in the life sciences, these high throughput sequencers are capable of producing terabytes per day when run continuously. And major labs that do this work, such as the Joint Genome Institute, have 25 to 100 of these machines running all the time, right? So this is spitting out an enormous amount of data for a variety of samples: maybe individual organisms, or even samples from the environment where there's no one particular organism in there but an entire population, all right? So for a variety of uses, these guys are able to spit out the data. In oceanography, the Regional Scale Nodes of the NSF Ocean Observatories Initiative is a project led here at UW; the Ocean Observatories Initiative is a multi-institutional partnership, and the Regional Scale Nodes part is run at the University of Washington. This project involves 1,000 kilometers of fiber optic cable on the seafloor connecting thousands of chemical, physical, and biological sensors, including live video from the seafloor to monitor volcanic activity. Okay, so again, the data sets and data infrastructure required to support this effort, if not a relational database per se, are significant, and they've motivated a lot of new research. In the information space, there's a lot of science to be done directly on the web itself. And just considering the web: a single computer can read 30 to 35 megabytes per second from one disk, and so it would take about four months just to read the entire web. So you need clusters of machines working in parallel. So, summing up a little bit: eScience is about the analysis of data, the automated or semi-automated extraction of knowledge from massive volumes of data. Your main instruments for looking for answers are the algorithms and the technology, as opposed to direct inspection; there's just too much of it to look at yourself.
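By the way, you can back that four-months figure out with a little arithmetic; a rough sketch, where the web-size figure of about 400 terabytes is my assumption to make the numbers work, not a number from the slides:

    # back-of-envelope check on "four months to read the web from one disk"
    web_bytes = 400e12            # assumed size of the crawlable web, ~400 TB
    read_rate = 35e6              # bytes per second from a single disk
    days = web_bytes / read_rate / 86400
    print(days)                   # roughly 130 days, i.e. about four months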
But it's not just a matter of volume. And as we'll talk about in the next segment, this is another link back to what's going on in business, right? There's this concept of big data, and there are the three V's of big data that we'll talk about a little bit more next time. But let me just mention them here. The three V's are volume, variety, and velocity, and I'll tell you where these terms came from in the next segment. Volume refers simply to the number of rows or the number of bytes, right, the sheer scale. Variety is perhaps the number of columns or dimensions; but in science, and in the life sciences in particular, you'll have experiments that involve accessing multiple public databases, as well as multiple sensors, your own data that you've collected, and that of your colleagues. And the integration task of putting all this data together is a significant bottleneck, even when the actual scale of the data is not necessarily all that bad. Okay, so this is the complexity of the data. And then velocity: we saw with the Large Synoptic Survey Telescope that, although the scale itself is enormous, the fact that 40 terabytes are being collected every day means that the infrastructure needs to keep up with that pace. Just transferring that data from the telescope facility to the data analysis facility is an engineering challenge. Okay, and you'll also see other V's here, things like veracity: can we actually trust this data? Okay, so a bit more on that next time. So to summarize: science is in the midst of a generational shift from a data poor enterprise, where there's never enough data, to a data rich enterprise, where there's so much of it you don't know what to do with it. And as a result, data analysis has replaced data acquisition as the new bottleneck to discovery, right? It's not the cost of going out and getting the data; it's the cost of actually analyzing the data you might already have. So this is fine, but what does this have to do with business, which is probably where a lot of you are coming from and where your interests lie? Well, what we see is that business is beginning to look a lot like what's always been happening in science. Businesses are acquiring data aggressively and keeping it around indefinitely in case it becomes useful. They're beginning to hire people with training and skill sets that look a lot like what's been important in science for a long time, especially in mathematical depth. And they're beginning to make decisions with this data that are very empirical, always wanting to back up every decision with a clear case based on data. And so for these reasons, I think you can take the lessons learned in science and apply them in business, and actually vice versa as well, because one place where science is lagging behind business is in the adoption of technology; there's been proportionally a lot less spent on IT infrastructure in science than there has in business. And so it's a great time for this cross-pollination of ideas between the two fields. Okay, and so going back to that first slide I gave: eScience and data science have essentially everything in common, so we might use examples interchangeably between the two.

Okay, so I wanna spend a little time on the term big data.
And I'm not too concerned with any sort of technical definition of the term, because it probably doesn't exist. But I wanna arm you with some of the language that people use when they describe big data, so that you can speak intelligently about it when asked. Okay, so probably the main thing to recognize is this notion of the three V's of big data, which are volume, velocity, and variety. We talked a little bit about this in a previous segment. So just to repeat: volume is the size of the data, measured in bytes or number of rows or number of objects or what have you, sort of the vertical dimension of the data. Velocity, what I'll say here, is the latency of the data processing relative to the demand for interactivity. That's maybe a mouthful, but what I mean by it is: how fast is the data coming, relative to how fast it needs to be consumed? There are a lot of applications for which interactive response times are increasingly important, okay? And so when this becomes the bottleneck, when this becomes the challenge, then this velocity term starts to become pretty relevant. And then the one that I think is really pretty interesting, and is near and dear to my heart with my research, is this notion of variety. Here the problem is that an increasing number of different data sources are being brought to bear on any particular task. You need to pull data out of ASCII files, as well as download data from the web, as well as pull data out of some database, as well as use some NoSQL system, and so on. And the integration of all these different data sources is a pretty significant problem, and it can end up occupying a lot of your time. I made this point a couple of segments ago about researchers who spend 90% of their time quote handling data; this notion of variety is where a lot of that time is going, okay? So all three of these are relevant in performing data science tasks. All right, let me give you another way to look at this, and I'm gonna go back to science examples, some of which you've seen before. If you make a plot with number of bytes on the y-axis, and number of data sources on the x-axis, maybe columns of data in a single table, or columns across multiple tables, or the number of distinct data sources, you can map out different fields of study or different problems and see where they lie. Typically, astronomy has been challenged by the sheer volume of data, so it's up here, high on the y-axis; but the number of actual sources in astronomy is not too high. There's telescopes, there's these spectral images, and then there's simulations of the galaxy. So that's relatively few. In, say, the ocean sciences, and certainly in the life sciences, although I only show one example here, the variety is really more of the challenge. The actual scale is not as high as the hundreds of petabytes that can be generated by telescope infrastructures like the Large Synoptic Survey Telescope, but the number of different types of instruments you can use to acquire data is large and ever-growing, right? You have these glider systems that will go out for months at a time and kind of porpoise through the water. You have autonomous underwater vehicles that are more for short-term missions. There are oceanographic cruises where they deploy these conductivity, temperature and depth instruments that can take profiles of the water, right?
So that's at a fixed x and y and a varying z and a varying t, a varying depth and a varying time, while the gliders are varying in all four dimensions. You have simulations, which are probably one of the largest sources of information; these can be mesoscale, on the order of the entire Northern Hemisphere or the whole Eastern Pacific, or they can be models of a particular bay or inlet or estuary, connected to a river on one side and the open ocean on the other, so a much smaller scale thing. So there's a lot of diversity there. And I say stations to mean fixed stations, where a particular sensor is deployed at one location, just measuring across time. An ADCP is an acoustic Doppler current profiler, which uses sound waves, recording the time the sound waves take to bounce off particulate matter in the ocean, to measure velocity. So this gives you an entire profile of the velocities in the ocean; you can mount these on the seafloor pointing upwards, or on the bottom of a boat pointing downwards, and so on. And then there are satellite images that measure sea color and wave breaking as well. Okay, so fine. Just a little more on the term big data. The notion that Mike Franklin at UC Berkeley uses, which I like, is that big data is really relative: it's any data that is expensive to manage and hard to extract value from. So it's not so much about a particular cutoff. What makes it big? Is petabyte scale big, terabyte scale small, and gigabyte scale very small, since it fits in memory on your machine? Not necessarily. It depends on what you're trying to do with it, and it depends on what sort of resources and infrastructure you have to bring to bear on the problem. And so in some sense, difficult data is perhaps what big data really means, right? It's not so much about big; it's about being challenging. Okay, and so it's really important to remember that big is relative. So let me give you a little bit of history of the term big data. The earliest use I could find was from Eric Larson in 1989, in a piece for Harper's Magazine that eventually went into a book: the keepers of big data say they do it for the consumer's benefit, but data have a way of being used for purposes other than originally intended. So his point was not really about technology at all. It was just the observation that data is being collected for one purpose and being reused for another, which is a theme that I mentioned in the very first segment of this course and that we'll come back to over and over again. And so I think he had it right in that sense. His real point was that consumers' private data was starting to be commoditized, which was absolutely true and fairly prescient at the time, since it's become a big issue now. And it's perhaps especially impressive given that this predates the rise of the internet, and already foreshadows very topical issues in big data, such as ethics and privacy and sensitivity, that we'll talk a little bit about. But this isn't quite what we mean by big data nowadays, typically, because it didn't have that technology aspect to it. It didn't talk about the challenge of actually managing these data sets.
All right, so another point of reference: more recent reports from some of these consulting firms get credit for this notion of the three V's, but really the original source was a report from Gartner in 2001, written by a guy named Doug Laney. We've already mentioned volume, velocity and variety, but let me give you a chance to look at these quotes. On volume, he's really talking about business to business. When you think about 2001, this is around the dot-com boom, and everyone was trying to figure out what this new era of technology, what the internet, was really gonna give them, beyond just putting up a webpage and serving it out to your customers. How are you gonna be able to interact with your supply chain or your vendors, and so on? Okay, and that's what he means by this notion of e-channels. But, you know, up to 10x the quantity of data about an individual transaction may be collected. Absolutely true; this data exhaust, a point we've made a couple of times, is giving rise to a larger scale of data being collected. On velocity, well, it's about increased point-of-interaction speed, right? This is that need for interactivity that didn't used to be so required; as the velocity of all business and all transactions increases, so do the constraints on the infrastructure used to process it. And then on variety, you know, I like this one a lot. So, through 2003/2004, right? He was being fairly conservative about how far out he wanted to predict. No greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures and inconsistent data semantics. And so this is great. He could have said this through 2015 and been arguably correct; this problem has not gone away, okay? All right, so another point in the history of this term, big data. There was a series of talks, a lot of work, by John Mashey, who was formerly the chief scientist at SGI, who would talk about big data being the next wave of infrastress. What he meant by infrastress was: what's gonna really drive the technology forward? Where are we gonna feel the pain? And his point was that the I/O interfaces were where it was tough. In particular, disk capacities were growing incredibly fast, and still are, while the latencies and transfer rates are not keeping pace, right? So you can go down to a local store and buy a three terabyte drive for probably $200, but the rate at which you can pull data off of it is essentially the same as it's been for many, many years. And so now it takes you hours to actually read every byte of data that you've stored on that disk. This is a problem because, while we can keep all those data fairly cheaply, we can't actually do the analysis over them, because the pipe is so small, okay? And so this is one of the arguments; he made several very cogent arguments about where the bottlenecks in the infrastructure are, but they all revolved around this idea that big data and big data processing were gonna be the stress point. And so this is probably a pretty appropriate use of big data, although John Mashey has said that he's not sure he deserves any credit for coining the term, since it's a fairly generic term. He was using it in one way, and we use it now in a related way, but it doesn't necessarily capture everything we mean today.
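To put rough numbers on that, here's the back-of-envelope version; the sustained throughput figure is my assumption, not one from the talk:

    # time to read every byte of a 3 TB drive sequentially
    capacity = 3e12               # bytes
    throughput = 125e6            # assumed ~125 MB/s sustained read rate
    hours = capacity / throughput / 3600
    print(hours)                  # roughly 6.7 hours, and far longer for random access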
And I probably agree with Mashey on that. All right, so just one more quote about big data today, trying to capture how the term is now being used: the necessity of grappling with big data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences, arguably the key scientific theme of our times. And I like this. We've talked about science, and we've talked about the fact that what's going on in science is more or less equivalent to what's going on in business. And so I think this is nice, right? Maybe the key problem across all fields is unlocking what's going on inside big data, extracting value out of it. So fine; the final point I wanna make is, where does all this big data come from? We said this a little bit before, but one source is data exhaust from customers, right? We're tracking a lot more information about interactions with customers than we used to. It's not just about taking their order; it's about monitoring the clickstream they follow in order to get to that order. It's not just about sending advertisements to them; it's about watching how many clicks each advertisement gets, and whether the number of clicks goes up or down depending on where that ad is placed on the webpage, and so on. Another source, especially true in science but I think also true in business, and I'll give a couple of examples of this, is the availability of new and pervasive sensors. We're able to get visibility into data sources that we previously couldn't, and I'll give a couple of examples of that in a second. And then, as I mentioned already, there's the technology of data storage: the capacity of disks has gone up, and the cost of storing a byte has gone down. And so we have this ability, or at least a perceived ability, to keep everything whether or not we need it. And people are doing so, right? Things they would have otherwise thrown away, they're starting to keep, and then scratching their heads thinking, boy, how might I use this to make predictions and make better business decisions? Okay. So just a couple of examples of sensors that may or may not lead to massive data, but are examples of where we're getting visibility into data sources we didn't previously have. One is the fact that all new cars are gonna be equipped with these black boxes, a lot like what airliners have. The reason is forensics in the event of a crash, but they also record a lot of other information. And insurance companies have similar devices that you can voluntarily plug in to reduce your insurance rates; they track your speed and other aspects of your driving habits. So this technology simply would have been hard to imagine 20, 30 years ago, but now that we have it, why not actually start collecting the data from it? Okay. And so while the purpose here is pretty clear, at least for the insurance companies and for the black boxes for crash forensics, you can imagine repurposing that data for other purposes. And in fact, that's what this article is about as well: boy, is there a privacy risk here, given that they're being deployed for one purpose, crash forensics, but might be used for other purposes?
And for the insurance companies, you can imagine how they would collect the data with a particular actuarial risk model in mind, but may develop other models in the future, given that now they have this data. Okay. And so I think this is really a theme of big data: we're collecting new sources of information independent of whether we know how we're gonna use them in the future. All right. So let me give you another couple of examples, from research here at UW by Shwetak Patel in the Computer Science and Engineering Department. There are several related devices, and here I just mention two of them, HydroSense and ElectriSense. Both of these devices are intended for consumers to use to monitor their own resource consumption. So instead of having to kind of rewire your house to monitor the consumption of every device in your house, every faucet and every shower and so on, you can just clamp this device on the main water line coming into your house, and it will monitor the pressure changes associated with every individual device. It can disambiguate that flow and recognize the signature that every device puts on the pressure changes. So every time you turn on the shower in the upstairs bathroom or flush the toilet in the downstairs bathroom, this thing can identify that that happened and give you a log of the events. And by analyzing that log, you can tell where most of your water is going. Is it mostly going to showers? Is it mostly going to the dishwasher, and so on? ElectriSense is very similar, in that every electrical device in your house puts a recognizable signature on the power signal, which can be read and disambiguated to tell you where your energy is going. Okay, so these are just examples of new sensors coming on the market from which data is being derived that otherwise couldn't be derived at all.

Okay, in this segment I want to talk about the logistics of the course. The way we've organized this course is as a guided tour of important trends, along with deep dives into specific topics. And then there's a set of hands-on assignments that are intended to deliver specific skills and experiences, and that's perhaps the most important part. Okay, and so overall, the challenge here was to design a course that would be broad enough to cover the topics that we want, and also inclusive enough that we didn't have to dial it in for a very specific cohort. The flip side is that some of it is going to be very difficult for some people, and others may find some aspects of it routine. I'd be surprised if anybody finds the whole thing routine; if so, then I'd be surprised that you took this course. Okay, so the prerequisites here are pretty light, because we are trying to cast such a wide net, but they're really important. Some prior programming experience, in some language, is going to be really critical. Then we're going to use terminology from basic college statistics, or advanced high school statistics; so when I talk about linear regression, you should know what that means. You should also be able to look at some visualization of data and understand what it's telling you. And then perhaps the toughest one is to have some exposure to databases and database concepts. And, you know, if you're just starting out in college, that's not always an easy proficiency or experience to have gained.
But we're going to couch a lot of the discussion in terms of databases and in relationships to databases, and so some idea of what they are and what that means is going to be helpful. Okay, so to that end, one assignment will involve writing SQL. If you've never written SQL before, but you understand databases a little bit, you will probably be able to power through the assignment; if you're an expert in SQL, there are some parts of it that might still be interesting to you. Then two assignments will involve writing Python. One optional assignment will involve processing big data using Amazon Web Services. One of the reasons it's optional is the varying skill sets, but another reason is that you'll have to pay out of pocket for the cloud resources. The reason for that is there are, you know, 60,000 students who signed up for the course, and we can't pay for all of them. The good news is it will cost less than $10 or so. Okay, and it's optional, so if you don't feel comfortable with that, you don't have to do it, and you'll still get full credit for the course. All right. Then another assignment will involve participating in a Kaggle.com competition, using whatever tools you want. This may or may not involve any programming; it's a valid approach to compete using Excel and other kinds of GUI tools. Okay. This last bullet probably isn't true, so let's just ignore that, actually. Okay, so the learning objective here is that I really want people to come out of this course being able to talk intelligently about the landscape of data science: concepts, tools, algorithms, and technologies. And this will be a springboard for you to dive deeper into particular areas. So, for example, machine learning: this is not a machine learning course, but from here you can dive deeper by taking a machine learning course. This is not a database course, but you can dive deeper by taking a database course, and so on. Okay. Then I also want to deliver some hands-on experience manipulating data, in order to level-set people who don't have much programming experience, and to provide some specific experiences for those of you who do. For example, the first Python assignment will involve doing some sentiment analysis using Twitter data. And so if you already know Python, learning Python won't be much of the contribution of that assignment for you, but perhaps it's the first time that you've been able to work with the live Twitter stream. Okay. So the end result of this, we hope, is that you'll be an advanced beginner in a variety of data science topics. And as I said, the tough part here is how to do something more than superficial, given that data science encompasses such a broad area, as we've discussed. We think we've put together a pretty good program, but you'll have to be the judge of that. Okay. All right, so at the risk of belaboring this, the course philosophy here has been that the skills needed by data scientists span a variety of different areas: statistics, programming, databases, distributed systems, visualization. But the traditional organization of these topics is vertical, and is not ideal for getting an introduction to data science, right?
In order to get introductory level knowledge in all these areas, what you'd end up having to do is take an introductory course in seven different areas or something; a lot of different courses. Okay. And so our goal is to try to expose and simplify the links between these different areas, as opposed to narrowing our attention on what makes them unique, okay? All right, so after taking this course, you will not be an expert in statistics. You will not be an expert in machine learning, certainly. You will not emerge an expert in databases, or even in NoSQL, nor will you have programming versatility in all of these languages. However, you will use all of these tools, you will understand the basic concepts behind them, and you will have applied many of them, okay? For assessment, there will be short online quizzes during the lectures, of which you've already seen some, these finger exercise quizzes. There will be a set of full-length offline assignments, as I mentioned. Some of the programming assignments will be graded automatically; assignments that don't lend themselves to auto-grading will be assessed using the peer assessment tools. An example of that is that you're gonna write up a description of your Kaggle solution, in addition to submitting your score for the Kaggle competition, and other students are gonna grade whether it's comprehensible or not, okay? So here's my background in one slide. I have a bachelor's degree in industrial and systems engineering from Georgia Tech, but all the problems in industrial engineering tended to be about optimization and automation, which seemed to require software, so I got more interested in computer science. I spent a couple of years consulting with some big firms: Schlumberger does oilfield services, Siebel does customer relationship management software, you're probably familiar with Microsoft and Verizon, and Deloitte is a management consulting firm. Then I went back to grad school and got a PhD in computer science, working with oceanographers on query systems for large scale oceanographic models. Then I spent a couple of years working directly with oceanographers as kind of a data architect, before coming to the University of Washington, where I now lead a group in scalable data analytics at the University of Washington eScience Institute and also hold an affiliate assistant professor position in computer science and engineering. So there's a bit of a mix of very practical, applied work as well as my research agenda, and so I think this data science trend that's occurring strikes close to home for me. I think it's a great time for it.

Okay, this is a walkthrough of the first assignment for Introduction to Data Science, where we'll be using Python to process some Twitter data. So the first thing to do is to go to the link where you can get the class virtual machine, and make sure you have that if you're gonna use it, or make sure you've installed Python if you are not using the virtual machine. Okay, and I encourage you to read this material about the assignment and about Python, especially if you're new to Python, but I'm gonna scroll right down to problem zero.
Some of these steps, problem zero included, should be pretty straightforward for you if you've used Python, especially if you've used it extensively, but there'll be some steps in this assignment that should still give you a challenge. And if you're new to Python, this is a great way to get warmed up. So I've opened the virtual machine here. There were a couple of things I had to confirm to allow VMware to convert it to VMware format, but it wasn't too hard; you can also use VirtualBox, which some folks have done. Okay, so here I am in the home directory, and there are some things that have been automatically exported by VMware from the host machine. You can ignore those; we have this datasci course materials directory, and I'm gonna go into that. The first thing to do every time you start an assignment is to do a git pull, to make sure you have all the most recent changes to the assignment. This allows us to fix bugs at the last minute and not have to worry about how to get you those changes. Okay, so I'm already up to date, so I can just get started. We can look at what's in there; I'm gonna go to assignment one, which is what this assignment is, and scroll down to problem zero to see what it asks me to do. Okay, so the first thing here is that we're gonna access some tweets through the public API that doesn't require any authentication. This is pretty much identical to just going to the Twitter website and typing in a search term, in this case Microsoft, and I'm gonna do that programmatically by copying this little snippet in. So we're gonna need a text editor for this. I typically wouldn't use a GUI text editor, but I'm going to in this case; I think it might be the most democratic approach, and this text editor called gedit is already installed, so that's what I'm using here. So I've just pasted this in. You may or may not be able to paste into a virtual machine without some configuration; I've done that configuration, which, for VMware, amounts to installing VMware Tools. We're not gonna go into that here, but hopefully you can do that. If not, then you can type it in; it shouldn't be too hard, and that's really the only thing you're gonna have to copy and paste. The rest you'll be typing anyway. Okay, so I'm gonna save that in assignment one and name it print.py. And now I'm gonna go back over to the terminal window and run it. Now, I'm choosing not to use a Python IDE, a development environment. You certainly can if you want. I typically work this way for a variety of reasons. One is just that it's a little bit more portable when I find myself in an environment that I don't have a lot of control over or can't configure; once I get hooked on a development environment, it becomes awkward to not use it, so I prefer to just not be hooked on any of them. But if you have one you like, by all means use it. Okay, so there I just ran print.py, and I see a bunch of junk on the screen. So let's go back to the code and see if we can understand what this junk is. What this line is doing is getting the response from the website; it's very simple, just a one-liner to open the URL and get whatever the website sends back to us as a string. And then this library parses that string, because we're asserting that it's in a format called JSON, and this json library knows how to understand that. And so what we get back is a Python object that has all this complicated structure in it.
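For reference, the snippet amounts to something like the following; a rough sketch from memory, written in the Python 2 style of the course VM, where the exact search URL and the variable names are my guesses rather than the assignment's:

    import urllib2
    import json

    # fetch search results for "microsoft" from the old, unauthenticated search API
    url = "http://search.twitter.com/search.json?q=microsoft"
    response = urllib2.urlopen(url).read()   # the raw string the site sends back
    py_response = json.loads(response)       # parse the JSON string into Python objects
    print py_response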
So all this complicated structure was there in a string, and now we've parsed it into a complicated Python object. What kind of complicated object? Let's take py_response and check its type to understand what we're looking at, and print that out. And it says it's a dict. Okay, so a dict maps keys to values, and to access the keys in the dict, we can use a method called keys. I'll do that now. And so this object appears to represent a page of results. There's a page thing that might be the page number, there's an indicator of the next page, and so on, and something that could be the original query you sent. In fact, I'm somewhat guessing at the meaning of some of these; you can read the Twitter API documentation to try to sort this out, or you can just manually inspect it. But this results key sounds intriguing, so let's take a look at that. How do we actually access the value for a particular key? We use the square bracket notation and pass the key name as a string. And in Python, you can use double quotes or single quotes interchangeably; I've used double quotes here. So let's print that guy out. All right, so there's more junk. So results is some complicated data structure. Let's do our standard trick here and see if we can understand what it is. I'm gonna check the type of this thing, and I'm gonna comment this other line out; for comments, you use the little hash character. And so that's a list. So what can we do with a list? Well, the first thing I'm gonna do is put it into another variable, just to kind of keep myself sane. So these are now the results, and I'm going to print out the first element of the list. Here we're using the same square bracket notation, but instead of passing a key name, which is what dictionaries support, I'm passing an index number, which is what lists support. Okay. And you can look at the Python documentation; there are some links in the assignment to learn about the methods supported by both the dictionary and the list type. Okay, so now we print this guy out and we get one element of the list, which itself looks like, yet again, another complicated object; that's gonna get annoying. I happen to have a feel for this: you can recognize that it's a dictionary because of the curly braces. But we don't need to guess; we can just check it once again. So what is the type of this list element? A dict. All right, same thing. Let's look at its keys. And notice I'm not caring too much about software engineering here; we're just trying to inspect some data that we pulled from the web. This is something that comes up quite a bit, I think, in these data science tasks: you're looking at some data for the first time, so it's not necessarily the time to get too obsessive about software engineering, because there may be no one who's gonna look at this script besides you. You're just trying to get a feel for things. Okay, so there I've printed out the keys for this dictionary, and I see some things I don't understand, but I would argue that text looks interesting. So let's take a look at the text of the tweet; we're presuming that this is indeed a tweet. And here, this is a little bit ugly: I'm not saving the result in a variable, I'm just gonna print it out directly, but this is okay. So what's the type of this expression?
Well, that's a list, and we know that with a list we can access it with these square brackets and a number. So what's the type of this expression? Well, that's a dict, and we know that we can access those with square brackets and a key name. And so the type of this whole expression is whatever's in text, and we haven't actually looked at that yet. So let's do that now. And that looks like a tweet. Okay, and we can try out another one. And that's a different tweet. And if we wanna loop over all the tweets, we can do that too: we can say for i in, use this built-in range function, and now we can print out all the tweets from the first page, one per line, which is actually what the assignment asks for. Now, for this one, you don't turn anything in. That was just a warm up, especially for folks who haven't really looked at Python before, but that's how you can inspect a complicated data structure and try to make sense of it without doing too much work. So now we're gonna get access to the actual 1% live stream. The data so far was, again, just like going to the website, typing in a search term, and looking at the results. But that way we can't get an actual sample of all the tweets that are coming in. For that, you need to actually log in to your account and do a little bit of configuration. The steps to do that configuration are here in the assignment, and we'll walk through them real quick right now. By the way, if you're not using the virtual machine, you will need to install the OAuth2 library; make sure you notice that. Okay, so you first need to create a Twitter account if you don't already have one. Then, once you do, you can navigate to this URL and create a new application. I've got a couple here; you won't have those, but you can create one with that button. Okay, so I'm gonna name this, how about, assignment one walkthrough; why don't we do something like what you might do. And the description: this is a data science assignment Twitter application. What you're doing here is registering an application that's going to consume this Twitter feed. Now, we're just gonna write a couple of scripts with it, but imagine you were building some sort of website that was gonna do some analysis. They wanna know about it and have a little bit of metadata about who you are and who you're authenticating as, so in case you do something strange, they have a little more information about what you're doing with it. Okay, so here you can put in basically any URL; this is again just a way to help them follow up in case you do something untoward. I'm just gonna put in my homepage, and you can essentially put in anything. Actually, just to be clear, it literally can be almost anything; it's not even gonna be validated. Okay, so fine. There's an agreement here and a captcha; I'm gonna agree to that, type in this guy, and then create the application. Whoops, that name's already been taken, which makes sense. So why don't we cook up a nice complicated name; I don't care too much about it, since we're not actually building an application with this. It's actually interesting, I didn't realize the name had to be globally unique, unless I've already created one. Okay, so now we have some information and some credentials here, but there's one more step.
We actually have to authorize this application to use the Twitter data. I'm not going to go into the details of OAuth here; it's somewhat complicated, but necessarily so, to support the requirements of web-based authentication. You can read about it if you're curious, but I'm going to just blindly follow the instructions in this walkthrough. Okay, so it's been created. It gives me this little green message, but it takes a second to come back. So you're going to stare at this and nothing's going to happen, and I'm going to refresh, and there it is, okay. So don't be alarmed if it doesn't immediately come back; it takes a second. Okay, so now we have two pairs of credentials: a consumer key and a consumer secret, and an access token and an access token secret. We're just going to plug these in blindly to this file, twitterstream.py. Right at the top of the file, you've got placeholders where you can put them, so let me put those in. So there's those, and the rest of this file is not something you need to edit; it's just manipulating the OAuth protocol to get the results. So I'm going to save that and then run it: python twitterstream. And sure enough, I'm getting tweets from the live stream, and this will just run forever, constantly streaming down tweets. But we don't want to send them to the screen; we can't do anything with them there. We'd rather send them to a file. So I'm going to hit Ctrl-C to cancel the stream, and then I'm going to redirect this to an output file. I happen to know that the format is, again, this JSON format, which stands for JavaScript Object Notation, so I'm going to name the file accordingly. I'll hit return, and it will sit there and silently blink at me forever, so I'm only going to leave it running for a little while. What I've asked for in the assignment is to let it run for 10 minutes or so, so you can get a fairly big data set. There's a step further down in the assignment that requires a reasonably sized data set to really be meaningful, and 10 minutes should be enough; if you want to run it longer, you can. In an early version of the assignment, I actually asked people to run it for hours and hours, but that turned out to be a little unwieldy, so I pulled back from that. Okay, so that should be long enough. I'm going to hit Ctrl-C and look at what we got. So now I have a four-megabyte file called output.json, and I can look at the first 20 lines of that file with the head command. That's what you're going to turn in for this assignment: the output of head, the first 20 lines of the file. So I'm going to pipe the output of that command into a file for turning in, and you can submit it on the Coursera website; I'm not going to walk through that step, in the interest of time. Welcome back. I want to talk a little bit about data models as a run-up to talking about databases over the next couple of segments. So this is a data science course, so we know we have data, and the first question to ask is just: where is it stored? How do we store data? One way of interpreting this question is to talk about technology: what technology do we use? Well, we use magnetic media, and more recently we might use solid-state drives. Both of these have the property that the data persists even when the power goes off; the data is safe even when the power is not on. Okay, so: non-volatile storage.
And so a lot of the work in databases is about how to work with data stored on these non-volatile storage media. But another way of interpreting the question is a little different, a little more logical: how is the data organized? In other words, what is the data model we're using? It's not just bits on a disk or bits in a file. Well, on your personal computer you might store data hierarchically, arranged in these nested folders. That's one organization of data, and the data model there is tree-like. Another is rows and columns, and that's what we'll talk a lot about in this course. In this case it's an ASCII file, and these are hits from a biological database, matches in a biological database for a particular sequence. And of course you might have spreadsheets that are a little funny: maybe they look a little bit like rows and columns, but maybe they don't; you have, say, an embedded table within a spreadsheet, and so on. So the idea is to think about what data model is being applied whenever you think about data. It could be a tree, it could be a table, it could be something unstructured like this grid, or it could be a graph, and we'll talk a little bit about that in the future. All right, so in general, what is a data model? Well, there are three components you should remember. One is that there's some notion of structure: in the case of tables, it's rows and columns. Second, there's some notion of constraints: what are the legal structures you're allowed to create? For example, in a tabular data model you typically have the constraint that all the rows have the exact same number of columns. You might also have constraints on the values themselves, such as: every value in this column must be an integer. And you can even have more semantic constraints, such as: every value in this column must be within a certain range of numbers, because it represents, say, days of the year. And the third component is the operations. This one is sometimes thought of as separate, but I really like to say the data model is all three of these things: the structures, the constraints that define valid instantiations of those structures, and the operations that those structures support. Okay, so let's see some examples. Your structures might be, as we mentioned, rows and columns; nodes and edges if it's a graph model; key-value pairs, which have been popular with the NoSQL movement; or just a sequence of bytes, which might be all the structure you have if you're working with bare files. The constraints you might imagine are: all rows have the same number of columns, as I mentioned, or all values in one column have the same type. For a hierarchical model, you might have the constraint that a child cannot have two parents; this is what says, for example, that one file cannot be in two folders at the same time in the data model of your file system. And then there are the operations that are supported. Maybe you can look up the value given a key x, as in these key-value-pair data models; that's one of the primary operations there: if I give you the key, you give me back the value.
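Just to make those three components concrete, here's a minimal sketch of a toy tabular data model in Python; the table contents and the specific constraints are invented for illustration.

# Structure: a table as a list of rows, each row a dict of column -> value.
table = [
    {"last_name": "Jordan", "day": 12},
    {"last_name": "Smith",  "day": 300},
]

def check_constraints(rows):
    # Constraint on structure: every row has exactly the same columns.
    columns = set(rows[0])
    assert all(set(row) == columns for row in rows)
    # Constraint on values: 'day' is an integer representing a day of the year.
    assert all(isinstance(row["day"], int) and 1 <= row["day"] <= 366 for row in rows)

def lookup(rows, column, value):
    # One supported operation: find all rows where a column has a given value.
    return [row for row in rows if row[column] == value]

check_constraints(table)
print(lookup(table, "last_name", "Jordan"))

That lookup function is exactly the kind of operation we're about to talk about for tabular data.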
For a tabular data model, you might say: find me all the rows where a particular column has a particular value; in this case, where the column last name is equal to the value Jordan. And with a file, there aren't too many operations to think about: you can essentially read the next few bytes, seek to another position within the file, and open and close the file. That's about all the operations that are supported. Okay, so any time you see data, especially on non-volatile storage, you can think about what operations are supported, what constraints there are over the structure, and so on, and this gives you the data model. All right, so what is a database? Well, one definition that I think is pretty adequate is this one: a collection of information organized to afford efficient retrieval. That's a pretty general definition. It doesn't say anything about tables; it doesn't say anything about relations. So when you think about a database, don't necessarily assume a relational database; it's perfectly reasonable to talk about a database that has nothing to do with relations. But it's not just a pile of data, either: it's organized to afford efficient retrieval. Another view of a database is this idea of a schema. Jim Gray, in this Fourth Paradigm book that we talked about in the e-science segment a little bit ago, has this quote: when people use the word database, fundamentally what they are saying is that the data should be self-describing and it should have a schema. That's really all the word database means. And this goes back to the notion of a data model: there's a structure there, and in fact some constraints and some operations, and all three of these are things you can intuit by looking at the data itself. It needs to be self-describing. So if I have a file of data that's organized into rows and columns, somewhere I'm able to inspect which columns it has, how many rows there are, and so on; I need to be able to understand how to read this data just by looking at the data itself. In a database, for example, there will be a catalog; there will be an explicit schema. So this idea that the word database is synonymous with self-describing data, data equipped with a schema, is a pretty common view, and it's an important one to keep in mind as we go through this course. So let me give you another view of a database, motivated by this question: why would I want one in the first place? What problem do they solve? Well, there are maybe four issues you might run into where putting your data into some kind of database, broadly defined, can help, and these are the ones I like to talk about. One is sharing data. Once you have multiple users trying to access the same pile of data, some sort of infrastructure or interface to manage concurrent access starts to become required, and this is something that, I would argue, all real databases afford. Another is the enforcement of a data model. You might say: look, I use a rows-and-columns data model, or I use a hierarchical data model; but there needs to be some software that enforces that, and this is something databases can do. Now, remember, the data model is not just the raw structure, the parents and children, say, with trees, or the rows and columns.
It's also the higher-level constraints, such as: this must be a number from one to five, or this must be a day of the week, or this must be one of the customers that already exists in another table. These kinds of constraints are difficult to enforce in the application layer, and I'll argue this more in a little bit. Okay, the third reason might be scale. We have a pile of data and some data model floating around, but once it gets over a certain size, or we get a certain number of instances of this data model, we're going to want specialized algorithms to work with it, and writing all these specialized algorithms ourselves to traverse larger and larger datasets becomes the bottleneck. A database can organize these algorithms and expose them through convenient mechanisms. We talk a lot about complexity-hiding interfaces; databases provide a complexity-hiding interface for large-scale data. And I think the fourth one is flexibility. You might have written some software to access your set of files in a particular way, but as soon as you have to access the data in some way you didn't anticipate, you have to rewrite a bunch of code. What databases try to do is anticipate a broad set of different ways of accessing and working with the data, and support all of them. Okay, and I'm talking pretty abstractly here, fairly deliberately, because all of these things are certainly true of relational databases, which we'll talk about, but I think they ought to be true of anything that earns the term database, and they're certainly true of other systems as well. We want to keep these things in mind when we think about NoSQL systems and key-value stores and graph databases and so on. So it's not just relational databases; this is broader than that. All right, so in general, when you're looking at these different systems and thinking about the data storage layer, which is sort of where we're starting in this conversation about data science, ask: how is the data physically organized on disk? What kinds of queries, what kinds of operations, are going to be efficiently supported by a particular organization, and what kinds are not? One immediate question, this third bullet here, is: is it hard to update the data or add new data? That's one quick way to split all the different operations you can do on the data: reads versus writes. In many cases, you'll find that the organization that makes it efficient to read the data is not the same one that makes it efficient to write the data, and that trade-off is at the heart of a lot of the system design challenges in databases and other large-scale systems. But there are other kinds of operations as well, as we discussed: do you want to look up by key? Do you want to look up by some other field? And so on; we'll talk about this in future segments. And then: what happens if I encounter new queries that I didn't anticipate? Do I need to completely reorganize the data? Do I need to write a bunch of new code? How hard is this? These are the evaluation criteria you're going to use when you're thinking about choosing a platform for your data science task, or at least some of the questions; they're perhaps not all of them.
So if, broadly, your choices are a pile of files, an off-the-shelf database, a NoSQL system, or something else, having these questions in mind as you evaluate the pros and cons is going to be really important. All right. So last time we talked about data models, used them to motivate databases, and defined the term database in a broad sense. Now I want to build on that to talk specifically about relational databases. Last time we talked about questions you can use to reason about different ways of organizing data and evaluate them with respect to your requirements, and I want to apply those questions to examples of the kinds of databases you saw in the past, the ones that end up motivating the relational model. In particular, the reason I want to go through this historical view is that you see some of these same designs being proposed in the form of NoSQL systems, and some of the same issues come up; the pros and the cons are still there. So it's good to have a historical perspective when you're evaluating these modern systems that are becoming popular. The questions we talked about were: how is the data physically organized on disk? What kinds of queries are efficiently supported? How do you update things? And so on. All right, so one example is what I'll call a network database, although arguably this is just pre-databases, where you just had files. Going back to our questions, you might ask: how is it physically organized on disk? Well, in a parts-and-orders model, you would have an order record, and associated with that order record would be an address that physically points to the first part associated with that order. That part would point to the next one, and so on. Another field in the record would point to the customer that made the order. So, going back: what kinds of queries are efficiently supported? Well, if I want to find all the parts associated with an order, I can do that pretty efficiently: I have a given order, and I just walk down this chain to gather up all the parts. What kinds of queries are not supported efficiently? Say I want to find all the orders that involve a particular part, all the orders that involve this washer. Now I have to scan every order looking for it. There are some ways around that, by adding back pointers and so on. Another problem with this file-oriented, proto-database model is what happens whenever I want to make a change to the data. Say I want an extra field added to support the billing customer as opposed to the shipping customer. Well, I've just added a new field; I've extended the length of the record. That means everything below that record needs to be moved. More importantly, all the programs that navigate this structure now need to be aware of the new field; they all need to be rewritten to accommodate this extra piece of data. Moreover, if you want to support different access methods, as we talked about, say looking up by part to find all the orders, I end up having to make a complete second copy of the database. And now when I make a change to one copy, I need to make the change to all copies, and you can imagine how the space of possible copies might get pretty big.
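To see why one direction is cheap and the other is expensive, here's a minimal Python sketch of that chain-of-pointers layout; the record formats and contents are invented for illustration.

# Each order points at its first part; each part points at the next.
parts = {
    "p1": {"name": "washer", "next": "p2"},
    "p2": {"name": "bolt",   "next": None},
}
orders = {
    "o1": {"customer": "c1", "first_part": "p1"},
    "o2": {"customer": "c2", "first_part": None},
}

def parts_for_order(order_id):
    # Cheap: just walk the chain starting from the order record.
    result, part_id = [], orders[order_id]["first_part"]
    while part_id is not None:
        result.append(parts[part_id]["name"])
        part_id = parts[part_id]["next"]
    return result

def orders_for_part(part_name):
    # Expensive: no back pointers, so scan every order and walk every chain.
    return [oid for oid in orders if part_name in parts_for_order(oid)]

print(parts_for_order("o1"))      # ['washer', 'bolt']
print(orders_for_part("washer"))  # ['o1'], found only by scanning everything

The first lookup just follows pointers; the second has to touch every order in the database, which is exactly the asymmetry described above.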
Okay, so a partial solution to this problem was the notion of hierarchical databases, exemplified by IBM's IMS system, which actually still exists and still has customers. Here the idea was to organize data in terms of segments, but the logical model still had the hierarchical flavor that we saw in the network model. Here I've switched things around: I've made the top-level access path the customer instead of the order. So logically, an order is only located underneath a customer, and a part is only located underneath an order. However, given that they're in separate segments, I can make a change to one segment without breaking all the code that accesses other segments. The downside that still exists, though, is that the programmer, the application developer, still needs to understand this hierarchy in order to find anything. They have to know exactly how things are organized; for example, that orders appear under customers. So you still have to anticipate what kinds of access methods your customers are going to want, and design for those. Updates here are a little easier, given that I can add an order to one segment stored elsewhere without affecting all the other structures, and I can even add a field by making changes only to the orders, as opposed to changing everything. Moreover, the software layers on top of this were able to insulate applications from those kinds of changes with some reliability: a new field would only be passed back to the client when the client actually needed it. So there's some measure of what I'll call data independence, and we'll talk about that a little more in a few minutes. Okay, so moving toward relational databases: one view of what a relational database really is comes from Curt Monash, an analyst for the database industry. He says: relational database management systems were invented to let you use one set of data in multiple ways, including ways that were unforeseen at the time the database was built and at the time the first applications were written. And I want to emphasize that this is the key idea of relational databases, not SQL and not some of the other things you may associate with particular implementations. It's really about organizing the data in such a way as to support unforeseen access methods: querying in ways you didn't anticipate when you organized the data, and insulating applications from changes. So what is a relational database? Well, at the simplest level, everything is a relation, which is synonymous with a table. Everything is rows and columns. And this probably doesn't need to be made explicit, but let me do so: every row in the table has exactly the same columns, the same number of columns, and the values also have the same types. So if a column holds an integer in one row, it needs to hold an integer in all the rows. A consequence of this model, of everything being a table, is that you don't have pointers anymore. You don't have physical addresses; all you have is tables, and so relationships between different data items are implicit. Here we've switched the domain to courses and students: this table, we'll call it takes, says that a student takes a course, and this is a student record. Instead of having a physical pointer from the course record back to the student, we just have a shared ID.
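Here's that idea as a minimal runnable sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration, and we're getting a bit ahead of ourselves by using SQL, but it shows the shared ID working in both directions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER, name TEXT);
    CREATE TABLE takes   (student_id INTEGER, course TEXT);
    INSERT INTO student VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO takes   VALUES (1, 'CS101'), (1, 'CS201'), (2, 'CS101');
""")

# One direction: names of students taking a course, matched on the shared ID.
print(conn.execute("""
    SELECT s.name FROM student s, takes t
    WHERE s.student_id = t.student_id AND t.course = 'CS101'
""").fetchall())

# The other direction: courses taken by a student, the exact same mechanism.
print(conn.execute("""
    SELECT t.course FROM student s, takes t
    WHERE s.student_id = t.student_id AND s.name = 'Smith'
""").fetchall())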
The only relationship between these two data items is the fact that they both have the same value in a particular column. And intuitively, this sounds really bad for performance right off the bat. If I want to look up all the student names associated with a particular course, then once I have my course, I need to go look up in this table all the values that match, as opposed to navigating directly to them, which I could do with the hierarchical method. But if I want to go the other direction, it's the exact same process: if I want to find all the courses a student has taken, I can do so the same way. I do have to do the lookups, which maybe is a cost in performance, but the mechanism by which I look things up is the same in both cases. Moreover, everything is stored only once, which is a feature the hierarchical databases were able to achieve as well in most cases, but the network databases were not. We don't have multiple copies of things lying around. All right. So the philosophy here, to be cute about it: the quote from the 19th century is that God made the integers; all else is the work of man. Well, Codd made relations. That's a reference to Edgar Codd, who wrote the first relational database paper and went on to win the Turing Award for his work, which is sort of the Nobel Prize of computer science. Codd made relations, and all else is the work of man. So everything is a table: that's the number one thing to remember about the relational data model. Everything is a relation. All right, so let's break here, and I'll pick up with this slide next time. Okay. So let's talk about relational databases. The history here, which I hope I motivated last time, is that pre-relational, if your data changed in some significant way, if you needed to reorganize things, your application broke. If you changed the parent-child relationships in the hierarchical model, or if you did pretty much anything in the network or file-oriented model, your applications had to be rewritten to support it. Early relational databases addressed this issue, and even though they were buggy and sort of slow, they required only about 5% of the code you had to write previously, and so this was an enormous win. This quote, following on from the Curt Monash quote in the previous segment, is from the original paper on databases by Ted Codd: activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed, and even when some aspects of the external representation are changed. The reason I want to emphasize this is that, again, this is the key idea of relational databases, not SQL and not some of the other features you associate with particular implementations. It's really this notion of data independence. And this was right there in the abstract of the original paper; this is the key idea. The reason I'm hitting this so hard is that this idea is still just as important now as it was then. All right, so I want to go through some of the other key ideas that are associated with relational databases, whether or not they were in the original paper.
So one key idea is that programs that manipulate tabular data exhibit an algebraic structure that we can use to reason about them and manipulate them at the logical level, independent of any physical data representation. What I mean is that if you think in terms of tables and the operations that tables support, you can think about what your program means, and even how to optimize it, which we'll show, regardless of how the bits are actually organized on disk. And this is incredibly powerful. So the key idea here, again, is physical data independence, and we'll talk about what logical data independence means in the next segment. The programs you write to manipulate data no longer have to manipulate files and chase pointers around; you can access the data through a high-level language, in this case SQL, though again, it doesn't have to be SQL. The point is that you're manipulating logical structures called tables. So just know the term physical data independence, and know that it means the programs you write to manipulate data are more robust than they would be without the relational model. Another key idea is this algebra of tables that I mentioned. We'll talk about the specific operators in a bit, but at a high level: one operation on a table is to select out the rows that satisfy some condition. Another is to ignore the columns you're not interested in. Another is, given two tables, for every record in the first table, find the corresponding records in the second table. Select, project, and join. And there are other operations you can define, such as aggregation, plus all sorts of operations derived from set theory: union and difference and cross product and so on. If you write your expression in terms of these operations, it's very clear what it means, and for software engineering purposes, it allows the database designers to focus on just implementing these operations efficiently. Now, if I had you in a classroom, I'd ask how many people have heard of the relational algebra, and also how many people have worked with databases. Typically the number of people who've worked with databases is very high, and the number who have heard of the relational algebra is somewhat lower. That's one of the things I hope to fix in this course: to equate the two. If you understand databases, I want you to understand relational algebra; and vice versa, I guess, comes for free. Okay. So why do we care about this algebra? Why am I even saying algebra? Well, when I'm giving a talk using this slide, another thing I'll ask is how many people have heard of algebraic optimization, and typically very few have, even if they're computer scientists, unless it's a room full of database people. But the thing is, you already understand what this is; you don't have to know databases to know what it is. It's just something you learned in high school algebra class. So forget tables for a second, and just think about integers. I've got this expression, say (z * 2 + z * 3 + 0) / 1, and I want you to evaluate it when I tell you z is equal to four. One thing you might do is say: well, four times two is eight, and four times three is 12, so that's 20; I add zero, which doesn't change anything; and then I divide by one. Fine.
But if you're clever, you might notice that adding zero to any number doesn't change it at all, so you can ignore that altogether. Similarly, dividing any integer by one gives the same number, so you can ignore that as well. And if you're really clever, you might notice that there's a distributivity law here: when you see this pattern, you can pull out the multiplication. By applying these rules in turn, including a commutativity law that allows things to be reordered, you can simplify the expression down to z * (2 + 3). That just says: two plus three is five, multiply five times four, and you get 20. And you've done fewer operations, only two instead of five, and you didn't have to do division, which is potentially an expensive operation if you're thinking about a computer evaluating this. Now, do computers use this kind of symbolic reasoning when they evaluate expressions over integers? No, and the reason is that this kind of symbolic reasoning is much, much more expensive than just evaluating the darn thing. So fine. But if the objects you're manipulating are not small integers but terabyte-sized tables, then this kind of symbolic reasoning is not only valuable, it's absolutely critical. If you do things in the wrong order, if you do wasted work, or if you do more operations than you need over massive tables, you're dead in the water and you'll get nothing done. So all relational databases do this kind of algebraic optimization when you write a query. If you're familiar with SQL: your query gets translated into a relational algebra expression in terms of selects and projects and joins, and then it's manipulated according to algebraic rewrite rules, just like the ones you learned in algebra class. That's why the term algebra is there. The database attempts to simplify the expression. I pause on simplify because it's perhaps not the right word: it's not always true that the shorter expression is the faster one. Databases actually use a notion of cost-based optimization, which means they try lots of different equivalent expressions, assign each one an estimated cost, and choose the one with the lowest cost. And this is something all relational databases are doing in one form or another. Okay, so fine. This is the magic trick of query processing in relational databases, and it is a really, really great idea. The reason it works is that we understand very formally what these operations are and what they mean. When you relax this formal model and start allowing anybody to write any kind of code they want over the data, you lose the ability to do this kind of algebraic optimization, and you leave it up to the programmer to write the best possible algorithm. What I'm hinting at here, and we'll talk about it more later, is that when you write large-scale data processing pipelines in something like MapReduce (and if you haven't heard of MapReduce, don't worry, we'll talk about it), you're leaving all the work up to the programmer: not only writing the logic, but also doing the optimization. And that can impose a penalty. One final comment: the term algebra isn't just trying to connote algebra from high school; it literally is the same thing.
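As a small aside: if you want to watch this kind of symbolic rewriting happen mechanically, a computer algebra system will do it. Here's a minimal sketch using sympy, assuming you have it installed; it's the same flavor of reasoning, though of course not how a database actually implements its optimizer.

import sympy

z = sympy.symbols("z")
expr = (z * 2 + z * 3 + 0) / 1   # the expression from the example
print(sympy.simplify(expr))      # prints 5*z: same meaning, fewer operations
print(expr.subs(z, 4))           # prints 20

A database optimizer plays the same game, but with rewrite rules over tables instead of integers.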
And so when you hear the word algebra, what you should be thinking of is this notion of algebraic closure. What I mean by that is: every operation that applies to a table also returns a table, so I can chain these operations together and always get tables. That's the exact same idea that's going on when you talk about operations over integers or real numbers. You might quibble and say: well, if I divide an integer by another integer, I may get a real number. That's true, but there are notions of multi-sorted algebras with different types involved. The point is that this notion of closure, that you can't escape the system by applying operations, is always there when you hear the term algebra. So we're not making things up. Fine. So here are some relational algebra expressions that, if you squint hard enough, look like similar expressions over integers, except that instead of addition and multiplication we have things like joins and selects. What this one says (I haven't shown you the query, so I don't expect you to see this immediately) is: select certain values from a relation R, and here, select other values from the same relation R. That's okay; the relation R can appear in two different places in the same expression, no problem. Then join them together; then select still other values from R and join with that. One way of evaluating this is to perform this join first and that join second, as indicated by these parentheses. Another way is to perform that join first and this one second, indicated by the other parentheses. Still another plan is to take the full cross product of all three relations. I haven't told you what a cross product is yet, but it generates an enormous amount of data, and this is actually a pretty bad plan that you would never want to evaluate. But you could, and it's provably equivalent to the other plans, so you know it returns the same answer. And now all we have to do is figure out which one of these is likely to be the cheapest, and choose that one to run. This is the kind of reasoning that all relational databases do internally whenever you write a query. Okay, so where are we now? We've given an overview of data science itself, and one of the things we talked about was this important aspect of data munging, the manipulation, cleaning, restructuring, and so on, that is perhaps ill-defined but is kind of what keeps people up at night when they're working on data science problems. We also gave an overview of relational databases, a history of relational databases and why they came into being in the first place, and we found that the original problem being addressed was physical data independence: when aspects of the data changed, all the applications broke, and you wanted to insulate applications from certain kinds of changes. And one of the tricks here, the secret sauce of relational databases, is this algebra of tables that allows you to reason about data manipulation tasks independently of the grubby details of the physical representation.
Okay, so this idea will come up over and over again, even outside the context of, say, Oracle and Microsoft SQL Server and IBM DB2 and so on. You don't have to be talking about a commercial flagship relational database system to make use of this relational algebra, and we'll see that. So I want to spend some time in this segment, and probably the next few segments, on understanding the relational algebra. At times this may look like more of a theoretical exercise, but I promise you it's not. There's an entire database course offered here at the University of Washington, and everywhere else, that I think is a great idea to take; we're taking the segments of that material that are demonstrably practical in a data science setting. I'll also mention that many of the slides in the next segment or two came from the Introduction to Data Management course developed by Dan Suciu and Magdalena Balazinska, which is taught here at the University of Washington. Okay, so the relational algebra operators that we hinted at, but maybe didn't list explicitly, are these. They include the set operations, lifted to work on relations; we'll see examples of that. And then the big three are selection, projection, and join, and we'll talk about the meaning of those. Then there are the extended relational algebra operators, which have to do with manipulating duplicate tuples. On the next slide I'll explain where duplicates come up and why it's important to distinguish between working in the presence of duplicates and working without them. These include an operator to eliminate duplicates altogether; there's a group-by operation, which you may be familiar with if you work with SQL, that appears here; and then you can sort, and so forth. These are extended in the sense that (I shouldn't have said just duplicates) sorting, for example, doesn't deal with duplicates at all. They're extensions of the relational algebra away from the pure set-theory-based model. For example, a set of objects doesn't have any kind of order applied to it, and yet we're allowed to sort things in SQL. It's practical: applications want to be able to define what order the tuples come back in. But it's not part of the formalism; it was added in afterward. And so that's the split between the pure relational algebra and the extended relational algebra. This is probably as close to the theoretical underpinnings as I'm going to care to get. The difference between these two comes up a lot when you're trying to prove properties about the formalism, because the extended relational algebra is much more difficult to prove things about, if you can at all. But as a practical matter, the difference between these two classes of operators is not particularly important. The takeaway is that there's a rich set of operators, but if someone says "the relational algebra" to you, the first thing you should think of is the set operations plus selection, projection, and join. All right. So, this notion of sets versus bags, the duplicate question. First of all, what is a set? A set is a collection of objects with no duplicates, and a bag is a collection of objects where duplicates are allowed.
So up here, a is not repeated at all in the set, but it may be repeated in the bag, and whether that's legal or illegal is what gives you the semantics of a set versus a bag. You can define the relational algebra in terms of either of these two semantics: you can define it over sets, or you can define it over bags. And this notion of an extended relational algebra comes from the need to work with bags, as well as other things like sorting, as I mentioned. The rule of thumb here, and I really want to mention this, is that every paper you read, in this course or beyond, will assume set semantics unless it says otherwise explicitly. So be prepared for that. Meanwhile, every implementation, every commercial database, will assume bag semantics, and we'll see where that comes up in the language. I just wanted to put that out there up front: I may play fast and loose with the difference between sets and bags, but it can be important in practice. Okay. So, one lifted set operation: you can define the union of two relations in the standard way, which is natural given that a relation is a set of tuples. In relational algebra notation I'd write it like this, and I can also write it in SQL with the union keyword. And here's where set versus bag comes up. An unqualified union does indeed remove duplicates: the union of this relation, with tuples (a1, b1) and (a2, b1), and this one, with tuples (a1, b1) and (a3, b4), is just these three tuples; the duplicate (a1, b1) didn't get passed through twice. To express this with bag semantics, to make sure we do include duplicates, you say union all, and that would include all four tuples. Okay. You can define the difference operation in the same way, in the sense that you're lifting it from the natural definition over sets: take every tuple in the first relation, and remove any tuple that also appears in the second. We see that (a1, b1), as before, appears in both relations, so you get rid of it, and all you're left with is this one tuple. So why isn't (a3, b4) in there? Well, (a3, b4) doesn't appear in R1, so we know it's not in the answer: all we want is everything in R1, with the things that also appear in R2 removed. All right, so what about intersection? That's another set operation we could lift. You can indeed define intersection, but you don't necessarily need it as a fundamental operator, because you can re-express it in terms of difference: the intersection of R1 and R2 is R1 minus (R1 minus R2). Think about that for a second: the inner expression, R1 minus R2, returns everything that is only in R1; the overall expression then removes everything that is only in R1, leaving exactly the things that are in both R1 and R2. And that's the intersection. We'll touch on this later, but you can also express intersection in terms of join, an operator we haven't defined yet. Okay, the selection operator is how we take the tuples that satisfy a certain condition. We write it with a sigma, subscripted with a condition C. This notation, honestly, we won't use too much throughout this course, but I think it's good to be familiar with it when it does come up.
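Before we go further with selection, here are those lifted set operations in runnable form: a minimal sketch using Python's sqlite3 module, with the two toy relations from the example. Note that SQL spells difference as except.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r1 (a TEXT, b TEXT);
    CREATE TABLE r2 (a TEXT, b TEXT);
    INSERT INTO r1 VALUES ('a1', 'b1'), ('a2', 'b1');
    INSERT INTO r2 VALUES ('a1', 'b1'), ('a3', 'b4');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Set semantics: the duplicate (a1, b1) shows up once, three tuples total.
print(run("SELECT * FROM r1 UNION SELECT * FROM r2"))
# Bag semantics: UNION ALL keeps all four tuples.
print(run("SELECT * FROM r1 UNION ALL SELECT * FROM r2"))
# Difference: everything in R1 that is not also in R2.
print(run("SELECT * FROM r1 EXCEPT SELECT * FROM r2"))
# Intersection rewritten as R1 - (R1 - R2), using a nested derived table.
print(run("""
    SELECT * FROM r1
    EXCEPT
    SELECT * FROM (SELECT * FROM r1 EXCEPT SELECT * FROM r2)
"""))  # [('a1', 'b1')], which is exactly the intersection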
Back to the notation: I'm more interested in you recognizing the English translation of these operators, select, union, join, and so on; the Greek is perhaps less important. Okay, so if we want to find the employees where the salary is greater than 40,000, or where the name is equal to Smith, those are instances of the selection operator. The condition C can involve equals, less than, greater than, greater than or equal to, and so on, but it can be more than that: it can be any Boolean expression. In fact, as we'll see in maybe a segment or two, it can be an arbitrary function that returns a Boolean value. So it doesn't have to be just a < b or a = b; it could be some complicated function. And in between some complicated user-defined function and a simple condition like salary greater than 40,000, you can have arbitrary Boolean expressions. You could have conjunctions: salary greater than 40,000 and name equals Smith. And of course you can say or, and you can say not. All of these are legal. So, as an example, let's apply a selection where salary is greater than 40,000 to this employee table. Which tuples pass the test? Well, John has a salary less than 40,000, so he can be removed from the set. The result of this expression is a table (tables in, tables out) with the same three columns and only two tuples in it. Okay. A project operator eliminates columns, and this is another place where you have to be careful about set versus bag. A projection under set semantics removes all the columns that aren't explicitly listed, but it also removes any duplicates that remain. So if I have Bob Smith and John Smith and I project away every column except last name, am I left with two tuples or one? Well, under set semantics, everything must be a set, so duplicates get removed automatically and I'm left with one. Under bag semantics, they'll both be there. If you write this query in SQL on a commercial database, what you'll get back is two tuples, both Smiths; you get the duplicates, and you can then explicitly ask for them to be removed with a keyword called distinct. We'll see an example of this. All right, so as an example: project social security number and name; we only want two columns out of all the columns in the employee table. And here's set semantics versus bag semantics. Here's the original table with three columns, and we project onto name and salary, which means we get rid of SSN. Now we have three instances of a guy named John with their salaries. Bag semantics is okay with that; set semantics would remove one instance of the duplicate. So the question is: which one's more efficient? Well, removing duplicates is expensive, so just leaving them in place is more efficient.
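Here's the selection-and-projection story as a minimal sqlite3 sketch; the employee rows are invented, with two of the projected (name, salary) pairs deliberately coinciding so you can see the bag-versus-set behavior.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (ssn INTEGER, name TEXT, salary INTEGER);
    INSERT INTO employee VALUES
        (101, 'John',  20000),
        (102, 'John',  20000),
        (103, 'Smith', 60000);
""")

# Selection: only tuples where salary > 40000 pass the test.
print(conn.execute("SELECT * FROM employee WHERE salary > 40000").fetchall())

# Projection with bag semantics (SQL's default): the duplicate survives.
print(conn.execute("SELECT name, salary FROM employee").fetchall())

# Projection with set semantics: DISTINCT removes the duplicate.
print(conn.execute("SELECT DISTINCT name, salary FROM employee").fetchall())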
In fact, that's really one of the key motivations for why commercial databases assume bag semantics: if every single query where you choose columns had an extra duplicate-elimination step, that would be wasteful. In typical applications you may not care, or you may know from the domain semantics that the remaining columns are unique anyway, so you can tolerate duplicates; forcibly removing them just to be pure with respect to the underlying formalism wasn't particularly necessary. So, next, an operator that you won't hear much about, but that I do want to mention for a couple of reasons. One is that it's useful as a reasoning tool when you're thinking about manipulating tables. The other is that it's actually coming up more and more often in analytics and data science applications, even though it didn't come up very often in traditional relational database applications. And that's the notion of the cross product. A cross product says: for every combination of a tuple in R1 and a tuple in R2, produce a tuple in the output. So the size of the result is the size of R1 multiplied by the size of R2. Okay, so fine. An example of where you might see this, to give you a flavor of why it's coming up more often: you often want to find all pairs of objects that satisfy some similarity condition. While in any particular instance of this problem there might be tricks to do it more efficiently, the brute-force method of just generating all possible pairs and then applying some function to determine their similarity is often used in practice. For example, suppose you want to compare two images for similarity: you have two faces, you're trying to see if they're the same face, and you've got a piece of code that can do that comparison. And you have two big tables full of images, or just collections of images. Generating all possible pairs and applying the function is always a reasonable way to do this. These kinds of operations are coming up more and more often. And I'm pausing because I was wondering whether to briefly mention some of the techniques you can use to get around this, but I don't think it will come through very well without a bit more visual aid, so let me skip that for now. All right, so what does the cross product look like? Imagine you have a table employee with these two columns and a table dependent with these two columns, and we take the cross product of employee with dependent: all possible combinations of employee tuples with dependent tuples. So we know there are going to be four tuples in the output, and you can check that John is here twice, once for every tuple in dependent, and so on. So now let's talk about join, and I probably should have put a slide in here about join in general first. The most common instance of join that you're going to run into, and in fact the one we mean if we don't qualify it, is what I'll call an equi-join. An equi-join is a join with an equality condition, right here. Okay, so what does join do? Join says: for every record in R1, find me the corresponding records in R2 that satisfy some condition.
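Backing up to the cross product for a moment, here's the bare operation and the brute-force all-pairs pattern as a minimal sqlite3 sketch; the tables and the similarity threshold are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r1 (id INTEGER, val REAL);
    CREATE TABLE r2 (id INTEGER, val REAL);
    INSERT INTO r1 VALUES (1, 0.10), (2, 0.50);
    INSERT INTO r2 VALUES (3, 0.12), (4, 0.90);
""")

# The bare cross product: |R1| * |R2| = 4 tuples in the output.
print(conn.execute("SELECT * FROM r1, r2").fetchall())

# All-pairs similarity: generate every pair, then filter with some function.
print(conn.execute("""
    SELECT r1.id, r2.id FROM r1, r2
    WHERE abs(r1.val - r2.val) < 0.05
""").fetchall())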
And in general, especially for those of you familiar with databases, you're going to be thinking about joins in terms of equi-joins. In the courses-and-students example, this would be: for every course, find me the students whose IDs appear with that course. Joins on a primary key and a foreign key, if you don't mind the jargon, are instances of equi-joins. But they need not be, and that's one point I want to make. We're not talking much about schemas in this course; I'm more interested in teaching the relational algebra and showing how it comes up than in teaching how to design a database in the first place. The reason is that you often don't have the luxury (I've made this point before) of an engineered schema. You don't have time to build one, there isn't one handed to you in the first place, and perhaps there isn't even much need for one, in that you're going to get your answer to a couple of questions and then move on to something else; there's no real way to amortize the cost of developing the schema. So: if you've worked with databases, most of the joins you've done are probably along predefined relationships called foreign keys, but those need not exist in order to apply a join. And a syntactic note: you can write a join two different ways in SQL. You'll sometimes see select star from R1 join R2 on some join condition, and other times you'll just see select star from R1 comma R2 where that condition is met. Now, if I literally translated the second one into a brain-dead relational algebra plan, it's actually saying: first build the cross product of R1 and R2, and then filter that cross product so the condition is met. And the first one is saying: no, no, don't do that; use the join operator rather than generating the cross product. But databases are not that stupid. The optimizer is smart enough to figure out that, even in the cross-product spelling, the right way to evaluate this is as a join. So in practice there's no difference between these two ways of spelling the same query. In fact, to a first approximation, for two equivalent queries in SQL, different syntax but the same semantics, there's not going to be any difference between them. The optimizer doesn't care how you write your SQL: it's going to turn it into a relational algebra plan and manipulate that plan to find the best way to evaluate the query anyway. I say to a first approximation because there are such things as query hints and other ways to tell the optimizer how you'd like the query evaluated (we won't talk about those because they're rarely important), and because it's not impossible to write two queries that do the same thing but that the optimizer can't figure out are equivalent, in which case you might get two different plans. But typically that won't be the case, and in this example it doesn't matter at all. Join syntax versus a condition in the where clause: I'll tend to write queries the where-clause way when they come up. In fact, I think I did us a disservice by just putting these side by side on the left and right, because this is actually kind of a nice example.
This side is the SQL equivalent of writing a join, and this side is the SQL equivalent of writing a cross product followed by a selection, and they're the same. In fact, in the formalism you might not care about deriving extra operators as long as you can express the thing, but there's so much work, and there are so many good algorithms, for implementing join that it deserves its place as a specific operator. We don't want to have to say cross product followed by select when we're actually talking about efficient join algorithms. More generally, you can have what we'll call a theta-join. This is essentially just a join, but the condition can be anything you want, rather than just an equality condition: it could be greater than or less than, or arbitrary functions, and so on. The all-pairs similarity test I talked about before is an example of a theta-join, and we'll see a more detailed example in a second. And just to point out: the equi-join is itself a special case of the theta-join where theta is the equality condition. All right, so here are some examples of theta-joins, to demonstrate that these come up pretty often in practice, more often than you might be familiar with. Again, especially speaking to people who have experience with databases: these are not going to be along foreign-key relationships quite as often. Okay, so say you want to find all hospitals within five miles of a school. This doesn't immediately seem like a relational algebra query or a SQL query, but it kind of is: it's just a join where the join condition is a distance function over the location of the hospital and the location of the school. And I did a projection here to project out the name of the hospital, because the English version of the question seemed to suggest that we just want the name of the hospital and that's it. In SQL, this might look like this: give me all combinations of hospitals and schools, and then filter to the ones where the location of the hospital is less than five miles away from the location of the school. Here I'm assuming there exists some distance function that knows how to compute this. We'll talk at the end about how new functions that are not part of the language, not part of the relational algebra, can be registered with the system; that's the notion of user-defined functions. But trust me for now that these things can exist. In fact, they don't even have to be user-defined: there are many functions already available in databases for manipulating, say, strings, and there actually are distance functions for geographic information available in commercial databases. So you'll see this structure. The takeaway is that I want you to still think join: just because you don't see an equality condition doesn't mean there's not a join going on. It's the same kind of join as everything else. The other takeaway is just to know the term theta-join in case it comes up. Usually when anybody talks about a theta-join, what they mean is difficult joins, arbitrary joins, the general case of joins. All right.
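Here's a minimal sketch of that pattern with sqlite3, which lets you register a Python function and then call it from SQL, a small taste of the user-defined functions mentioned above. The distance function here is a crude stand-in (plain Euclidean distance over made-up coordinates), not a real geographic calculation.

import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hospital (name TEXT, x REAL, y REAL);
    CREATE TABLE school   (name TEXT, x REAL, y REAL);
    INSERT INTO hospital VALUES ('Northgate', 0.0, 0.0), ('Harborview', 40.0, 0.0);
    INSERT INTO school   VALUES ('Roosevelt', 3.0, 3.0);
""")

# Register a user-defined function: its SQL name, argument count, and callable.
conn.create_function("distance", 4,
                     lambda x1, y1, x2, y2: math.hypot(x1 - x2, y1 - y2))

# A theta-join: the join condition is an arbitrary predicate, not an equality.
print(conn.execute("""
    SELECT h.name FROM hospital h, school s
    WHERE distance(h.x, h.y, s.x, s.y) < 5
""").fetchall())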
So another example you might imagine coming up in practice in your own work: find all the user clicks made within five seconds of some page load. This is much like the distance example before, but now we're thinking in terms of time, which is one-dimensional, and a one-dimensional distance metric is easier to define. We say: take the click time minus the load time of the page, c.click minus p.load, take the absolute value, and see where that's less than five. This might be useful when you're trying to find people who find what they're looking for quickly, a metric that web analytics people might use frequently. If people stare at a page for a long time, that might be good if it's an article, because it means they're reading it; if it's a navigation page, it may be bad, because it means they're not finding what they're looking for quickly. You might also hear about band joins or range joins. These are things like: there's an interval of time, a start time and an end time, in one table, and you're trying to find tuples from another table that fall within that interval. We'll actually see an example of that. Another join to recognize exists is the outer join. Here what you're saying is that you want all the tuples from the left side, R1; we write it like this, with a sort of extra leg on the join symbol. If the tuple on the right-hand side matches, great: you output the pair, just like a regular join. But if there is no match, you still include the R1 tuple, and you pad out the other columns with null as needed. Any value you don't have, make it null. All right. The variants here, which aren't particularly important, are the left outer join, the right outer join, and the full outer join, whose symbol is a little hard to write because it looks like a cross product. The right outer join you essentially never need, because you can always just reorder the operands. Full outer join means you want everything from both tables, padded out with nulls. These are a bit ugly to reason about formally, but they come up pretty often in practice, especially for users writing queries by hand, as opposed to applications. Many times people find it surprising that records in their table disappear because they joined it with another table. But that can happen, because you've said you only want pairs of tuples where some condition matches, and a tuple might have no matches. So the outer join comes up as a useful way to match what SQL programmers expect, especially novices. As an example, we have two tables here, an anonymized patient table and an anonymized job table, and we could do an outer join, as in the sketch below. Now, what columns did we join on? It doesn't say; we omitted it here, although technically we should write it right there in the subscript of the join operator. But you can probably figure it out: you look at the columns the two tables have in common. This one has an age column and a zip column, and this one has an age column and a zip column, so the join is actually on both of those columns.
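Here's that example as a runnable sqlite3 sketch; the rows are invented to match the flavor of the slide, with one patient tuple that has no matching job tuple.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (age INTEGER, zip TEXT, disease TEXT);
    CREATE TABLE job     (age INTEGER, zip TEXT, occupation TEXT);
    INSERT INTO patient VALUES (54, '98125', 'heart'), (33, '98120', 'flu');
    INSERT INTO job     VALUES (54, '98125', 'lawyer');
""")

# A plain join drops the (33, 98120) patient: there's no matching job tuple.
print(conn.execute("""
    SELECT * FROM patient p JOIN job j
    ON p.age = j.age AND p.zip = j.zip
""").fetchall())

# A left outer join keeps it, padding the job columns with NULL (None in Python).
print(conn.execute("""
    SELECT * FROM patient p LEFT OUTER JOIN job j
    ON p.age = j.age AND p.zip = j.zip
""").fetchall())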
For every tuple in the patient table P, find the corresponding tuple in the anonymized job table J where age equals 54 and zip equals 98125, okay? So if we had just done a regular join, then this tuple would be removed from the output, because it has no corresponding tuple on the other side: there is no matching (33, 98120) tuple over there. But because we did an outer join, we do include the tuple, and we pad it out with a null here, okay? And we write this in SQL — I wish I had included this, so let me back up a couple of steps and show you. So just like we have join here, you can actually write outer join explicitly if you want to. In fact you can say left outer join, to match our example. I'm not going to leave that in the slide because it would be confusing out of context, but I'll put it in the slides. The database you'll be using in the assignment is SQLite, which has some nice properties for the single-user case: the entire database is stored as a single file that you can pass around, and so on. So it's a good tool to have in your toolbox, and that's one of the reasons I selected it for the assignments. But it does have some limitations, one of which is that it can't express certain kinds of outer joins — in particular, full outer join. Okay, so now I want to give some examples of how to interpret SQL statements in terms of relational algebra. We're not going to actually write out the plans, but I want to give you some experience staring at what may seem complicated and teasing out what's actually going on. For people who have spent a lot of time around databases and SQL, these may not seem particularly complicated, but if you're just starting out, they probably do. So in this first example, what do we see? Well, what you want to look for when you're staring at something hairy is the FROM clause. In this case it's a little funny, right? Because we see — oops — that the FROM clause does not have a table mentioned; it has a nested query within it. And we remember that that's perfectly fine because of the closure property of the underlying algebra. We know that any relational algebra expression, and therefore any SQL statement, returns a table and operates on tables. So if you're operating on tables and you return a table, you can chain these operations together, and you have this nice closure property. So we know we're allowed to query derived results just like we're allowed to query base tables. Okay, and in this case, we're querying a derived result. Now, let's go down to the inner FROM clause. Here we see another nested query, another layer of nesting — again, perfectly fine. One more layer down, we see the actual table. And where this data came from was a sensor mounted underneath an oceanographic research vessel, collecting measurements of a variety of different variables. A few are mentioned here — fluorescence, oxygen, nitrate — and there are several more. And then there's the latitude and longitude where the ship was located at that point in time, along with a timestamp. So the records are tagged with latitude, longitude, timestamp, and a bunch of measured variables. And what this query is actually doing is aggregating — binning — these measurements into five-minute windows.
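And here's the outer join itself, runnable in SQLite from Python; the anonymized patient and job tables are toy reconstructions of the slide's example, with the (33, 98120) patient surviving the join padded with a null:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE anon_patient (age INT, zip TEXT, disease TEXT);
    CREATE TABLE anon_job     (age INT, zip TEXT, job TEXT);
    INSERT INTO anon_patient VALUES (54, '98125', 'flu'), (33, '98120', 'cold');
    INSERT INTO anon_job     VALUES (54, '98125', 'lawyer');
""")

# LEFT OUTER JOIN: every patient tuple survives; where no job tuple
# matches on (age, zip), the job column is padded out with NULL (None).
rows = con.execute("""
    SELECT p.age, p.zip, p.disease, j.job
      FROM anon_patient p
      LEFT OUTER JOIN anon_job j ON p.age = j.age AND p.zip = j.zip
""").fetchall()
print(rows)  # [(54, '98125', 'flu', 'lawyer'), (33, '98120', 'cold', None)]
```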
Okay, and so we can see how that's done. In this inner query, there's a call to a function that casts the timestamp to a float, which is not particularly important. And here we see this trick where we use a constant value right there in the SELECT clause. What that does is append a new column with the value five for every record, okay? I'll say in a moment why that was done in this particular case. Then in the next layer up, we see this somewhat hairy expression involving the bin size twice. What it's doing is rounding the timestamp down to the nearest five-minute window, okay? So six minutes and 32 seconds becomes five minutes, 11 minutes and 29 seconds becomes ten minutes, and so on. Okay. And in both cases, by the way, we see this star, meaning all the other columns get passed through. Then finally, in the outer query, we have the bin ID, which we computed here. Notice we've got the renaming operator, where we can take this complicated expression and give it a nice convenient name. So the bin ID gets passed through, and then we compute the averages of the other values — the average latitude and the average longitude within that five-minute window as well, okay? So given that we're computing an average, we should expect to see a GROUP BY, and in fact we do: we're grouping on the bin ID, which makes sense. And then we happen to be sorting by bin ID, just to make sure the records come out in timestamp order, because perhaps some application requires it that way. Okay, so why is this seemingly overcomplicated? Why not collapse all this into one expression? Well, you could, but it's for the same reason you might abstract or refactor things in an imperative language. There's a little bit of software engineering being applied here so that this complicated expression can be reused in multiple places. In this case it's only being used once, so you could perhaps inline it, but it separates two different blocks of logic. Okay, so in this slide I've color-coded the three blocks of logic — red, blue, and green — so you can see the layers of nesting. But the main things I want you to take away are, first, that nesting is perfectly fine, and not to worry when you see it being used in practice; and second, that as you start to do more and more complicated analysis in SQL, there are ways to refactor the complicated queries so they don't look so complicated. Another thing you could do here, which we'll talk about in perhaps the next segment, is save this result as a view, right? Give it a name, and then refer to it in the outer query just as if it were a table. Okay, and I'll talk more about that next time. But these are some of the tricks you can play when you're working in SQL — and in fact, things you may see other people do, even if you're not planning on doing much SQL authoring yourself. Okay, fine. So here's another example. Same approach: the first step is to look at the FROM clause and see what you see. And here we see two tables and the keyword INNER JOIN. Now the join is explicit, and the join condition is this one, where we have some sort of ID matching some other ID.
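Here's a stripped-down sketch of the binning trick in Python with SQLite; the sensor table and the 300-second (five-minute) bin arithmetic are simplified stand-ins for the query on the slide:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensor (ts REAL, fluorescence REAL, oxygen REAL)")
con.executemany("INSERT INTO sensor VALUES (?, ?, ?)",
                [(12.0, 0.4, 7.1), (392.0, 0.5, 7.0), (705.0, 0.7, 6.8)])

# Inner query rounds each timestamp down to its 300-second bin and
# passes the other columns through with *; the outer query then
# aggregates the measurements within each bin.
rows = con.execute("""
    SELECT bin_id, AVG(fluorescence), AVG(oxygen)
      FROM (SELECT CAST(ts / 300 AS INT) * 300 AS bin_id, *
              FROM sensor)
     GROUP BY bin_id
     ORDER BY bin_id
""").fetchall()
print(rows)   # one row per five-minute window
```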
By the way, one of the other things I want to do here is show that you can analyze the structure of a SQL statement to understand what's going on even if you really have no idea what the data is about — and in fact, it's kind of helpful to do so. This is something you'll be presented with in a data science context. Someone will say, look, we need to predict what the average sales for next month are going to be. And you say, okay, great, give me the data. And they'll say, well, I don't know, it's in some database over there. And you'll go over there and talk to the DBA — or maybe there won't even be a DBA — and you'll have to analyze what's going on inside that database on your own. That means staring at the schema, which we haven't done, but it also means staring at other people's queries, which you may not have written, okay. So having some skill at analyzing these complicated queries can be important. Okay, so fine. So this looks ostensibly like the join condition, but if you look at the WHERE clause, I want to make a point here: the table aliased as X, this hotspots-and-deserts table, and the table aliased as W are involved in additional conditions down here. Okay. And these are actually join conditions as well. Right — even though this one was explicitly listed as an inner join on this particular condition, anything that applies a condition to attributes from both tables is a join condition. And so this whole thing is actually one complicated join condition. All right. So I went in a little different order than I intended to, but let me backtrack and come back to that. So hold that thought. Popping back up to the top, there's this other piece of complicated logic here. Well, this is a particular syntax available in SQL called a CASE statement, and it acts about the same way as a case statement in other languages. So that's not too bad. But in particular, even if you just ignore all this logic, you can collapse it all down and say: look, there's some function computing len_overlap. And where did I get that name? Well, that's what they named the result of this complicated expression — the length of the overlap. And in fact, I happen to know a little about where this query came from: they were working on genetic sequences. You may be able to deduce that if you stare at this. There's "SNP region" — SNP stands for single nucleotide polymorphism — and there's "strain," which may give you a hint, and BP is base pair, and "non-coding positions" gives you a bit of a hint if you've done any work in bioinformatics. But if you haven't, that's okay. The point is, len_overlap appears to be the name of this thing. So it's almost as if there exists a function, len_overlap, over these attributes, and we don't even care what's inside it. It's just a function. So that helps us see the underlying structure of this query. Okay. Then, back to this other complicated expression. Let me show you a little of what's going on here, just because I think it's a fun example.
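As a toy illustration of a named CASE expression — not the actual SNP query, which we don't have — here's a sketch in Python with SQLite; the seg table and the size classes are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE seg (id INT, a INT, b INT)")   # [a, b] intervals
con.executemany("INSERT INTO seg VALUES (?, ?, ?)", [(1, 0, 10), (2, 5, 20)])

# A CASE expression given a convenient name with AS, the same shape
# as the len_overlap column on the slide.
rows = con.execute("""
    SELECT id,
           CASE WHEN b - a > 10 THEN 'long'
                WHEN b - a > 5  THEN 'medium'
                ELSE 'short'
           END AS size_class
      FROM seg
""").fetchall()
print(rows)   # [(1, 'medium'), (2, 'long')]
```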
If you break this out into its three conditions, and you happen to know something about where this data is coming from, you can see what it's saying: we want the start base pair from the X table to be greater than the start base pair from the W table, and we want the end base pair from the X table to be less than the end base pair from the W table. So it's this picture, right? The X table is filled with ranges — start and end base pairs, intervals — and the blue interval needs to be completely contained within the red interval, or the red interval needs to be completely contained within the blue interval, or the red interval needs to straddle the X start base pair, which you can verify if you stare at this condition. Okay, so they're doing interval logic right there in SQL. And the point of looking at this in enough detail to understand what's going on is a couple of things. One is the point about the join condition: even if you don't understand the semantics, you can see from the structural details that it's just a join condition. But also, you can actually do certain kinds of analytics directly in SQL. This is a fairly non-trivial operation — one that many people, especially people who have either not had good experience with databases or have heard from friends who have not had good experience with databases, would consider essentially impossible in SQL. And it's not impossible, nor is it really even a bad idea. It's actually kind of a natural thing to do. And so analytics in the database should be part of your bag of tricks. The first step should not be "let's pull everything out of the database and then start writing imperative code." Okay, I don't expect this example alone to convince you of that claim, but there's going to be a sequence of arguments that I make, probably throughout the quarter, here and there. Okay, meanwhile I should add the caveat that I'm not going to push databases as the ultimate solution to data science by any stretch of the imagination. But there is a role for them to play. Okay, all right. And so now, we collapsed this into one function, and we can also collapse the join condition into these two things: it must match on this CHR field, and it must have this overlaps condition be true, the one we saw on the last slide. And so this is just an example of a theta join where there's some non-trivial function being applied to each pair of tuples. Okay, let me stop there and pick up user-defined functions next time. Okay, let me add a short addendum about user-defined functions. So we've seen a couple of examples of them, and in fact I'll click back here briefly. We pretended there existed a user-defined function called overlaps and a user-defined function called len_overlap. And I want to point out, at least in this context — or especially in this context — that you can indeed define these kinds of application-specific operations. Okay, so they're called UDFs, and as a user, you can write one of these things. You can register it in the database, you can call it from within your SQL statements, and you can grant permissions for other people to use it as well.
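Here's a simplified sketch of that interval logic as a theta join, in Python with SQLite; the x and w tables are hypothetical, and I've collapsed the three cases into the standard two-sided overlap test rather than reproducing the slide's exact conditions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE x (chr TEXT, start_bp INT, end_bp INT);
    CREATE TABLE w (chr TEXT, start_bp INT, end_bp INT);
    INSERT INTO x VALUES ('chr1', 100, 200);
    INSERT INTO w VALUES ('chr1', 50, 300), ('chr1', 400, 500);
""")

# Theta join with interval logic: match on chromosome, then keep pairs
# whose [start_bp, end_bp] ranges overlap. Two intervals overlap
# exactly when each one starts before the other one ends.
rows = con.execute("""
    SELECT x.start_bp, x.end_bp, w.start_bp, w.end_bp
      FROM x, w
     WHERE x.chr = w.chr
       AND x.start_bp <= w.end_bp
       AND w.start_bp <= x.end_bp
""").fetchall()
print(rows)   # [(100, 200, 50, 300)]
```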
Okay, so the database becomes kind of a repository of user-defined functions that can be called from anywhere. And there are three types to be aware of: scalar functions, aggregate functions, and table functions, and you can tell which is which by how they're used within the SQL statement. So a scalar function can appear pretty much anywhere an expression can appear. Anywhere you can add two attributes together, or subtract them, or do any kind of arithmetic, you can apply a function: in the SELECT clause, in the WHERE clause, in a join condition — anywhere an attribute can appear. Okay, an aggregate function appears in the SELECT clause only and is always associated with a GROUP BY. And here I've indicated perhaps the most common aggregate function that users define when it's not available built in — and often it isn't — which is concatenate. So if you have a table of identifiers and words, and you want to concatenate all the words together to make one long string, that's not always available built in, but it's a very natural thing to want to do, right? I've got a set of small strings and I want one long string. Well, concatenation is easy in most programming languages, and this is how you would express it in SQL: you do a GROUP BY. This example is a bit abstract to walk through live by waving my hands a lot, but if you've got some grouping attribute that defines a set of related strings, maybe I want to concatenate them all together. And so you might define a user-defined aggregate to do this concatenation, built from a string primitive that just appends one string to another. You know, if you have a function that concatenates two strings, you can build a user-defined aggregate that concatenates many strings, okay. Fine, and then table functions appear in the FROM clause, and they're arguably the most complicated. The most common example you'll see is a table function that generates a sequence of integers — say, a sequence from five to ten. If you want to represent all the integers from five to ten as a table, one thing you could do is physically create a table on disk and insert those integers into it, but that's a little wasteful, right? Because we already know what that sequence should be; we don't need to physically store it. Five to ten is maybe not so bad; try storing the integers from one to a million, and it gets worse. And so a function that knows how to generate these on the fly can be useful, and you'll see table functions used in this way. Now — I think this will appear on the next slide, actually — support for these is pretty comprehensive. All databases have them, with, unfortunately, a notable exception being SQLite itself, which is the one you're using for the assignments. So I encourage you to go out and look at other databases. In particular, Postgres — and Greenplum, a commercial parallel database based on the Postgres code base — has really excellent support for user-defined functions. SQL Server, Oracle, and IBM also have great support, but Postgres has a particularly clean interface and was designed way back when with extensibility in mind.
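One hedged note here: while SQLite has no CREATE FUNCTION in its SQL dialect, Python's sqlite3 driver does let you register scalar and aggregate functions from the host language, which is enough to sketch the first two kinds; the exclaim and concat_all functions below are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words (grp INT, word TEXT)")
con.executemany("INSERT INTO words VALUES (?, ?)",
                [(1, 'data'), (1, 'science'), (2, 'sql')])

# Scalar UDF: usable anywhere an expression can appear.
con.create_function("exclaim", 1, lambda s: s + "!")

# Aggregate UDF: the classic "concatenate" example, built by appending
# one string at a time, and used with GROUP BY.
class Concat:
    def __init__(self):
        self.parts = []
    def step(self, value):
        self.parts.append(value)
    def finalize(self):
        return " ".join(self.parts)

con.create_aggregate("concat_all", 1, Concat)

print(con.execute("""
    SELECT grp, concat_all(exclaim(word))
      FROM words
     GROUP BY grp
""").fetchall())   # [(1, 'data! science!'), (2, 'sql!')]
```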
It was one of the first: originally it was a research project, and one of the main goals of that research project was to show that an extensible database was a good idea — one where you could add your own functions, your own types, your own features in various ways. And so in particular, the support for manipulating time data, timestamps, is quite nice there. Okay, so you can define these user-defined functions in a variety of languages, including pure SQL. You may have to scratch your head about why that would be a good idea, but the short answer is that it's the same reason defining functions is a good idea in any programming language, right? It's this notion of abstraction, okay? And reuse. And then there are imperative languages that are extensions of SQL: Microsoft has one called T-SQL, Oracle has one called PL/SQL, and Postgres has one called PL/pgSQL. These add things like running a query and saving the result in a variable, then reusing that variable later, or defining variables of primitive types like integers and reusing those. And there are usually also looping constructs and conditionals and every other kind of feature you might find in an imperative language, okay? So these come up and they're useful, and if you are a database administrator, or you have friends who are database administrators, they're going to be very familiar with these languages. I'm not typically a big fan of using them — not because I don't think they're a good idea, but because I find them to be overused; there are things you can do without having to drop down into writing an imperative program that are not always recognized. The other reason I don't like to use them is not for any kind of fundamental reason: the experience of writing code in these languages is actually kind of painful. They're difficult to debug because there's not great support for debuggers; when things go wrong at runtime, it's hard to figure out what's happening, and so on. And so while I complain a lot when I see SQL logic being pushed into the application layer, I complain less when the alternative is to do it in one of these imperative extensions to SQL, okay? So don't put your joins in the application, but you can put your loops there, okay? So fine. In Microsoft SQL Server, any kind of CLR language from .NET can be used — typically C#, and so on. There's Python, and there's an R extension to Postgres that I haven't had much of a chance to play with yet, but that's kind of exciting, because it gives the folks who are interested in using databases with their statistics routines a much more compelling story for how to do so. Okay. So that's what I wanted to say about that. We'll see examples of user-defined functions come up in a few different places, and other systems use the term as well, so I want you to know that the term comes from databases, and what I mean by it is any code that is not provided by the system itself. All right. Okay, so we talked about algebraic optimization, and then we talked about declarative languages on top of the algebra, in order to simplify expressions and to avoid specifying to the computer exactly how to do things, right?
We're going to leave that open and let the database figure it out. But we stopped at what I'll call logical optimization, and what I want to talk a little about now is physical-level optimization. What I mean by this is that — we hinted at this last time — even after you specify the order of operations, you haven't yet specified every detail needed to actually evaluate the query. Okay. Let me give you an example. Here's a simplified version of a query we looked at last time, where for every order we want to find all the corresponding items that were part of that order, and that's it. Last time we had an extra condition — oops, I'm actually clicking with the mouse, but you can't see that because I'm on the wrong screen. So for every order, find the corresponding items that match; last time we had another predicate down here, and this time I've taken it out. So the algebraic plan this translates into is very simple: it's just a join of the two tables, and that's it. So you'd think we're done, right? We're going to join order and item, and we're finished. Well, we've got to specify how we're going to do that join, so let me tell you about a couple of the options. Option one, in very high-level pseudocode, looks like this: for each record i in item, and for each record o in order, check whether those two records agree on the order attribute, and if so, output the pair — that's a join result. Okay, you know they match. So fine. Option two: first, for each record i in item, insert that record into some sort of data structure — here I'm going to call it a hash table, though I'm not too concerned about exactly what it is. Then, for each record o in order, look up the corresponding records in the data structure we built, and return all the matching pairs. Okay. And if it actually is a hash table we're talking about, this lookup could be pretty efficient, right? Amortized constant time. So option one says: for every record in item, go scan every single record in order — we have kind of an n-squared complexity going on here. And option two says: for every record in item, put it into a data structure, and then for every record in order, look up the matching records in the hash table. And if the lookups really are amortized constant time, this is a linear-time algorithm. So there are two different ways — I argue there are two different ways to implement this join, and both are valid. Okay. So which one is faster? Well, I've hinted that perhaps option two is faster, but in practice it may or may not be. And so here I would normally pause and ask the class, but since this is a video, I can't; I'll give you a moment to think about it. But I want you to think about why option one in particular might be faster in some cases than option two, even though it seems like it should never be. Okay. And we'll see an example in a second. So, leaving that question hanging open, I want to make the point that you have access to this underlying algebra. This isn't something purely theoretical, right? This is something you can use tomorrow if you work with databases at your job.
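Here's a runnable Python sketch of the two physical plans, with toy order and item records; the dictionaries stand in for tuples, and the field names are made up:

```python
# Two physical plans for the same logical join of order and item
# on the 'order_id' field.
orders = [{"order_id": 1, "date": "today"}, {"order_id": 2, "date": "yesterday"}]
items  = [{"order_id": 1, "part": "bolt"}, {"order_id": 1, "part": "nut"},
          {"order_id": 3, "part": "gear"}]

# Option 1: nested loops -- compares every pair, O(n * m).
def nested_loop_join(items, orders):
    for i in items:
        for o in orders:
            if i["order_id"] == o["order_id"]:
                yield (i, o)

# Option 2: hash join -- build a hash table on one side, then probe it
# with the other; roughly linear if lookups are O(1) amortized.
def hash_join(items, orders):
    table = {}
    for i in items:                       # build phase
        table.setdefault(i["order_id"], []).append(i)
    for o in orders:                      # probe phase
        for i in table.get(o["order_id"], []):
            yield (i, o)

# Both plans produce the same join result.
assert sorted(map(str, nested_loop_join(items, orders))) == \
       sorted(map(str, hash_join(items, orders)))
```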
For example, in this particular product, Microsoft SQL Server — and in fact in all the products, through some similar mechanism — you can explain a query, and that will give you access to some form of this algebra I've been talking about. Okay. So I've taken a query — and I've changed the schema on you yet again; this Reuters table is one you'll be working with in the homework — I've written a query and explained it, and what SQL Server Management Studio gives back to me is a little algebraic tree, kind of like the ones I've been drawing here in PowerPoint. Okay. And this one says a hash match is going to be used to implement this join condition. It's kind of a complicated join condition for a reason I'm not going to explain right now, but it has two leaves, and they get joined with this thing called a hash match inner join. Okay. And that's very much like the hash table example I gave on the previous slide. But I want you to take a look at something. Here I've taken the exact same query, but I've added an extra condition: I'm only looking for terms equal to "parliament." And I should probably explain this schema a little. The Reuters data set gives you term frequencies. You have three columns: doc ID — let's just say doc — term, and frequency, where frequency is how often that term appears in that document. Okay. So this is the table you'll be looking at. Right. The previous query was looking for pairs of terms that co-occur in a single document. And now I've said: look, I don't want all pairs of terms; I only want terms that co-occur with the term "parliament." Right — so if "lawyer" co-occurs with "parliament" frequently, this query will find it. I'm looking for all the terms that co-occur in some document with "parliament"; that's what this query is expressing. Okay. Now, what I want you to notice, though, is that when I explain this query, I get a different physical plan. The logical plan looks the same — it's still scan, scan, and a join — but the algorithm to compute the join has changed. Now it's this thing called nested loops, and nested loops corresponds exactly to the pseudocode here — that's why they call it nested loops: the outer loop and the inner loop. Exactly the same thing. So the optimizer chose this nested loops plan, even though we argued it was an n-squared algorithm that probably wouldn't be chosen very often. So why in this case? If you think about it, one side of this join is only dealing with the occurrences of the term "parliament" in a document, which is a very small relation. And with a very small relation, the nested loops algorithm can be very, very efficient — faster than paying the overhead of actually constructing a hash table or some other data structure. Okay. So the main takeaway here, as opposed to the details, is that different physical algorithms are appropriate at different times, and thanks to the declarative language and algebraic optimization, the programmer doesn't have to worry about any of that. They don't have to make that choice. Okay, so this is a very, very powerful idea. You just express the query, and the database does the rest. All right. So fine. And just to point out, this is not unique to SQL Server.
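You can try this yourself even in SQLite, whose EXPLAIN QUERY PLAN is its version of the same mechanism; the frequency table here mimics the shape of the Reuters data, though SQLite's plans are simpler than SQL Server's, so treat this as a sketch of the idea rather than a reproduction of the slide:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE frequency (doc TEXT, term TEXT, freq INT)")

# Pairs of terms co-occurring in the same document, restricted to
# documents containing 'parliament' (same shape as the slide's query).
query = """
    SELECT a.term, b.term
      FROM frequency a, frequency b
     WHERE a.doc = b.doc
       AND a.term = 'parliament'
"""

# EXPLAIN QUERY PLAN is SQLite's window onto the physical plan;
# Postgres spells it EXPLAIN, and SQL Server Management Studio
# draws the plan tree graphically.
for row in con.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```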
You can generate these kinds of algebraic plans in Postgres by using EXPLAIN, and in fact they look somewhat nicer. Here are these hash joins again. This actually shows you — whoops, excuse me — this shows you where it's building the hash table as step one and then probing it as step two, and the same thing here. And this is another operator we didn't talk about, where you count all the members of some group, I'll put it that way. The hash here is on the group ID, and you can apply functions to the rest of it. But I shouldn't give such a cursory view of that without talking about it more, so let me skip it altogether. Okay. So fine: the algebra really does exist. You can look at it directly just by using the keyword EXPLAIN, and I advise you to do so. If you work with databases, you should be using EXPLAIN all the time to try to understand what's going on. All right. Another point I'll make is that this matters. This next example is not from SQL directly, and in fact it's not from a commercial database; it's from some research we do in my group. But the point is the same. These are different physical plans for the exact same query, and here I'm doing something in parallel, so this axis is actually the number of processors being applied. As you go from four to 16 processors, times go down a little — not as much as we'd like, actually; they should be going down quite a bit. But the point is that each one of these plans takes a very different amount of time. Well, these two are about the same, but the overall difference is pretty important. So ignoring these opportunities and sticking with only the plan the programmer specifies would be a big mistake. Okay. And then another illustration of this, which is a little hard to stare at, but let me give it a whirl. This is some very nice work by Haritsa et al. — there's a whole series of papers on this work. They visualize the space of possible query plans. The two axes here are, for a single query, the parameters to that query: this one says something about the supplier account balance, and this one is a parameter on the extended price. They change the values of those parameters in the query — imagine the same SELECT statement, where some condition compares against extended price and some other condition compares against account balance — and just by varying those two knobs, you get this really rich tapestry of different plans being selected by the optimizer. Each color in this space represents a different algebraic query plan selected by the optimizer. Okay. And I think the takeaway here is that a very complex decision is being made by the database, and necessarily so: these different plans actually matter. They don't show it here, but you can show that the database tends to do a pretty good job of finding the right plan — and, as I argued on the last slide, the difference in running time can be pretty significant. Okay. So leaving this kind of complexity up to the programmer would be a big source of loss, right? Hiding this complexity is a huge, huge win. Okay. Last time we talked about algebraic optimization.
And I argued — without going into a lot of detail — that all three of these expressions were equivalent, and that they differed only in the order in which things were evaluated. Here you evaluate this join first and this join second. In this expression you evaluate this join first and this join second. And here you find all possible combinations of tuples and then filter them. If you don't understand exactly what's going on in these expressions, that's okay; we'll talk about it in this segment. The takeaway is that there are three equivalent expressions, and we don't necessarily know which one is the fastest to evaluate. But the database can figure this out, and does, every time you write a query. Okay. And that's this notion of algebraic optimization. Now, even if you're familiar with databases, you may not be familiar with the relational algebra, which might seem strange, because I've argued that it's the hallmark of databases and totally fundamental. So why don't we think about programming databases in terms of writing relational algebra expressions? Well, another key idea associated with relational databases is the notion of declarative languages. What we mean by a declarative language is that you specify the answer that you want, but you do not specify anything about how to get it. A relational algebra expression actually does specify an order, right? As I showed on this slide, here are three different expressions, each indicating exactly which order to do every operation in. That means if you write an expression like this, you're instructing the computer: look, do it in this particular order, okay? Declarative languages instead say: look, we're just going to describe the properties that must be true of the result, and we're going to let the database figure out the right order in which to compute it. And so here's a quick example. Imagine you have two tables. One is order, with three columns: order, date, and account. The other is item, with two columns: order and part. And the semantics is that this order column indicates which order the item is associated with. Okay. So if you want to say "find all orders from today, along with the items ordered," you might write this query — and if you've seen SQL plenty of times before, bear with me, and if you haven't, pay attention. So: SELECT star, give me all columns from the table order and all columns from the table item, but I only want records such that this condition is true: the order column from the order table matches the order column from the item table. And further, I only want orders from today, where order.date equals today. So these are just conditions expressed over the result, without any kind of indication of how to actually compute the answer. What automatically happens is that this query is translated into a relational algebra expression along the lines of what we've already seen. Here I've drawn a cartoon: scan the item table, scan the order table, select the records such that date equals today, and then perform the join — for each record in order, find the corresponding records in item that match on the order column. Okay.
So this is happening every time you run a query. SQL is the what, not the how. Let me give you another example. Here are three tables: product, purchase, and customer. There's some underlining here that we haven't talked about, but it indicates what makes these records unique: the PID makes a product unique, the CID makes a customer unique, and the combination of PID and CID makes a purchase unique. Okay. So here's another SQL query. We say SELECT DISTINCT the product name — and why do I know it's the product name? Because I see an X here and an X here, and X is an alias for the relation product — and the customer name, which I know because I see a Z here and a Z here, and Z is an alias for the relation customer. From these three tables, where the product ID in the product table matches the product ID in the purchase table, and the customer ID in the purchase table matches the customer ID in the customer table. And that's a typo — that should be Z.CID, so maybe change that on your own slides; let me see if I can fix it now. And then we want only the products whose price is greater than 100, and only the customers whose city is Seattle. All right, so what does this say in English? Find me combinations of products and customers — unique combinations — where the customer is in Seattle and they paid for a product worth more than 100, okay? So it's clear what we want, but it's unclear how to get it. It's kind of a complicated query, okay? Translating this into relational algebra, we get this. At the bottom we have product and purchase, and we do this join, where for every product we find the corresponding records in purchase. Then we do another join: for every record in the result of the first join, find the corresponding records in customer, right? Now filter: we only want the records where price is greater than 100, and we only want the ones where city equals Seattle. And then, in this case, we project down onto these two columns. What I mean by project is get rid of all the other columns except the two we're interested in, okay? And then finally take that as the final answer. So the two points here are that the execution order is now clearly specified, but a lot of physical details are still left open. This is a very high-level indication of what's going on: the order of operations is clear, but that's about it. We don't know exactly how we're going to do the join, and there are multiple ways to do it. I've indicated that for every record in product, we're going to look up the corresponding records in purchase, but we haven't said precisely what that means. Okay. I'll give an example of this in a second. So, another example. Here we only have a single relation called R, with three columns: subject, predicate, and object. You see this kind of schema when you work with RDF data — the Resource Description Framework. RDF is a language, a formalism, and a software stack for managing what is called linked data. And here, everything is a set of facts: any kind of fact you can come up with, you can encode in RDF. You can say "the instructor of this course is Bill Howe": the subject is this course, the predicate is has-instructor, and the object is Bill Howe. Okay.
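Here's that three-table query end to end in Python with SQLite, with made-up rows; the y alias for purchase is my addition, since the slide only names x and z:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product  (pid INT PRIMARY KEY, pname TEXT, price REAL);
    CREATE TABLE customer (cid INT PRIMARY KEY, cname TEXT, city TEXT);
    CREATE TABLE purchase (pid INT, cid INT, PRIMARY KEY (pid, cid));
    INSERT INTO product  VALUES (1, 'laptop', 1200), (2, 'pen', 3);
    INSERT INTO customer VALUES (10, 'Alice', 'Seattle'), (11, 'Bob', 'Portland');
    INSERT INTO purchase VALUES (1, 10), (2, 10), (1, 11);
""")

# Declarative: name the combinations you want -- Seattle customers and
# the >$100 products they bought -- and let the optimizer pick the plan.
rows = con.execute("""
    SELECT DISTINCT x.pname, z.cname
      FROM product x, purchase y, customer z
     WHERE x.pid = y.pid
       AND y.cid = z.cid
       AND x.price > 100
       AND z.city = 'Seattle'
""").fetchall()
print(rows)   # [('laptop', 'Alice')]
```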
And so people use this formalism as a very general way of encoding information from any source, and we might touch on it much later in the course. Okay. So here's kind of a complicated query. What it says is: I have three instances of the same relation, and I'm going to join them all up. I'm looking for a sequence of tuples such that we have a person who knows another person, who holds an account with a company, which has an account homepage with a particular value. All right, so you're looking for this chain, where the first edge is "knows," this edge is "holds account," and this edge is "account homepage." So find all possible combinations in this table — find me all instantiations of A, B, C, and D — such that this pattern matches. Okay. And the connections are specified by these conditions: the object of the first relation must equal the subject of the second relation, and the object of the second relation must equal the subject of the third relation. Okay. So I'm looking for patterns in the graph that look like this. And in relational algebra, you see this query gets translated into this form. There's a selection to find predicate equals "knows," a selection to find predicate equals "holds account," and a selection to find predicate equals "account homepage." Then a sequence of joins, and finally a projection, in the SELECT clause, to pull out the final answer we're interested in. Okay, so perhaps a complicated example, but I think the takeaways are these: I wanted to mention RDF, because you might come across it again; I wanted to demonstrate that you can access the same relation more than once in a single query; and I wanted to give another example of translating even complicated queries into relational algebra expressions. Okay. Okay, so we talked about physical data independence, and we talked about algebraic optimization. I want to talk about another kind of data independence, which is logical data independence. So, we argued that physical data independence was the ability to insulate applications and protect applications from changes in the physical organization of the data. All right — if things get rearranged on disk, we don't want to have to rewrite all the code in the application. This is what databases provide, and relational databases in particular do a great job of providing it. But if you go back to Ted Codd's first paper, and even the quote I gave you, he talks about insulating applications from changes to the internal representation, but also insulating applications from changes to some forms of external representation. What he means by external is things like adding a column to a table. So this isn't an internal shuffling of the bits on disk; it's actually a logical change to the table. There's more data there than there was before. But if you think about it, if your code doesn't care about that new column, you shouldn't have to rewrite it just because there is a new column. Okay. So the ability to provide this logical data independence comes from the concept of views. All relational databases have this concept, and somewhat surprisingly, I find them to be underused in practice.
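Here's a runnable sketch of that triple self-join in Python with SQLite; the facts and predicate names are invented to match the pattern:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE r (subject TEXT, predicate TEXT, object TEXT)")
con.executemany("INSERT INTO r VALUES (?, ?, ?)", [
    ("alice",  "knows",           "bob"),
    ("bob",    "holdsAccount",    "acct42"),
    ("acct42", "accountHomepage", "http://example.org/bob"),
])

# Three instances of the same relation, chained: the object of each
# edge must equal the subject of the next -- a path pattern in the graph.
rows = con.execute("""
    SELECT r1.subject, r3.object
      FROM r r1, r r2, r r3
     WHERE r1.predicate = 'knows'
       AND r2.predicate = 'holdsAccount'
       AND r3.predicate = 'accountHomepage'
       AND r1.object = r2.subject
       AND r2.object = r3.subject
""").fetchall()
print(rows)   # [('alice', 'http://example.org/bob')]
```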
So if you know what views are, great; if you're using them, even better; and if you use databases but have never heard of views, then this is a great time to learn about them. Okay. So what is a view? A view is just a query with a name. I write a query, I give it a name, and I put it in the database. I can then access that view as if it were a table in the underlying database itself — as if it were a physical table. So why can we do this? Well, I talked about this notion of algebraic closure before, right? And this is exactly what allows us to do this. We know that every query returns a table: we take tables as input, do some manipulation, and produce a table. So we say the language is algebraically closed, and therefore the result of any view will always be something we can run other queries against. We can stack queries on top of queries on top of queries. Okay. So why might we want to do this? One reason is to protect the underlying data. You can assign permissions to tables. So, for example, if you only want a particular user to see data associated with their account, you can write a view that filters out everything except their account and then grant them access to that view. But the most direct benefit is that it allows you to expose data according to a logical organization that makes sense for the user — even something as simple as hiding a join. If you decide to reorganize your data into two tables, requiring that programmers use joins to link them back up again, you can simply write a view that hides that join and let everyone access its result. Now, this may sound expensive, but the cool trick is that because of this algebraic closure, the user's query gets composed with your query that defines the view, and the whole thing gets sent as one big block to the database for evaluation. So the database simply doesn't care whether it came as a view plus a user query or as one query directly from the programmer; it's going to optimize it exactly the same way. Okay, so there's nothing but benefit here. Okay. So let's see an example. Given the schema purchase and product, define a view called store_price, with two columns, store and price, that has this definition: select store and price from purchase and product where the product IDs are equal. This is a little funny, because we didn't put PIDs in the schema here, so this is a little bit wrong — let's assume this column is PID and this one is also PID, and then it matches the query down here. Okay. So this result is now like a new table, and just as I said a second ago, you've now hidden the join from the users. Complexities like these column names are things you can insulate your users from, and you can name the view's columns whatever you want. And this is what logical data independence means: no matter how I want to logically organize my tables internally, I can expose a different perspective on the data to users. It separates the people who administer the data from the people who actually access it. Okay. Logical data independence. Key idea. All right. So how do we use a view? Well, as I've said, all you have to do is reference the view in a query just as if it were a table.
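Here's the whole view workflow as a runnable sketch in Python with SQLite, with toy data and the assumed PID columns; the thousand-dollar filter anticipates the high-end store example coming next:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE purchase (pid INT, store TEXT);
    CREATE TABLE product  (pid INT, price REAL);
    INSERT INTO purchase VALUES (1, 'downtown'), (2, 'mall');
    INSERT INTO product  VALUES (1, 1500.0), (2, 40.0);

    -- A view is just a query with a name; it hides the join.
    CREATE VIEW store_price AS
        SELECT x.store, y.price
          FROM purchase x, product y
         WHERE x.pid = y.pid;
""")

# Query the view exactly as if it were a table: the view definition is
# folded into this query and the whole thing is optimized in one go.
rows = con.execute("""
    SELECT DISTINCT store
      FROM store_price
     WHERE price > 1000
""").fetchall()
print(rows)   # [('downtown',)] -- the "high-end" stores
```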
And so here, suppose we want the notion of a high-end store, defined as any store that has sold some product for over a thousand dollars, and for each customer we want to find all the high-end stores they visited. Well, being able to directly reference the store_price relation we defined on the previous slide — this view — helps simplify this query, right? You just write a query that directly accesses the view as if it were a table. And okay, so how is this actually evaluated? Well — oops — that's actually what's really fantastic about databases: this query will just be folded together with the view definition and passed to the database, where the whole thing is optimized in one go. So you don't need to worry about the difference between having a stack of five views and writing it all as one query; it all gets folded together. The database doesn't care. It translates the whole thing into one big query, converts that into an algebraic expression, and then runs the normal optimization procedure to come up with the best possible plan. So it's basically free abstraction, right? It simplifies things for the programmer without any kind of performance cost. It's equivalent to writing the whole thing out by hand — you don't pay a penalty — but you can actually do better than that with views in some cases, by materializing them. We're not going to talk too much about that, because it's very specific to databases and we don't see it as often in the broader context of data science that we're trying to cover. But it's a good trick: once you have the mechanism to store views, you can essentially cache their results, and that's what we call materialization. Okay, so the last key idea I want to convey about databases is that of indexes. While indexes are certainly not unique to databases, databases are perhaps unique as a platform that makes them very easy to apply and deploy and automatically take advantage of. Databases are especially — but not exclusively — effective at needle-in-a-haystack problems: looking up individual records, or small sets of records, in large data sets. They do other things very well too, but this is one thing they're quite good at, and the reason is that they can apply indexes. This second part is a little bit of context, but what I mean here is: if you're trying to write code to do this yourself in some programming language like Python or C or R, you're going to be a slave to what sizes of data fit into main memory, as we said before. Now, you can absolutely be clever and start bringing in one chunk of data at a time into memory, processing it, writing it out to disk, and bringing in the next chunk, and so on. But the code will very quickly become very, very complex. This is something databases already know how to do. And so your query will always finish, regardless of database size, as long as the data fits on disk. It doesn't matter how much memory you have available; it'll eventually finish. It may or may not be that fast, but it'll finish. Databases already know how to take advantage of main memory in this near-optimal way. And it's not easy, right? It's a pain in the butt to try to code that yourself. So effective use of the memory hierarchy, effective use of indexes — these are things that databases do well.
They're a great platform for applying these tricks. And so finally, what I mean here is that indexes are easily built and automatically used by the optimizer. To create an index, you write a statement like this. Here I've changed the schema on you once again: now we're filtering genetic sequences. And if I create this index, then this very simple query, which looks for all sequences matching a particular value, will automatically take advantage of the index if it's there. You don't have to tell it to do anything; you write the exact same query, okay? Welcome back. So this time I want to talk about what the term "scalable" means. We made the point that working with really large data is an important aspect of data science, and we've mentioned the word scalability before, but we haven't really talked about what it might mean. So here are a couple of different ways to think about it. Operationally, in the past, one way to think about it was: look, it needs to work on data that doesn't fit in main memory on a single machine, okay? So maybe you still only have one machine to work with, but this means you need to be able to bring data off of disk in pieces, operate on it, and then maybe write it out in pieces, okay? So: a bundle of algorithms that work on data in this fashion, bringing it into memory piece by piece, such that the memory footprint at any given point is small. This is something databases have provided, all right? You could write a query and know for sure that it was going to finish, as long as the data was there on disk and you had some minimal amount of memory, at least to get started, okay? I might use the term "out-of-core processing" here. Out-of-core means it uses the disk to operate; in-core means everything you're doing fits entirely in main memory. And so the database community were specialists at out-of-core processing of large data sets. But increasingly, this notion of scalability wasn't really enough, okay? You saw this pretty acutely with websites coming online in the 2000s, where with one big server — no matter how big you bought that server — you couldn't bring data off of disk fast enough to meet all the requests. So you had to start making sure things were in memory, and the only way to do that was to start adding more machines, okay? And so increasingly — Google especially is known for this, although many of the large web companies do it — "scalable" really means being able to use thousands, or maybe even tens of thousands, of cheap computers, all applied to the same problem. We might call this scale out, while getting bigger and bigger main memory and more and more cores would be scale up. Fine. Another way of looking at this, which is maybe a little more precise, is in terms of algorithmic complexity, which you may or may not be familiar with depending on how much computer science you've taken, but let me give you a flavor of what's going on. In the past, you might call an algorithm scalable if, given n data items, it does no more than n-to-the-m operations, okay?
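Here's that in miniature with SQLite from Python — build the index with one statement and watch the plan change, leaving the query itself untouched; the sequences table is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sequences (seq TEXT)")
con.executemany("INSERT INTO sequences VALUES (?)",
                [("GATTACA",), ("TACCT",), ("GTACA",)])

query = "SELECT * FROM sequences WHERE seq = 'GATTACA'"

# Before the index: the plan is a full table scan.
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# One statement builds the index; the same query now uses it
# automatically -- no change to the query itself.
con.execute("CREATE INDEX seq_idx ON sequences (seq)")
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```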
So m may be one, in which case it's a linear-time algorithm, or maybe two, in which case it's a quadratic-time algorithm, and so on — but this was deemed, you know, tractable, right? You would prove that you could find a polynomial-time algorithm for some problem, and that was assumed to be scalable; that was the definition of what scalable was. Things that were non-polynomial — for example, exponential, where you might have m to the n — grew much, much faster, and they didn't scale, okay? But this isn't a very tight bound on scalability in practice, right? A quadratic-time algorithm may be feasible, but by the time you get to n to the fourth and so on, it becomes pretty difficult for very large data sets, okay? So now you would say it really can't just be n to the m; it's got to be n to the m divided by k, for some pretty large k — k being the number of computers you can apply to the problem. So you have to come up with algorithms that can really exploit this parallelism properly, okay? And then one more point, which I'll make now but won't return to in this segment — I hope to at the end of the course — is that it may be that soon even this isn't good enough, and for n data items you really should do no more than n log n operations, where the n here means every data item that comes in over the wire. So this is applicable to streaming applications, where the data is coming in so fast that you only get one pass at it. For every item, I'm allowed to process that data item and then put it into some sort of index — and that's this log n factor, okay? And whenever you see a log, you should think trees. So I'm allowed to take each item, inspect it, work with it, and stick it into some tree data structure, but that might be about it. The data is just too big to make multiple passes over, okay? An example of this might be the Large Synoptic Survey Telescope that we heard about: if it's producing around 30 terabytes a night, you can't make too many passes over that data, okay? And so this whole area we think of as streaming data, which I guess I have written here, but I'll write it again. And we'll come back to some techniques for dealing with big data in a streaming context. Okay, so fine. Two different views of what scalable might mean, and in this segment we're going to look at some examples and build some intuition for the second one: can we make use of lots of computers, this n over k? Okay, all right. So here's a little example problem that's admittedly somewhat oversimplified. We want to find all matching DNA sequences, where a sequence is a short string over the letters G, A, T, and C. You're given a short sequence, and you want to find all the ones that exactly match it, okay? So: find all the sequences that are exactly equal to this one you're given. So how would you do this? Well, with this little cartoon, imagine each of these black lines is a sequence. All right: this black line corresponds to this sequence, this black line corresponds to this sequence, and all the other black lines are other sequences. And think to yourself for a minute: propose an algorithm to find the sequences matching your target sequence, okay?
We make no assumptions about the data whatsoever; it's just given to you as a list. One thing you can do is a linear search. So we inspect the first item and compare it to our target sequence. If they're equal, great, we found one. And if they're not equal, what do we do? Well, we move on to the next one. So this one is not equal — and this all happens at time zero — and we move on. At time one, we check another sequence, compare it for equality, and if it doesn't match, we keep moving. And so on and so on, until we get to time 17, where we find a match. And here I've said "contains" instead of "equals" — I guess I changed the meaning slightly. But yes, we found a match, and we sent it to the output. Okay, so how long does this take? How many operations did we do? Well, we were given 40 records in this cartoon, and we made 40 comparisons. So with n records and n comparisons, we say the algorithm's complexity is order n. Okay, so this is a linear-time algorithm for this simple search-and-retrieval task. So the question is, can we do better? And if you've had some experience thinking about data structures — taken some data structures classes — you should be thinking: yes, we can. All right, so one way to do this is to sort the sequences. How does this help? Well, certainly we could still run the linear-time algorithm and inspect these one at a time. But we can also do something a little smarter. What if we start in the middle? So start in the middle and compare our target sequence to the sequence we find there. They're not equal, but we can see that the one we found is less than our target sequence. Okay, so we know the sequence we're on is to the left of our target; the target must be somewhere to the right, in this direction. So we've just removed the need to check half of the data — 20 records. Okay, so now jump to the middle of the remaining half and compare again. And now we see that, well, boy, we overshot: this one is greater than the target. Well, once again — we removed half the records on the first step, and now we've removed half of what remained, so we know the target is in this range. Split it in half again, compare, and now it's less than — so we're bouncing back and forth around our target. Let me see if I can draw this a little better than I did: cross those out, cross those out, now cross these out. And so we know it's somewhere on this side, and in the next step, we find a match. Okay, and if we have multiple copies of the same item, we know they'll appear next to each other, so we can just walk through the records gathering up all the ones that match, if we need to. Okay, so how long did this take? Well, we still have 40 records, but we only made four comparisons. Right: with n records, we made log n comparisons. We navigated an implicit binary tree; we did a binary search over the sorted data. Okay, so this lookup was order log n. Now, we did have to sort the data ahead of time, and if you have to include that, it's an n log n operation that we're not going to dwell on. But once the sorted data is available to you, lookups take log n operations. And so this is potentially far better scalability.
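Here's the same idea as runnable Python, using the standard library's bisect module over a tiny made-up list of reads:

```python
import bisect

reads = sorted(["GATTACA", "TACCT", "GTACA", "AACGT", "CCTAG"])
target = "GTACA"

# Binary search over the sorted list: O(log n) comparisons instead of
# the O(n) linear scan. bisect_left finds the leftmost match, and equal
# items sit next to each other, so we walk right to gather duplicates.
i = bisect.bisect_left(reads, target)
matches = []
while i < len(reads) and reads[i] == target:
    matches.append(reads[i])
    i += 1
print(matches)   # ['GTACA']
```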
And this is a good trick, such a good trick that it's been baked into many systems, especially, I'll argue, relational databases. And we made this point before, but I wanna make it again: databases are good at these kinds of needle-in-a-haystack problems, right? Extracting small results from big data sets. They can transparently provide this sort of old style of scalability. What I mean by old-style scalability is handling data that doesn't fit in main memory, as we've said a couple of times: your query will always finish regardless of the size of your main memory. In addition, a database makes an excellent index platform, a platform for building and using and reusing indexes. Okay, so relational databases are good at old-style scalability in the sense of out-of-core algorithms, and they're also good at old-style scalability in the sense of finding logarithmic time algorithms. Indexes are easily built and automatically used when appropriate. We talked a little bit about this during the relational database segments: you can write a single statement, CREATE INDEX, give it a name, a table, and a column name, and it will sort records according to that column. The actual data on disk may or may not be physically sorted, depending on the details of which system you're using; typically this statement would not actually move the physical records around, but would build an auxiliary index. But regardless, you get to take advantage of this logarithmic time access pattern. Okay, and so just by writing this one statement, one line of code, you can create the index and take advantage of it, and every query that comes afterward that would benefit from using that index is able to. The optimizer automatically selects the correct index if it's appropriate to use. So this is much easier than having to rewrite your code by hand in order to either make it out-of-core or take advantage of an index. Okay, so when you're comparing relational databases to, say, scripts in R or scripts in Python, there's a lot of algorithmic work that's already been done for you, that you're getting for free just by turning your problem into a SQL statement. So you're buying into a lot of code even though you're, in a sense, tying one arm behind your back by writing it as a SQL statement. It's not just SQL versus a much more expressive general-purpose language; you're actually getting a lot of benefit out of doing it this way.
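As a small illustration of how little code this is, here's a sketch using Python's built-in sqlite3 module; the table, column, and data are all hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (id INTEGER, seq TEXT)")
conn.executemany("INSERT INTO sequences VALUES (?, ?)",
                 [(1, "GATTACA"), (2, "TACCT"), (3, "GTACA")])

# The one line of code: build an index (logically, a sort) on the seq column.
conn.execute("CREATE INDEX idx_seq ON sequences (seq)")

# Every later query that would benefit from the index can use it;
# the optimizer decides automatically.
rows = conn.execute("SELECT id FROM sequences WHERE seq = ?", ("TACCT",)).fetchall()
print(rows)   # [(2,)]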
Okay, so let's look at another task called read trimming. Here we're given the same set of DNA sequences, but instead of searching for one particular sequence, we're going to trim the final few base pairs from each sequence. So we're gonna trim off a suffix and return the data set where each read is now just a prefix of the former read, okay? Oh, and the reason why you'd need to trim off the suffix is that this actually comes up in practice: the accuracy of the sequencer drops off fairly abruptly after a certain read length, and trimming off the last several base pairs from every single read is kind of a standard pre-processing operation. Okay, fine, so how do we do this? Well, we can do the same trick that we tried the first time with the search task, meaning that we can process each record in turn, one at a time. So at time zero, we can trim off the suffix here and just return TACCT, and at time one, we can trim off this suffix, and so on. And at time 17, there's our old friend that begins with GTA, and so on. But here, unlike the search task, there's no index that's really gonna help us, right? We have to touch every single record and manipulate it: we have to take a prefix from it and remove the suffix. And so the operation is fundamentally order N. There's not gonna be an algorithm that is less than order N, right? You have to at least touch every single record. Okay, but can we do any better? Well, yeah, right? Processing the first record is completely independent from processing the last record, which is completely independent from processing this record, okay? So while there's no index, we can break this data set into pieces and process each piece independently. So imagine we take our single data set and break it into chunks, and assign each chunk to a different machine, or maybe a different processor, to be a little bit more general. Now at time zero, we can process one sequence from each chunk, all at the same time. And at time one, we process the second sequence from each chunk, all at the same time, and so on. And so here, how much work did we do? Well, we did the same amount of work; we still processed all 40 records. But how much time did it take? Well, it only took seven, I say cycles here, seven time units to be a little more general, because we were given these six workers. And so the complexity here is N over K, right? For N items and K workers, we need N over K time steps to complete the work, okay?
Last time we talked about scalability, and we argued that scalability really means working in parallel. And we talked about this specific task of read trimming: you have a bunch of small genetic sequences, and your task is to trim off the last few characters from each one. And we showed that this is pretty simple to think about in parallel. You divide the set of reads into chunks, put them on separate computers, and process them all in parallel. There's a function F that takes a single read, trims off the last few characters, and returns the prefix. You apply this function in parallel, and you get out the data set you want, which is the set of trimmed reads, okay? So let's see some more examples. Here's a task that actually needed to be performed at the New York Times in 2008, and they have a couple of blog posts about it that you can read. The simplified version was to convert a bunch of TIFF images into a different format. What was really going on is that they had digitized images from the newspaper, along with some information from optical character recognition, so some extracted text, and they wanted to turn all this into a more web-friendly format. So they had to convert the images to a web-friendly format, and they also had to convert the extracted text into a little package of JavaScript code, okay? They were getting ready to put this stuff on the web. All right, so this was 405,000 images, which was quite a bit, especially at the time. But the schematic looks sort of similar, right? You take a big set of TIFF images, you split them into chunks and put them on a bunch of different computers, and you have a function F that converts a TIFF to a PNG, and does the other work too, let's say. And what you get out is the data that you want: a bunch of PNG images distributed across these machines, fine. So let's look at another example. Now we wanna run thousands of little simulations, and what we have are the parameters to each one of those thousands of simulations. An example of this, at the URL at the bottom of the slide, is from simulating muscle dynamics. And this comes up a lot: you have these Monte Carlo simulations, where you try to understand some phenomenon stochastically by running lots and lots of simulations with lots of different inputs and then kind of averaging the results, as opposed to modeling everything precisely. So now you wanna run thousands of simulations. Well, you have a set of inputs, parameters to these simulations; you break them into chunks, put them on separate machines, and apply the function. And here the function is actually running the simulation. What you get out is the output of the simulations, distributed across all the machines. Okay, so a pattern should be emerging here, right? So another example. Imagine each one of these little bars is a document, and your task is just to find the most common word in every individual document. Well, same thing: distribute the documents across the K computers, and your function F now opens up a single document, figures out which word is the most common in that document, and just produces that word. And so now you have a big distributed list of pairs, where the first part of the pair is the document ID and the second is the word, okay? So that could be useful, but it's a bit contrived. Consider a slightly more general program that computes the word frequency of every word, still in a single document. So instead of just finding the most common word and producing that, now you're gonna produce a histogram of the frequencies of every word in the document, okay? So given this input, you produce all of these items, a set of items. The only reason I'm making this distinction from the last one is that the last one took a single document and produced a single word; now we're taking a single document and producing a whole set of things, to make it clear that that's allowed, okay? I hope this animation isn't confusing here, but I don't think I'll bother fixing it. So you have millions of documents, you distribute them again, and now your function returns a set of word-frequency pairs. And now we have lots of little lines here, each line representing a single word, let's say. So they're not one-to-one anymore, but that's no problem; the function just returns a set of things. All right, so fine. So there should be a pattern here, right? We have a function that maps a read to a trimmed read. We have a function that maps a TIFF image to a PNG image. We have a function that maps a set of parameters to a simulation result; it's the simulation itself. We have a function that maps a document to its most common word. And we have a function that maps a document to the histogram of its word frequencies, okay? So far, so good.
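All five of these have the same shape: a function F applied independently to each object. Here's a minimal single-machine sketch of that shape, using Python's multiprocessing pool with read trimming as the example F; the reads, the trim length, and the worker count K are all made up, and any of the other Fs could be swapped in.

from multiprocessing import Pool

reads = ["GATTACAGG", "TACCTAGCA", "GTACATTGC"]   # stand-ins for millions of reads

def trim(read):
    # The function F: map a read to its prefix by dropping the last 3 characters.
    return read[:-3]

if __name__ == "__main__":
    with Pool(processes=6) as pool:          # K = 6 workers
        trimmed = pool.map(trim, reads)      # each worker handles roughly N/K objects
        print(trimmed)                       # ['GATTAC', 'TACCTA', 'GTACAT']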
So these kinds of tasks we think we know how to do in parallel: given a big set of objects and a function that knows how to process a single object, you should be able to think about how to parallelize this, right? I'm being a bit brief, but we can abstractly understand how this is done. So let's say it one more time: we have a big set of objects, we distribute the objects among the computers, we have a function that maps a single object to a result, and we apply that function everywhere in parallel. This part is trivial. Okay, so what if we want to compute the word frequency across all documents, not just a separate set of frequencies for each document? So here, if we have these three documents, now we want to get a single histogram that counts up the number of times the word people appears across all three of them, and the number of times the word government appears across all three of them, and so on. So let's go back to our schematic, the pattern. Well, now we want to compute the word frequency across five million documents, and we can still distribute them among K computers, so far so good. And then for each document, we return a set of word-frequency pairs, and now I've switched the notation here from F to map, since we can consider this a map function; that's the terminology that's used. Okay, but now what do we do? What we will get out here is a set of per-document frequencies, but that's not what we want. We want one big histogram. And to build this one big histogram, we have to make sure that a single computer has access to every occurrence of some particular term. So if the word history appears in some document on this machine, and it appears twice in this document and three times and two times in documents on that machine and so on, we have to group all of those up and send them to a single place, just so we can count them. Okay, so let's look at this again. We distribute the documents across these computers, we apply our map function to each document in order to produce a set of word-frequency pairs, and now we have a big distributed list of these words and word frequencies. And now we wanna get these other workers involved in the process; these guys are gonna be the ones who count the occurrences of a particular word. Okay, so imagine all these little colored lines are occurrences of words, such that all the red lines represent occurrences of a single word, and all the green lines represent a different word, and so on. Well, these guys are gonna get sent to their respective locations, such that this worker is in charge of handling all the occurrences of the blue word, and this worker is in charge of handling all the occurrences of the red word, and so on. Okay, so now instead of lines that go one-to-one, we have lines that go from one computer to a bunch of different computers. We just sort of shuffle the data, spray it out across the network, in order to regroup it. Fine. So now that we have the data grouped the way we want and partitioned the right way, we can apply another function, which I'll call the reduce function, which in this case is very simple: it just counts them. And that allows us to produce our final result, which is, oh, there are four green words and four red words and three blue words and so on.
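Here's what that shuffle-then-reduce step looks like, sketched on a single machine with a dictionary standing in for the network; the map output pairs are made up.

from collections import defaultdict

# Hypothetical map output from several documents: (word, count) pairs.
map_output = [("history", 2), ("people", 1), ("history", 3),
              ("government", 1), ("history", 2), ("people", 4)]

# Shuffle: regroup by key, so that one worker ends up holding every
# occurrence of a given word, wherever it was produced.
groups = defaultdict(list)
for word, count in map_output:
    groups[word].append(count)

# Reduce: very simple in this case, just sum each group.
histogram = {word: sum(counts) for word, counts in groups.items()}
print(histogram)   # {'history': 7, 'people': 5, 'government': 1}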
Okay, so now the schematic looks a little different. We have a sort of two-step process. We start with some large set of objects distributed over a bunch of machines, and we apply some function F to each one of those objects; that's our first step. But then the outputs of those functions all get redistributed across the network to form groups, okay. And the second step is to process each one of those groups. So instead of F, I'll write map here, and I'll write reduce here, okay. And that's exactly what MapReduce does, and we'll explain this in more detail next time, but the key idea here is that the user, the programmer, is going to write these two functions, a map function and a reduce function, which are serial. And what I mean by serial is they're not parallel: you don't have to worry about how to program a distributed cluster within either one of these functions. You just write a function map that takes in an object of some kind and returns some other object, and I'm not using object in the object-oriented sense here; this is in the mathematical sense, just any sort of input producing any kind of output, okay. And then the reduce function takes a set of objects, which I'll notate this way, and returns some other kind of object. And actually, maybe I'll switch colors here: the output of map can be a set of things, and the output of reduce can be a set of things as well. For example, we saw a document returning a set of word-frequency pairs. It's honestly important to understand what's going on here. Okay, so this is an interesting hypothesis, right? That perhaps all distributed algorithms can be expressed as sequences of these two-step operations: a map followed by a reduce, and then maybe more map, reduce, map, reduce as needed, okay. And we'll talk about that in more detail next time. So, last time we talked about parallel processing as a lead-up to MapReduce, and we ended up with this schematic, in the context of this example where we're counting words across a set of documents, which is sort of the canonical example to start thinking about programming in MapReduce. Each one of these vertical black lines represents a document; we split them into smaller sets, send each one of those sets to a separate machine, and then apply our map function to each one of those documents in turn. And the map function, if you recall, took a single document and produced a set of pairs, where each pair was a word along with a count of the number of occurrences of that word in that document. There are variations on this that you can imagine, okay. Now, a given word may have appeared in multiple documents, one here, one on this machine, one on this machine, and so on. And so now we need to group them all together onto a single machine so that we can count them, and that's exactly what this shuffle phase did. Okay, so here I've drawn four different tasks, each looking like it's processing a single group at a time. But, you know, now we should think: how many map tasks are we gonna have, and how many reduce tasks are we gonna have? How many invocations of the map function are there gonna be? Well, we're gonna call it once per document. How many invocations of the reduce function are there gonna be?
Well, it's the number of groups produced by the output of the map function; in this case, it's once per unique word appearing in any document, okay. And so in some sense, the number of machines we need to apply to this problem is maybe kind of predictable in the map phase: it corresponds to the size of the input data set, which we presume to know. But the number of reducers we're gonna need is maybe not known ahead of time, right? It depends on the size of the map output. Here we might be able to reason about it, because we maybe know how many words there are in the English language, and we can assume that with a big enough corpus all of those words will be represented at least once. But in general it's dependent on the output of the map, so you don't really know, okay. And so the only point I wanna make is that we made a decision here to draw this as four different machines. But it may be the same six machines used in the map phase, it may be 1,000 machines, and so on. Nothing's stopping you from sending all of the word occurrences to a single machine and having this one machine process the green group, then the red group, then the blue group, and so on. Or maybe you would do four at a time, because there are four cores in the machine, okay. But that wouldn't be as efficient, perhaps, because it would be doing a lot of serial work. At the other extreme, you might think, well, we're gonna need millions of tasks, let's allocate hundreds of thousands of machines to process them, so that each machine is doing very little work, okay. And that might make sense, but then the trade-off is the cost of spinning up all these machines and preparing them to do the work, okay. So there's a decision to make there, and we'll come back to that. But let's talk a little more about MapReduce itself. I'm belaboring this for a reason, because what I want you to do is start thinking in terms of MapReduce. For every problem you have, think: what if the data set was absolutely enormous, way too big for one machine, how are we gonna split it into pieces? And a very good way of thinking about how to split things into pieces is to think about how you'd write a MapReduce program to do whatever it is you're trying to do. And so this is, yet again, the same example, just drawn in a different way. Here the input is a document ID followed by a value, and the value is the entire text of the document; and the map function, just to make this clear, produces a set of things, not just one thing. Then they're shuffled to produce this: word one with a count of one, word two with a count of one, word three with a count of one, and so on. And then on the other side, what we get is word one with a group of all the occurrences: one, one, one, one, and so on. And then finally the reduce function counts them all up and finds that there are 25 occurrences, okay. I'm probably belaboring this, but if this is completely obvious, you can always fast forward; I guess that's one of the beauties of doing this online. Okay, so fine. So what is MapReduce? Well, it's the programming model we just described, and there's a paper from 2004 on the reading list that describes it. And there are a couple of key motivations for doing this in that paper that I think sometimes get lost when you hear about the popularity of MapReduce today; we'll talk about those two benefits in a moment. So one thing to realize is that MapReduce refers to the abstraction, and it's the name given to it by the authors of this 2004 paper.
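To make the abstraction concrete, here's a toy single-machine driver; this is only a sketch of the contract, since a real implementation distributes each phase across a cluster, but the shape, two serial functions plus a shuffle, is the same. Usage is just run_mapreduce(documents, map_fn, reduce_fn), with functions like the ones we're about to see.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Phase 1: apply the user's serial map function to every input pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle: regroup every intermediate pair by its key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Phase 2: apply the user's serial reduce function to each group.
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output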
Hadoop is an implementation of MapReduce that came a few years later; it was written by some people at Yahoo originally and then became an open source project that is managed by the Apache Software Foundation and has lots of contributors. Okay. So the key idea of MapReduce was really this programming model, as it says here. Now, it had a system that went with it as well, but the programming model, being able to express lots of different tasks and have some implementation automatically turn that into a parallel job, turned out to be pretty powerful, right? This was an attractive way to write parallel programs, again because you didn't actually have to worry about the parallelism: all you did was write a serial map function and a serial reduce function, and the parallelism happened for free. And the evidence that this is not so much about the system as it is about the programming model is that you see MapReduce implementations appear in other contexts. There are people who have implemented MapReduce over GPUs, there's MapReduce on multi-core machines in shared memory, and there are people who have implemented MapReduce on high-performance computing platforms, on groups of mobile phones, and so on. This goes back to one of the motivations for this course, where I want to focus on abstractions where possible, as opposed to tools. We're talking about MapReduce the programming model, and we'll spend a little bit less time on the specific implementation in Hadoop, although you will have a chance on an optional assignment to work with Hadoop directly. Okay, so fine. So what is the data model of MapReduce? It's a bag of key-value pairs; by bag I mean a set that might have duplicates in it, and we've seen that before. The document ID with the value, that's a key-value pair. Sometimes on the input we'll be a little sloppy and not worry about precisely what the key and the value are. For example, if you're just given a record, you can assume that, say, the entire record is the key, okay? Or for a document, we may not have an explicit document ID, but you can assume the URL or the file name or something is the key. On the output of the mapper, though, the distinction between key and value gets really, really important, because the key is what controls the shuffle, as we've seen in that example. Okay. And so the data model here is all about key-value pairs: the input is gonna be a bag of key-value pairs, and the output is gonna be a bag of key-value pairs. And the point is that the bag of key-value pairs can get arbitrarily large, right? We're gonna be able to process this bag no matter how big it gets. There is kind of an implicit assumption that the key and the value themselves are small. And small here doesn't necessarily mean very, very small; it just means that each needs to fit on one machine. There's no support for a value that grows to be terabytes; that's not gonna work. But a document fits on one machine, that's okay; an image fits on one machine, and so on, okay? So fine. So the map phase, as we've said: you provide a map function, the input is an input key and an input value, and the output is a bag of intermediate keys and values. It doesn't have to take a single input and produce a single output; it can produce a set of things. And we saw this with the word count example: a single document came in, but a set of things came out. That's okay.
And then the reduce phase: what you're given is an intermediate key, one instance of the same intermediate key that was produced by the map phase, and a bag of values that were associated with that intermediate key. And the important thing here is that those values may have come from any mapper whatsoever. The grouping into this bag of everything that shares the same intermediate key is handled automatically by the system: the system will group all pairs with the same intermediate key and then pass that bag of values to the reduce function. And the implementation detail of whether this is actually handed to you as a bag, or whether it's an iterator that you can step over, if you're familiar with that term, is implementation-dependent and not important to think about. It's a collection of values. Fine, so here it is all in one slide. The map function takes an in-key and an in-value and produces a list, a bag, of (out-key, intermediate-value) pairs. And reduce takes an out-key and a list of intermediate values and produces a list of out-values. I don't think I like this slide; I think I preferred the earlier one. The one thing I will mention is the terms map and reduce themselves. I tried to motivate map in the last segment, where you can think about converting a TIFF image to a PNG image: you can think of mapping a function over the images, a function that maps every TIFF image into some PNG image. And that's where the term came from; these terms came from the functional programming community. MapReduce doesn't mean precisely the same thing, but it's inspired by that. Okay, all right. So here's maybe the implementation for the example we have. And a lot of times what I like to do is ask you to pause and stare at this code for a little bit and think about what it does. Here we've kind of gone through the example a lot, so I'll reveal the secret, but it's still instructive to work through this for a moment yourself. And in fact, maybe I'll end this segment there, and you can stare at this and make sure that you understand what it does. Okay, last time we talked about MapReduce, gave the abstraction, and went through some examples. And we ended up on this slide, which is maybe the first time we've seen pseudocode, or any kind of code, that actually implements these map and reduce functions. And so I asked you to take a look at this at the end of the last segment. There are a couple of changes, mistakes that I've fixed in this slide, so you can compare the two and see if you can figure out what the mistakes were and why I changed them. Okay, so let's walk through this. What does this code do? Well, as I sort of gave away last time, this implements the word count application that we went through schematically with cartoons, and this is pseudocode that actually implements that, or an example of pseudocode that could implement that. You can't execute this code, since it is just pseudocode. So what are we looking at here? Well, as we said, the data model of MapReduce is key-value pairs, so the input is gonna be a big set of key-value pairs, and the map function is gonna operate on one of these key-value pairs. In this case, the key is the document name and the value is the document contents. So the value could be a big string, right? It might be sort of a PDF file or a text file or a web page or whatever.
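As a stand-in for the slide, here's roughly what that pseudocode looks like rendered as runnable Python; this is a sketch, with yield playing the role of the slide's emit primitive, and the function names are mine.

def map_fn(key, value):
    # key: document name; value: the full text of the document.
    for w in value.split():
        yield (w, 1)           # emit one (word, 1) pair per occurrence

def reduce_fn(key, values):
    # key: a word; values: all the 1s emitted for that word,
    # grouped together by the shuffle.
    result = 0
    for v in values:
        result += v
    yield (key, result)        # e.g. ('history', 25)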
Okay, and so this code is pretty simple. It says: for each word w in the input value, and this assumes that somehow you can iterate over all the words in the text of the document, without really specifying how, emit a key-value pair where the key is the word itself and the value is the number one. Okay. Then the magic shuffle phase takes over and groups all the key-value pairs that share the same key into a single group. And so all the occurrences of a particular word will show up as a group, and that group is represented as a key along with what we've called here an iterator over the intermediate values. And if you're not familiar with the term iterator, you can think of it as just a collection of values. Okay, so as an example: every time the map function sees the word history in any document, it'll produce a pair like this. And then on the reduce side, you'll have the word history here and a sequence of number ones. Okay, and so what does this code do? Well, it initializes a final result to zero, and it says: for each value in this list of intermediate values, add that value to the result. So here we just add them all up. And then finally we emit a final key-value pair, which is the intermediate key, the word itself, and the final result. And so maybe the output here is history, 25. We've walked through this a couple of different times, so I'm hoping it's pretty clear by now. Now, I claim that without changing this reduce function at all, you could make a change to this map function and get a significantly faster algorithm. So I want you to think for a second about how that might be done. The thing to look at here is that, goodness, we're emitting a key-value pair once for every occurrence of a particular word, and each one of those key-value pairs has to be shuffled across the network and sent to the reducer. So if we see the word history many times in a single document, we're going to emit that many key-value pairs for that word, and they're all gonna get grouped up by the shuffle phase. But we have access to the entire document here in this map function. So why not pre-count all the occurrences of each word and produce a different kind of key-value pair, one which means the word history appeared this many times in the particular document I'm processing? Now, in our previous formulation of this problem, we said the word history appeared 25 times across all documents, and I shouldn't use the same number up here, because that's pretty confusing, so let me change it. Here we'll say the word history appears five times in this particular document, and it appears other numbers of times in other documents, okay? So now we have only one key-value pair emerging from this document for the word history, as opposed to five separate ones. And overall, across all the documents, across all the computers being applied to this problem, that's a significant savings, okay? And then double-check that you don't have to change the reduce code here; hopefully it's clear that you don't, because it's adding the whole value into the result either way.
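In the same sketch style as before, the modified map function might look like this; the reduce function is untouched.

from collections import Counter

def map_fn(key, value):
    # Pre-count within the document: emit one (word, count) pair per
    # distinct word, instead of one (word, 1) pair per occurrence.
    # reduce_fn is unchanged, since it already sums whatever values
    # it is handed.
    for w, count in Counter(value.split()).items():
        yield (w, count)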
And so here, instead of adding the number one 25 times on the reduce side, you're adding fewer, larger values: five plus ten plus three plus four and so on, to get 25, okay? So that loop is evaluated fewer times. The reason I wanted to go through that example is to demonstrate two things. One is to get you thinking in terms of MapReduce, thinking about how you can cast a problem as operating on a bunch of chunks, emitting keys to define groups, and then operating on those groups. But the other is that you actually have a lot of control over the performance of these algorithms just by modifying the map and reduce functions. So even though you aren't working on the system internals, and you only have these two points of control, you can get very different algorithms, very different behavior, and different amounts of intermediate results being created, just with these two functions. And so you wanna get a feel not just for how to express something in MapReduce naively, but also for how to do things reasonably efficiently. And in fact, this example demonstrates one of the things you're gonna be looking for: the bottleneck, often, though not always, is the amount of data going across the network. So if you can reduce the amount of output produced by the mappers, especially in terms of the number of key-value pairs, you'll tend to improve performance; again, not always. And we'll see some more examples of this. Okay, so I'm gonna stop there, and the next segment will go through a variation of this problem that has similar characteristics, really just to drive home how to design these MapReduce algorithms on slightly variant problems. Okay, so let's talk about another example that is similar to this very simple word count example, but with a slight change. Now we wanna know something about the makeup of a corpus of documents, a set of documents, and try to understand the characteristics of the word lengths. So instead of a histogram of word usage, we're going to group things by the length of the word. We wanna know how many words have ten or more characters, and so on. All right, so here we might group words into big words, medium words, small words, and tiny words, where the big words are everything that's ten-plus letters, and the medium words, in red here, are everything from five to nine letters, and so on. And you can define your own sort of bucketing scheme, or not bucket them at all and use the exact lengths. Okay, so I'm hoping that some of you are already seeing how this is an essentially trivial variant of what we just did. And this is one of the points I wanna make: you'll see these patterns in designing MapReduce algorithms come up again and again, and once you get a feel for them, it won't be the case that every new problem looks different. But the other thing that's nice about the next few slides, I think, is that they show a little bit more detail in how these things get broken up. Before, we sort of assumed that every document was a small thing, but it's not impossible that you might have one document that's very, very large.
This typically wouldn't happen with a document for various reasons, but imagine these weren't documents, but just big data sets of words. An individual data set may be so large that it can't be processed by a single map function at a time. So the question is: are we stuck here? Is the MapReduce framework broken, and is it going to crash? The answer is no. What happens is that when you load this large document into the system supporting MapReduce, and we'll talk a little more about what this lower-level system is, the file system underneath implementations of MapReduce, when you do that loading, the file, the data set, will automatically be split into chunks. So we can pretend that this document was so large that it needed to be split into chunks: chunk one is this top part and chunk two is this bottom part. Now, if a small document is under the chunk size, it won't get split, but a large document will. Okay, and this could have happened with the word count example too; it's not something specific to word length, obviously, but it's another twist that we're exploring. Okay, so fine. So now this top chunk gets assigned to map task one, and it produces this little histogram of the counts of yellow words, red words, blue words, and pink words. And you should think about how the code might look if you had to write this (there's a sketch below). You would iterate over all the words in the document, just like we did before, but now instead of emitting the word itself as the key, you would compute the word's length and put in a case statement to figure out what color, what bucket, it belongs to, and emit that. Okay, fine. So the output is these four key-value pairs. And for chunk two, we do the same thing and produce a different set of four key-value pairs. Then in the shuffle step, all the yellow key-value pairs get grouped together, and there's two of them, and all the red ones get grouped together, and so on. And then in the reduce phase, you add these two numbers together to produce 37, okay. So the structure here is really identical to word count; it's basically just a change to the map function. In fact, in this case you could literally use the exact same reduce function: for every key, add up all the contributions. You wouldn't need to change reduce at all. And this is something else you'll see: certain reduce functions are more general than others, and you'll reuse them time and again. Counting things and adding things up is pretty common in these MapReduce algorithms, and so general reduce functions that add things and count things come up time and again.
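In the same sketch style, the word-length map function might look like this; the slide only pins down the big and medium buckets, so the cutoffs for small and tiny below are assumptions.

def map_fn(key, value):
    # key: document (or chunk) name; value: its text.
    # Emit a bucket name per word; big and medium match the slide,
    # while the small/tiny boundary is an assumed cutoff.
    for w in value.split():
        n = len(w)
        if n >= 10:
            yield ("big", 1)
        elif n >= 5:
            yield ("medium", 1)
        elif n >= 2:
            yield ("small", 1)
        else:
            yield ("tiny", 1)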
Okay, fine. So: word count, generally the canonical place to start when thinking about MapReduce, and word length, a very minor variation on that. So let's think of another minor variation, and this one is arguably even simpler than word count. Here we wanna build an inverted index. What's an inverted index? When you have a corpus of documents, you can presumably efficiently access any given document by its name, right? If you wanna look up a URL on the web, you can do so. But for a search engine, even a very primitive search engine, you might wanna look up the documents that contain a particular word. And so building an index to support word lookup, to provide a list of document IDs that contain that word, is called an inverted index. And it's one of the primitive steps in building any kind of text retrieval system, okay? So now imagine we just have tweets instead of documents. The input here is that the keys are these tweet IDs that I've invented, and the value is the text of the tweet itself. And the desired output is a word along with a collection of tweet IDs. Okay. So how do we do this? Well, the code is actually simpler than it was before. Let's think about it for a moment. In the map phase, instead of emitting the pair (pancake, 1) for an occurrence, we won't do that. Instead, we'll emit pancake along with the document ID itself, right? And all of these pairs will be produced. So the map task that processes tweet one will emit (pancake, tweet one), (love, tweet one), and so on. Okay. Further, there's an optimization here, and I guess I should have had an example of this: if you see the word pancake twice in the same tweet, do you need to emit the key-value pair twice? Probably not, to build this index, because all you're trying to record is that the tweet contains the word pancake, not how many times it appears. So fine. These key-value pairs get shuffled across the network, and now the reduce task, what does it do? Well, it's going to get a key, pancakes, which shouldn't have an s on it, I guess, and it's going to have an iterator over all the document IDs that contain pancake, which in this case is tweet one and tweet two. So do we need to do any processing on this key and its group of values? The answer in this case is no. The reduce function is essentially non-existent; there's nothing to do, right? The group is exactly what you want in this case. And this pattern actually shows up fairly often as well, where you do want the map and you do want the shuffle phase to do the grouping, but performing the grouping is all you wanted. You don't actually care about the reduce function: you're not counting these things, you're not adding them up in any way, you're not doing any processing on the tweets, you just want to emit the group itself. And that's perfectly fine, because this group of values is perfectly serviceable as a value itself. If you are used to, say, relational databases, nested structures, collections of values within a single cell of a table, within a single row, are generally disallowed; this is what first normal form is about, if you're familiar with that. But here we don't care: it's any key and any value. A key can have substructure, and a value can have substructure.
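A sketch of the inverted index in the same style; the set() here is the optimization just mentioned, deduplicating repeated words within one tweet.

def map_fn(tweet_id, text):
    # Emit (word, tweet_id). A set dedupes repeated words within one
    # tweet, since the index only records containment, not counts.
    for w in set(text.split()):
        yield (w, tweet_id)

def reduce_fn(word, tweet_ids):
    # Nothing to compute: the group produced by the shuffle *is* the
    # inverted index entry for this word.
    yield (word, tweet_ids)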
Okay, fine, so that's how to build an inverted index in MapReduce. Let me stop there, and next time we'll talk about a relational join example: how to implement a join from a relational database as a MapReduce program. So let's look at an example that doesn't involve processing a corpus of documents. Let's think about how to implement the join operation, which we learned from the relational algebra, in MapReduce. Okay, so recall that you're given two relations, where a relation is a set of tuples, and you're trying to find every record in one relation that corresponds to a record in the other relation, right? And here we're gonna join on SSN equal to emp SSN. And actually, it wasn't quite correct for me to leave this unspecified: there's a notion of a natural join that would match things up if the field names matched, but here they don't, and so I need to be explicit about what I'm joining on. Okay, and so the output here is these three records: this record joins with these two, and this record joins with this one. Fine. Now imagine that both of these relations are huge. The map phase here is going to process every tuple, and we have a problem right off the bat: join is a binary operation, right? It has two input relations, the left relation and the right relation, and you're trying to find the tuples in one that match tuples in the other. But MapReduce is a unary operation; it processes a single data set. So at first approximation, you can't express join in MapReduce, period. But that's okay; there's a bit of a trick here. And the trick is: look, it's okay to imagine the data set here as just one big jumble of tuples. It doesn't matter what table they came from; we just lump them all together into a single data set called tuples. Okay, and that'll be what we process with MapReduce. And so here I'm asking the question: what is this label for? Well, this is a label that we've attached to every tuple so that we can know where it came from, and we'll see how that's used later. Now, I'm not being specific about where you get this label or how you attach it in MapReduce, but I want you to see logically that it's necessary. And in practice, it's not that difficult, right? Because if, for example, you have distributed files from this directory representing all the employee chunks, and you've got distributed files in this directory representing the department chunks, you can look at the file name to tell which table a tuple came from. And so in the map function that you write, you can determine: aha, this file has the employee relation name in its file path, and I'm gonna go ahead and attach that as part of the record. Okay, and so we could write pseudocode for this, and in fact you might, as part of the upcoming assignment. All right, so fine. So what does the map phase of this relational join look like? Well, for every record on the input, you're going to produce a key-value pair; we know we have to produce key-value pairs, because that's how MapReduce works. So what's the key gonna be? The key is going to be the join attribute, the attribute that you're joining on, and the value is gonna be sort of everything else, the rest of the tuple. We could actually get away with removing the join attribute itself from the value to save some space, but typically you wouldn't bother, okay? So fine: given a tuple that looks like this, with three fields, we produce a key of 7777777 and so on, and the value is the entire tuple, and so on. And again, notice that the tuples from both relations are all lumped into the same input, all right? Maybe I'm belaboring this, but I wanna make sure it's clear. Okay, so so far we've done two tricks: one is lump everything together, and two is produce a key-value pair where the key is the join attribute. Now, what happens in the magic shuffle phase? Well, everything with the same key gets lumped together, as we talked about, and so now you get one reducer invocation for every unique key.
So in this case, there'd be one for the 999 key and so on, and one for the 777 key and so on. And the list of values associated with a key will be all the tuples that share that join key. Now, it doesn't matter what relation they came from; they'll all be in this list. So in this case, we get one tuple from the employee relation and one tuple from the department relation, and here we get one tuple from the employee relation and two tuples from the department relation, right? And so now, in the context of a single computer, we have everything we need to compute the join. And further, each one of these reduce invocations can be done on a separate machine, okay? And that's how we scale. So now this reduce function, which the programmer would have to write if implementing a join this way, has to take this key and all these tuples and produce the joined tuples, right? Where these two attributes came from employee and these two attributes came from department, and same thing here: employee, department. Now, if you want to think about what kind of operation you need to implement in this reduce function: well, you've got a set of tuples from one relation, and you need to associate each one with every possible tuple from the other relation, right? Because we know they all join; by definition, they all have the same key, they all match. And if you pair every member of a set with every possible member of another set, if you remember, that's a cross-product operation. And so here again, we see relational algebra popping up in a different context, okay? So that I don't confuse you too much: the overall operation we're trying to do is implement a parallel join. It just happens that locally, right here inside one reduce function, we recognize, aha, that's another relational algebra operator, a cross-product. But this is more for abstraction purposes than for algorithm purposes; I just want to point that out, all right? So again, you put on your relational-algebra-colored glasses and these operators start to pop up everywhere, all right? So fine, let's do this one more time, just to make sure it's clear. Now I'm giving you two relations, order and line item, and they have these fields: an order ID, an account, and a date; and here an order ID, an item ID, and a quantity. And we're gonna join on order ID. So what does the map phase look like? Well, once again, the key will be the join key, in this case order ID, and the value will be this pair: the relation name along with the original tuple. Before, we just sort of lumped these together in one tuple, and it doesn't matter very much; here I've structured it slightly differently, but the point is that all the information is here. The key must be the join key, and the value is the tuple, along with some sort of indicator of which relation it came from. So maybe I'll ask real quick: why do I need that indicator of the relation? Well, in the reduce phase, I wouldn't be able to produce the joined tuples properly if I didn't know which tuples went with which relations. If I just had a big bundle of tuples, I couldn't really figure it out very easily, and that would be a problem.
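Here's a sketch of both phases; the field layouts of the two relations are assumptions of mine, since the slide only shows them schematically.

def map_fn(relation_name, tuple_):
    # Assume (hypothetically) Employee tuples look like (name, ssn) and
    # Department tuples look like (emp_ssn, dept_name): the join attribute
    # is field 1 in one relation and field 0 in the other.
    key = tuple_[1] if relation_name == "Employee" else tuple_[0]
    # Tag the tuple with its relation of origin so reduce can tell them apart.
    yield (key, (relation_name, tuple_))

def reduce_fn(key, tagged_tuples):
    # Every tuple in this group shares the same join key; split the group
    # back into its two sides and emit their cross-product.
    employees   = [t for name, t in tagged_tuples if name == "Employee"]
    departments = [t for name, t in tagged_tuples if name == "Department"]
    for e in employees:
        for d in departments:
            yield e + d    # the joined tuple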
Theoretically, you might be able to avoid tagging explicitly with employee and department, because you could deduce it: the employee table is the one that has a string first, followed by the join key, and the department relation has the join key followed by some other string. But it's a little dangerous to rely on that kind of information, because you could be joining two tables that have exactly the same schema, and then there wouldn't be any obvious way to figure it out. So an explicit tag is a little bit safer here. All right, fine. So that's the map phase: we lump all the records together and produce key-value pairs, where the key is the join attribute from the corresponding relation and the value is the entire tuple, tagged with the relation name. Okay, and there it is right there on the slide. So what does the reducer look like? Well, now we've got all the tuples that share the same key together, and it will produce the joined tuples: it will join this order tuple with both of these line item tuples to produce these two joined tuples. Okay, fine. So now let's go on to a different example. Here, maybe we're getting started with analyzing a social network. So we have a graph where every edge represents, let's say, a friend relationship, or if you're thinking about Twitter, you could have this be a follows relationship. And so the input here is a set of edges with the semantics of: Jim is friends with Sue, or follows Sue, and Sue is friends with Jim, or follows Jim, and so on. And one point I'm making here is that you'll notice that if Lynn points to Joe, then Joe points to Lynn. So there's a symmetric relationship here, okay? And I've done that for simplicity's sake, to avoid the confusion that can result from thinking about undirected graphs versus directed graphs. Fine, so anytime you see an edge going one way, you'll see the other edge coming back. Now, before I say what we're gonna do, let me make sure the task is clear. We're gonna count the friends. We want something very simple: we just wanna say how many friends Jim has, and how many friends Sue has, and so on. And so the desired output here is Jim with three, because Jim has Sue as a friend and Jim has Kai as a friend and Jim has Lynn as a friend. And we want Lynn equals two, because Lynn has two friends, Joe and Jim; and Kai has just one, right here, Jim; and Joe has just one, which is right here, Lynn, okay? So this should already look like something we've done before. It happens to be social network analysis, and the records happen to be these pairs of people, but the algorithm you're gonna use should start to stand out to you, and maybe you'll see why in a second if it's not clear. So in the map phase, how are we gonna do this? Well, for every record, produce a key-value pair where the key is the name of the person on the left-hand side, and the value is just the number one, indicating that we've encountered one friend associated with, say, Jim in this case, all right? So this record gets turned into (Jim, one), and this record gets turned into (Sue, one), and this record gets turned into (Lynn, one), and so on, okay?
And then, through the magic of MapReduce, the shuffle phase produces this intermediate result, where we have the key Jim associated with a list of occurrences, one, one and one, and Lynn associated with two occurrences, and so on. And then the reduce phase simply adds up all these occurrences. So what does this remind you of? Well, it's a lot like the word count example, right? Except instead of documents producing words and counts, it's even simpler: every record just produces a single count for the person on the left-hand side of that record, okay? And after that, the reducer is literally exactly the same: it adds up all the occurrences and produces a single number, okay? So in the next segment, we'll talk about something a little trickier, which is implementing matrix multiplication in MapReduce. All right, okay, so let's look at a simple matrix multiplication algorithm in MapReduce. Before we get there, just to remind you how to think about matrix multiplication: we've got a matrix with four columns and two rows, multiplied by a matrix with two columns and four rows, and the output is gonna have the number of rows of the first matrix and the number of columns of the second matrix, so it's two by two in this case, okay? And again, just to refresh your memory, what is this result? Well, it's the first row of the first matrix dotted with the first column of the second matrix: one times one, plus three times four, plus four times negative three, plus negative two times zero, and that equals one, okay? And so on. So row one and column one dotted together give you this position, row two and column one give you this position, and so on. All right, I'm hoping that was intensely boring. So in MapReduce, how do we wanna do this? Well, what we're given here is two matrices represented in a sparse matrix format, and the sparse matrix format looks like this: row ID, column ID, and the value. The reason I call this a sparse matrix format is that it would be inefficient to represent a very, very large matrix this way if you had a value for every position, right? If you think about a multi-dimensional array in memory, you don't have to be explicit about the i and j coordinates; you only have to provide the values. But if many of the values are zero, then in this representation I can leave them out altogether: anytime a value is zero, just remove that tuple. Okay, so we're given two of these sparse matrices represented as sets of tuples, and we're gonna do the same trick. Matrix multiply is a binary operation, so we need to lump the two inputs together and tag each tuple with its source, and then we apply the following kind of trick. In the map phase, for every element (i, j) of A, emit several things: emit a tuple where the key equals (i, k), and I'll tell you what k is in a second, and the value equals A[i,j]. We're gonna emit one key-value pair of this form for every k in one to N, which is what I've written down here. Now, what is N? Well, N is the number of columns in B, the right-hand matrix: A is an L by M matrix and B is an M by N matrix. So what is this saying? For every column k of B, emit a tuple with key (i, k) and value A[i,j].
So I have a diagram on the next slide that explains this, but what you're doing is replicating this value to every column of B, okay? And then for B, you do the same kind of thing: the key is gonna be (i, k) and the value is B[j,k], and you're gonna emit one of these key-value pairs for every i, the upside-down A means "for all", for all i in one to L, where L is the number of rows in A, right? So you have to replicate the values of B to all the corresponding rows of A, and you have to replicate the values of A to all the corresponding columns of B, okay? And then finally, in the reduce phase, you can simply do the dot product and produce the output. Maybe that's hard to think about written out in notation like that, so think about it diagrammatically. Actually, let me back up one step. The first thing to recognize is that there's gonna be one reducer per output cell in this algorithm. So here we're gonna have six reducers, and if you imagine really, really large matrices, which is why we're playing this game, say a matrix with dimensions 10,000 by 10,000, then this starts to make more sense. So: one reducer per cell in the output matrix, okay? And think about what data each one needs in order to compute its answer. Well, the reducer for cell (1,1) of the output needs all the values from row one of A, and it needs all the values from the first column of B, right? All of those need to be sent here. Fine, so maybe I'll write that real quick: we need row one from A and we need column one from B. Now, let's think about this second position, row one, column two. Well, here we need row one from A and column two from B, right? That's fine, but the problem is we don't have the data represented in terms of rows and columns in the input; we have every individual cell. So we have to figure out: where should this value A[1,1] be sent? Well, it needs to be sent to everybody that might need it, which means it needs to be sent here, because this cell uses row one from A, and the second position also involves row one from A. So this guy needs to be sent to two places, which is what I was trying to draw here with these colors. Let me draw some more arrows and see if it gets too cluttered, which I'm sure it will, but we'll try it anyway. It needs to be sent to both of those locations: for every column of B, this value needs to be sent once. And similarly for this guy, right? To all the reducers that need the data for row one, you send it. And the reason I'm belaboring this is that it's kind of a nice trick that MapReduce can do. Remember, you can replicate: you can send a single value out of the mapper to many places. And when I say send to many places, I don't mean literally write it on the wire and send a packet across the network many times; what I mean is attach a value to multiple keys, and then every individual key will go to a different place, a different reducer, okay? And through this trick, we can arrange for matrix multiply to occur. And this arguably scales pretty well, right? We've done some replication, but that replication is kind of necessary, and this whole thing can happen in parallel.
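Here's the whole algorithm as a single-machine sketch, with a dictionary simulating the shuffle and tiny made-up 2x2 matrices to keep the trace short. One detail worth flagging: each emitted value also carries j, the shared dimension index, so the reducer can pair a[i,j] with the matching b[j,k]; the spoken description glosses over this, so treat it as my addition.

from collections import defaultdict

# A (L x M) and B (M x N) in sparse (row, col, value) form; values are made up.
A = [(0, 0, 1), (0, 1, 3), (1, 0, 2), (1, 1, -1)]   # the matrix [[1, 3], [2, -1]]
B = [(0, 0, 1), (0, 1, 5), (1, 0, 4), (1, 1, 0)]    # the matrix [[1, 5], [4, 0]]
L, M, N = 2, 2, 2

def map_A(i, j, a):
    # Replicate a[i,j] to every output column k, keyed by output cell (i, k).
    for k in range(N):
        yield ((i, k), ("A", j, a))

def map_B(j, k, b):
    # Replicate b[j,k] to every output row i.
    for i in range(L):
        yield ((i, k), ("B", j, b))

# Shuffle: group everything destined for the same output cell (i, k).
groups = defaultdict(list)
for i, j, a in A:
    for key, val in map_A(i, j, a):
        groups[key].append(val)
for j, k, b in B:
    for key, val in map_B(j, k, b):
        groups[key].append(val)

# Reduce: one invocation per output cell, computing the dot product.
C = {}
for (i, k), vals in groups.items():
    a_vals = {j: v for tag, j, v in vals if tag == "A"}
    b_vals = {j: v for tag, j, v in vals if tag == "B"}
    C[(i, k)] = sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
print(C)   # {(0, 0): 13, (0, 1): 5, (1, 0): -2, (1, 1): 10}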
And then finally, again, each reducer produces the sum over j of A[i, j] times B[j, k]; that is, the dot product of a row of A and a column of B. Okay, so let's talk a little bit about what kind of systems MapReduce is deployed on. I won't spend too much time on the system internals, since this is a data science course and not a distributed systems course, but it's good to have an intuition for what's going on under the hood. So there are three types of systems, or architectures, to be aware of: shared memory, shared disk, and shared nothing. In these diagrams, the cylinders are the disks, the rectangles are the memory, and the circles are the processors, okay? Shared memory means that every processor has access to all of the memory and all of the disk. This is what you have when you have a laptop with four cores, or nowadays six or twelve cores; that's the architecture being used. Shared disk is somewhat less common, at least in the contexts we're talking about, although certainly common enough overall: individual machines all access a shared file system. Now, that setup is very common, but using it for parallel analytics is the domain of high-end commercial databases. So your Oracles and your IBMs will often use a shared disk architecture, okay? And then shared nothing is really what we're focusing on here. This is what MapReduce is designed for, and increasingly what parallel databases are designed for as well, okay? Shared nothing means individual machines that are only connected by a network, okay? No shared memory, no shared disk, fine. So the argument is that only the shared nothing architecture can scale to thousands of computers and beyond, okay? Because the sharing of memory or the sharing of disk eventually becomes a bottleneck and limits how many computers you can attach to the same logical or physical device. Okay, and so learning how to program these massive shared nothing clusters is what MapReduce and parallel databases are all about. Shared memory machines are perhaps the easiest to program, but are conventionally assumed to be pretty expensive. I should point out that the costs are dropping fairly quickly, so it's getting more and more feasible to buy a pretty beefy machine with lots of main memory and lots of cores, and your problem might fit inside that one machine. So when you see people deploying Hadoop and MapReduce on fairly small clusters, on the order of say 10 nodes, you should check whether the data sizes they're processing are actually all that large, right? It could be that the data they're processing would fit in main memory on a similarly priced amount of hardware, okay? So, fine. Hadoop and MapReduce are designed for really, really large clusters, okay? All right, so this is the context we're in: a large number of commodity servers connected by a high-speed commodity network. And here, think about a data center: there's a rack that holds a number of servers, and there's a data center that holds many racks. This is how you organize your thousands or tens of thousands of computers. Okay, all right. So you're looking for massive-scale parallelism, for jobs that will run for many hours on thousands or tens of thousands of servers. That's really the context we're in.
So when you're in this context, an issue comes up that does not come up all that often in a much smaller context: failures, right? If you're running a job for a long time on thousands of computers, the chance of something going wrong during that job becomes essentially a 100% probability, right? There's going to be something that fails. And so your system for doing this sort of analytics has to tolerate this kind of failure. You can't roll back to the beginning and just restart every time a failure occurs, or you would never get anything done, okay? So even if the mean time between failures for, say, a disk is a year, if you've got 10,000 disks, spread across a thousand servers or any combination thereof, you're going to start to see failures roughly once per hour: there are about 8,760 hours in a year, so 10,000 disks each failing about once a year works out to more than one failure per hour. You can look up the mean time between failures and actually do the math, and there are a couple of nice papers out there that I'll try to put in the readings if I remember. Okay. So failures are what we're concerned about here. So that's the hardware. Popping up the stack one level is the distributed file system. And you might see HDFS, too, which is the Hadoop Distributed File System. Remember, the context here is that MapReduce was proposed in a paper in 2004 by Google, and Hadoop is the implementation of the ideas in that paper, okay? We can almost use the two interchangeably, because the actual implementation at Google has certainly evolved since that paper and is not completely known, right? So when people talk about MapReduce, they're typically talking about Hadoop or other implementations of the programming model. Some implementations have nothing to do with scale-out, but for shared-nothing clusters, MapReduce can be assumed to be synonymous with Hadoop. Okay. So this is a file system for very large files, and the idea here is that a single file on your own computer can be manipulated as a single unit, but if you're going to take a very, very large file and put it on a file system that spans a cluster of machines, then there has to be some layer of logic that knows how to split that file into pieces, put those pieces on different machines, and keep track of where they are. That's what this distributed file system software does. So as you're uploading data to the cluster, each file is partitioned into chunks of, say, typically 64 megabytes, although this is configurable. And each chunk, and this is critical, is replicated several times, right? So there's not just one copy of the chunk; there are multiple copies on different machines. Why? Because if one of them goes down, you want to have access to the other copies, okay? And you want to make sure these are spread across different racks, so that in case an entire rack of computers goes dark, you still have another copy. Okay. DFS is the concept, and GFS and HDFS are the implementations. All right, so here are the phases of MapReduce, in a little more detail than the abstract phases we were talking about when we discussed the programming model. Okay, so there's a file split here that's read from HDFS. And remember, HDFS means it's replicated; there are multiple copies of every chunk. And there's a unit of code called the record reader that breaks that chunk into individual records. I'm using chunk and split sort of synonymously.
I'm not a big fan of the term split because it sounds too much like a verb to me. The record reader parses that split, or chunk, into individual records. Then the user's map function is called on each individual record. And then there's a step called combine that we haven't talked about, and that I'll talk about in a moment, next actually. Okay, and then the output of these phases is written out to local storage on that node, as we said. Then these regions in local storage, one per key, are pulled across the network by the reduce phase. Then all the regions from all the different map tasks that correspond to the same key are sorted together, in parallel. And finally, the user's reduce function can be called to produce whatever output it produces. And then the output of that step is written back out to HDFS so that it's replicated. So again, if something goes wrong in the map phase, you have to rerun the mapper. If something goes wrong in the reduce phase, you don't have to rerun the map phase; you can pull its output from local storage again. And if something goes wrong in the overall job, you know you're safe because you don't lose data, since HDFS is replicated. And if you lose a reducer and you lose the corresponding mappers, that's fine; you can rerun whatever you need to rerun, fine. So the point is that you're guaranteed fault tolerance during job execution. And so let's talk real briefly about the combiner. To think about why we need a combiner, go back to this word count example that we began with. In the earliest version of this, for each word, we just produced the number one. So if the word w1 appeared twice, you're going to get one key-value pair (w1, 1) and another occurrence of (w1, 1), and you're going to send both of these guys across the network, to be sorted in parallel and grouped in order to be processed by the reduce. Well, that's sort of wasteful. You'd like to combine these into a single record (w1, 2) and just send that, because it's smaller. Well, you could rewrite your map function to do this, but it's such a common need that you have this capability called a combiner. A combiner identifies key-value pairs with the same key and lumps them together before sending them over to the reduce side. So it just saves network traffic. In many cases, the combiner function can be literally the same function as the reduce function; it all works out. What needs to be true for this to work is that the function you're applying needs to be associative and commutative, but I'm not going to say much more about that. Okay, so here's what it looks like in pseudo code. We saw the map function earlier, where we were emitting a key-value pair of a word and a count, and we saw the reduce function earlier, but now we're adding a combiner function that has the same type signature as the reducer. And again, often it can literally be the same implementation as the reducer. The only point is that it's applied before sending data across the network.
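Here's a small Python sketch of that pseudo code; the emit-as-yield style and the single simulated map task are just for illustration, not any particular Hadoop API:

    from collections import defaultdict

    def map_fn(doc_id, text):
        # Emit (word, 1) for every word, as in the earliest version.
        for word in text.split():
            yield (word, 1)

    def combine_fn(word, counts):
        # Same type signature as the reducer, but it runs locally on the
        # map side, so ('w1', 1) and ('w1', 1) become one ('w1', 2) record
        # before anything crosses the network.
        yield (word, sum(counts))

    def reduce_fn(word, counts):
        yield (word, sum(counts))

    # Simulate one map task plus its combiner.
    local = defaultdict(list)
    for word, count in map_fn(0, "to be or not to be"):
        local[word].append(count)

    to_network = [pair for w, c in local.items() for pair in combine_fn(w, c)]
    print(to_network)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]

Sum works here precisely because it's associative and commutative. Fine. So here's a summary of a Hadoop job that I like a lot; this is from Huy Vo, who is now at NYU Poly, I believe. The data begins on HDFS as these chunks, and the input partitions go to map tasks.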
Now these are again not invocations of the map function; these are entire tasks. Each individual map invocation may produce multiple key-value pairs, and regardless, the map task almost certainly does, right? So it's going to produce these local, colored key regions, and the regions are going to be pulled across the network to the reduce servers. And here in this example, there are only two reduce servers. Now, before, we gave this example as if all the blue ones go together and all the red ones go together. That was a bit of a simplification. What actually happens is that if you only have two servers, all the hundreds of possible keys need to be mapped to just those two servers, so you're definitely going to get a mix going to the same place. And this is where we've been a little bit glib up until now. We've said that you specify the key and it hashes to a particular reducer, which is true logically, but before that, you have to actually get it to a machine, where lots of reduce tasks, or lots of reduce invocations, might be running. Okay. And so in this case, the blue guys and the red guys both ended up on the same machine, and the green guys and the orange guys both ended up on the same machine. And then this parallel sort manages that, right? It puts all the red things together and it puts all the blue things together. And then for each individual color, one reduce invocation is called, and it produces the output partition, and that output is then written back out to HDFS. Okay. Let me stop there, and I'll pick up and talk a little bit about parallel databases and how they do query processing, with the point being that it looks a lot like MapReduce. Okay, so let's talk a little bit about other large-scale data processing systems besides MapReduce. And as a step toward that, let's think about the design space of possibilities here. This is a breakdown that was proposed by Michael Isard, who developed a system called Dryad at Microsoft, which is a very nice system with the same sort of motivations as MapReduce. Okay. And so he divided this space along these axes. One axis is whether you're worried about low latency, very interactive speeds, quick turnaround time, versus maximizing throughput: massive batch jobs operating on thousands and thousands of computers at once. Another axis is whether it's in a private data center or scaled out widely across the internet. And then maybe the third axis is data parallel versus the shared memory model that we talked about. So the area we're mostly concerned with is going to be here, which is what we're talking about currently. And then in a couple of segments, we're going to talk about these low-latency, smaller operations, which you can think of as the NoSQL systems. Okay. And then maybe where Michael placed older relational databases, although he didn't label it the same way I have, is down here in this quadrant, mostly in the shared memory space and low latency. I say older databases to point out that not all relational databases operate in that space; most parallel databases are data parallel as well. And so MySQL and PostgreSQL, if you're familiar with those, are probably in this space: shared memory, shared disk.
All right, and here HPC means high performance computing, where it's a private data center with big, massive shared memory, but it's a batch job submission system, right? You run your big compute-intensive job by submitting it to the machine; it processes, and eventually it returns the results. And then I'm not going to talk too much about this, but this notion of grid computing was really focused on connecting up clusters of computers from different universities and letting them all talk to each other, and so that's why he's pushed it up along this other axis. Increasingly, you're also seeing systems pushed up along this axis at internet scale: planetary, you know, distributed hash tables with different kinds of layers for guarantees or different kinds of semantics. The fairly recent Spanner system from Google is a nice example of this. Okay. So to wrap up what we talked about last time: large-scale data processing. Many tasks need to process big data and produce big data, and so you want to use thousands of CPUs across hundreds or thousands of computers to solve these problems, but this needs to be much easier. And there are such things as parallel databases. We talked about databases and extolled the virtues of programming in that model, and parallel databases exist, but they're often expensive (well, they're almost exclusively expensive) and they're difficult to set up. And it's actually not totally clear that many of the parallel databases scale to hundreds or thousands of machines. Okay. And so MapReduce came around as a bit of a response to this scenario. It's more of a lightweight framework featuring automatic parallelization and distribution, as we've been talking about, and featuring fault tolerance. And I mention a couple of other, perhaps less important, things here: the I/O scheduling, the status monitoring. So it really stripped everything down to just parallel processing. Not all the features that parallel databases offer, just parallel processing, with the added benefit of fault tolerance. Okay. And this really seemed to scratch an itch with people. I'll argue here, and I'll probably mention this again at some point, that it's not totally clear to me that MapReduce would have been quite so popular had a parallel open-source relational database product been available. But the open-source databases were all not parallel; in fact, they were even single-threaded for query processing, so only a single thread worked on an individual query at a time. But that's speculation. Well, actually, I have a little bit of evidence for it that I'll lay out for you in a bit. Okay. So now, and I guess I'm building up to that argument, I want to talk about parallel databases and how they work, and hopefully show where there are similarities and where there are differences. Okay. So recall that a key idea of relational databases was this notion of a relational algebra, where you could write plans like this, and that the top-level language called SQL was the most common way of producing a relational algebra plan, right? You wrote a query in SQL, say a join of orders and items with a filter on price, and it was automatically turned into a relational algebra plan by the system: a select feeding a join feeding a project. Okay. So this is kind of thrown out the window with MapReduce, arguably in favor of flexibility and providing the programmer with more control. But let's go back to this model for a bit. Fine.
So now we want to evaluate these queries, but we want to do it in parallel. And there are two different terms that I want you to be familiar with: distributed query processing and parallel query processing. They're both ways of taking advantage of more computing resources for the same query, but they behave a little differently. In a distributed query, what you're doing is taking a single large table and distributing it across a cluster, just like we talked about, and then breaking your query into individual pieces to operate on each of those partitions of the file. Okay. So this sounds like basically the same thing as MapReduce, and it is, except for the fact that all the results of those individual pieces are sent back to the head node, a single server, to finish processing. So for example, suppose you're doing a count, right? You want to count all the records that match some criteria. Well, if you have a very large file that's split across several machines, these distributed query systems (Microsoft SQL Server in particular is an example) are smart enough to break the query into a bunch of little pieces and run each of those pieces in parallel. But as they start to stream tuples out to be counted, they'll send them all back to the head node. Actually, I guess that may not be true for a count, because the system may be smart enough to figure out that you can count things in parallel and add the counts up. But it's not hard to construct a query where you have this bottleneck of sending everything back to a single server. So essentially, you can think of it as having the map phase but not really the reduce phase. Now, in a parallel query, every individual operator in the relational algebra is implemented in parallel. So when you're doing joins, you're doing joins across a bunch of nodes; when you're doing groupings, you're doing groupings across a bunch of nodes. And we've seen how to implement relational join in MapReduce, and it's not too far off from how it's actually implemented inside databases. Okay, so if we know how to implement join, which is usually the harder one (trust me that you can implement the other operators the same way), well, now we have a way to do parallel query processing with the relational algebra. So why not do that? Well, the answer is people do do that, and we'll come back to that in one second. Okay, so for a distributed query, and I guess I was waving my hands a second ago trying to explain this when it was on the next slide, you can imagine constructing a view. We've talked about views; if you don't recall, a view is a named query that can then be accessed as a single table. So we say that the sales table is really the union of a bunch of smaller sales tables, one for each month: something like CREATE VIEW sales AS SELECT * FROM sales_january UNION ALL SELECT * FROM sales_february, and so on. And in particular, you could put each one of these monthly sales tables on a different disk, or even a different server altogether, right? And then the user who's querying the sales table doesn't have to care about the fact that this is actually a distributed table. They don't have to go gather up all the results from January, and then all the results from February, and then March, and put them all together; that's done automatically by the system. And this is the CREATE TABLE statement, which we didn't talk about, for constructing the sales table for an individual month like March.
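As a toy illustration of the distributed-query pattern (the monthly partitions, the server names, and the two-line "query" are all made up here, not any particular product), here's a Python sketch; note how everything funnels back to one coordinator at the end:

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical monthly partitions of the sales table, one per server.
    partitions = {
        "server_jan": [{"item": "a", "amount": 10}, {"item": "b", "amount": 250}],
        "server_feb": [{"item": "c", "amount": 75}, {"item": "d", "amount": 300}],
        "server_mar": [{"item": "e", "amount": 125}],
    }

    def local_scan(rows, predicate):
        # Each server runs its piece of the query in parallel...
        return [r for r in rows if predicate(r)]

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(local_scan, rows, lambda r: r["amount"] > 100)
                   for rows in partitions.values()]
        # ...but every matching tuple is shipped back to this one head node,
        # which finishes the query (here, a count) by itself.
        matches = [r for f in futures for r in f.result()]

    print(len(matches))  # 3

The scans are genuinely parallel, but the final step is serialized at the head node, which is the limitation we're about to discuss.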
Okay, but again: when you process this stuff in parallel, that works great, but when you get the results, you need to send them all back to a single node to finish processing. And that's the limitation of distributed query. So it's great that you get some parallelism, but it can't do everything in parallel, and you can see this when you run performance experiments. A true parallel query example comes from a system called Teradata, a database company that many folks haven't heard of because they've been selling very, very high-end databases to very, very high-end customers, so they don't need much word of mouth in the popular news media. What's happening here is that as every individual row is inserted into the parallel database, it's assigned to some particular server using a hash function. Okay, so everything is automatically partitioned, more or less randomly, across the cluster. And then whenever you run queries on this, all the machines access their data in parallel. Okay, and you can see this has a little bit of the flavor of how we did the relational join in MapReduce, and that should become even clearer in a second. Okay, so remember, this is our query, orders and line items, and this is the plan we're going to run: we're going to select some orders and then join the orders with the items. How this starts is that these parallel processing units, called AMPs in Teradata terms, will each contain a piece of the data, a chunk of the data, where the chunk was defined more or less randomly by hashing, okay. And so they all in parallel begin to scan their individual chunks, and then they all in parallel apply the filtering condition to throw out the records they don't want. And then they all in parallel hash on the... actually, this slide isn't right: this should be hash on the order ID, not the item ID, since that's the join key, right? We're going to join on order ID. And I probably got this wrong back here too; yeah, this is wrong as well, it should be join on order ID, it doesn't quite make sense to call it item. Same thing here. Okay, so they all in parallel hash on the join attribute, the order ID, and that shuffles the data, for lack of a better term, to another set of AMPs (perhaps the same set of AMPs, but typically another set) that will do the next step: actually compute the join. And so this should look a lot like a MapReduce job, right? You've got a map function that's scanning, selecting, and then hashing, and then you've got a reduce function coming up to actually produce the join. And for the other relation, the same thing happens: you scan the items and then hash on, again, the order ID. All right, so you're scanning and selecting on the orders, and just scanning on the items, and then both are hashed on the appropriate join attribute, and lo and behold, all the items and all the orders that correspond to the same order ID, the same join key, end up on the same machine, and you can actually process the join, just like the MapReduce example we saw.
So then at the end of these two steps that I've shown you, AMP four will have all the orders and all the line items where the hash of the order ID equals one, AMP five will have all the orders and line items where the hash equals two, and this one will have all the orders and line items where the hash equals three. Now each one has enough information to individually, and in parallel, finish the join and actually produce its share of the result, and the same goes for all the other AMPs. So fine, the point is that the same machinery already exists in these parallel databases, whereas in MapReduce, if you're interested in doing a join, you're implementing it yourself. And this observation was not lost on people: hey, it might be nice if there were a standard way of doing join in MapReduce so we didn't have to rewrite it ourselves every time. And in fact, there are libraries on top of MapReduce that do this. There's a library called Pig, from Yahoo, that I encourage you to check out, and it is recognizably relational algebra, right? There are operators called join; there are operators called group by. It does have a bit of a funny data model, where you're allowed to have complicated nesting as opposed to just straight tuples and straight relations, but the relational algebra is there. And in fact, this is one of the points I want to make: it's important to be able to modularize the concepts that come out of various communities, especially databases. With databases, it tends to be all or nothing: if you're interested in using them, you have to take the whole package. But increasingly, what you're finding is that these concepts are leaking out into other systems. The reason I'm really emphasizing this relational algebra piece is that you can use these concepts independently of buying in to a strict relational model, okay? And certainly without strict adherence to a particular implementation of it. Now, another system, called Hive, is literally SQL on top of Hadoop, so it goes one step even higher: instead of stopping at the relational algebra level, it actually provides a SQL interface. Impala is a more recent system from Cloudera, which I should mention here: Cloudera, by the way, is a company that has aligned itself pretty closely with the Hadoop stack, and so they have their own distribution of the Hadoop system and a bunch of great tools for working with that ecosystem. And Impala is a new system they produced that provides SQL over HDFS, and it actually uses a lot of the code from the Hive system, okay? Cascading is another system that's maybe a little less common, but it's also very recognizably relational algebra. The Dryad system I mentioned has nothing to do with MapReduce directly, except for a similar motivation, but it very obviously has relational algebra in there. And there's another system I mentioned that's more of a research project, and it's not clear to me that the code is available, but it's also very clearly relational algebra. So when you put your relational algebra goggles on, you start to see the world this way, and it comes up everywhere, okay? So it's good to go back and understand those operations. Okay, so I want to spend a little more time on the details of MapReduce versus relational databases, beyond just how the query processing happens. We saw that parallel query processing is largely the same.
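To make the similarity concrete, here's a hedged Python sketch of that hash-partition-then-join pattern; the three "AMPs", the tiny tables, and the in-process "shuffle" are invented for illustration:

    from collections import defaultdict

    NUM_AMPS = 3  # pretend we have three parallel processing units

    orders = [(1, "2024-01-05"), (2, "2024-01-06"), (3, "2024-01-07")]
    items = [(1, "widget"), (1, "gadget"), (2, "sprocket"), (3, "cog")]

    # Step 1: hash every tuple on the join key (order ID) and shuffle it to
    # an AMP, so matching orders and items land on the same AMP.
    amp_orders = defaultdict(list)
    amp_items = defaultdict(list)
    for order_id, date in orders:
        amp_orders[hash(order_id) % NUM_AMPS].append((order_id, date))
    for order_id, item in items:
        amp_items[hash(order_id) % NUM_AMPS].append((order_id, item))

    # Step 2: each AMP joins its own slice locally, in parallel.
    def local_join(order_rows, item_rows):
        dates_by_id = defaultdict(list)
        for order_id, date in order_rows:
            dates_by_id[order_id].append(date)
        return [(order_id, date, item)
                for order_id, item in item_rows
                for date in dates_by_id[order_id]]

    for amp in range(NUM_AMPS):
        print(amp, local_join(amp_orders[amp], amp_items[amp]))

Squint at the two steps and you can see the map side (scan, select, hash) and the reduce side (the join itself) of the MapReduce version.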
Many of the algorithms are shared between them. There are a ton of details here that I'm not going to have time to go over, but the takeaway is that the basic strategy for performing parallel processing is the same. But there are other features that relational databases have, and I've listed some of them here. So we've mentioned declarative query languages, and we've mentioned that those start to show up in Pig and especially Hive. Now, there's the notion of a schema in a relational database, which we didn't talk about too much, but this is a structure on your data that is enforced at the time the data is presented to the system. Any data that does not conform to the schema can be rejected automatically by the database. This is a pretty good idea, because it helps keep your data clean. It's not really feasible in many contexts, because the data is fundamentally dirty, and saying that you have to clean it up before you're allowed to process it just isn't going to fly. And so this is one of the reasons MapReduce is attractive: it doesn't require that you enforce a schema before you're allowed to work with the data. However, that doesn't mean schemas are a bad idea when they're available. And in fact, even with MapReduce, a schema is really there; it's just hidden inside the application. When you read a record, you're assuming that the first field in the record is going to be an integer, the second field is going to be a date, and the third field is going to be a string. So that schema is really present; it's just present in your code, as opposed to pushed down into the system itself. And there's a lot of great empirical evidence over the years that suggests it's better to push it down to the data itself when and where possible. And in fact, you're starting to see this: Hive and Pig, again, have some notion of schema, as does DryadLINQ, as do some emerging systems. There's a system called Hadapt, which I won't talk about in detail, that combines Hadoop-level query processing for parallelism with a relational database operating on the individual nodes, and one of the reasons, among many, is to have access to schema constraints. Fine, logical data independence. This one you actually don't see quite so much. This is the notion of views, right: does the system support views or not? You don't see as many instances of Hadoop-like systems that support views, but I predict they'll be coming. Indexing is another one. We talked about how to make things scalable, and that one way to do it is to derive these indexes to support logarithmic-time access to data. That's not available in vanilla MapReduce: every time you write a MapReduce job, you're going to touch every single record in the input. You can't zoom right in to a particular record of interest. That's wasteful, and it was recognized to be wasteful, and so, as one of the solutions, you see people adding indexing features to Hadoop. HBase is an open-source implementation of another proposal by Google, for a system called Bigtable, that among other things provides quick access to individual records. And HBase is designed to be compatible with Hadoop. So now you can design your system to get the best of both worlds, okay: some indexing along with your MapReduce-style programming interface. And once again, I'll mention Hadapt here as well.
One of the motivations for Hadapt is to be able to provide indexing on the individual nodes. Okay, fine. I'll mostly skip caching and materialized views; a materialized view is the same idea as views and logical data independence, except you actually pre-generate the view as opposed to evaluating it at runtime, but we won't say too much about that. And then transactions, which I'll talk about in a couple of segments in the context of NoSQL. While databases are very good at transactions, transactions were thrown out the window, among other things, in this MapReduce and NoSQL context, and they're starting to come back. But remember, what MapReduce did provide was very, very high scalability: a thousand machines and up. And it also provided this notion of fault tolerance. Relational databases didn't really treat fault tolerance this way. They were unbelievably good at recovery, right? Because of this notion of transactions, if you were operating on the database and everything went kaput, then given some time, it would figure everything out and recover, and you were guaranteed to have lost no data, okay? That's fine, but that's not the same thing as asking: during query processing, while a single query is running, what if something goes wrong? Do I always have to start over from the beginning or not? The implicit assumption with relational databases was that your queries weren't taking long enough for that to really matter. But in the era of big data, of massive data analytics, of course you have queries that run for many, many hours, right? And of course they're running on many, many machines, where failures are bound to happen. So that context is something that MapReduce really spoke to, and now you see modern parallel databases capturing some notion of fault tolerance as well, okay? So this is a partial list of contributions from relational databases, and this is a partial (well, maybe complete) list of contributions from MapReduce. And my point is that you see a lot of mixing and matching going on. The design space is being more fully explored. It used to be all about relational databases, with their particular choice in the design space, and then MapReduce rebooted that a little bit, and now you see a more fluid mix; people are cherry-picking features. Okay, fine. And then the last one, which I guess I didn't talk about here: what I think was really, really powerful about MapReduce is that it turned the army of Java programmers out there into distributed systems programmers. A mere mortal Java programmer could all of a sudden be productive processing hundreds of terabytes without necessarily having to learn anything about distributed systems. That was really, really powerful. The analog in databases was, I mean, you had to become a database expert to be able to use these things, okay? And so I think that impact is hard to overstate, right? The ability for one person to get work done that used to require a massive team and six months of work was significant. All right. Okay, so to wrap up this discussion of MapReduce versus databases, I want to go over some results from a paper in 2009 that's on the reading list, where they directly compared Hadoop and a couple of different databases, and see if we can maybe explain what some of these results tell us.
Okay, so this was Andy Pavlo and some other folks at MIT and Brown who did an experiment with this kind of a setup. The comparison was between three systems: Hadoop; Vertica, which is a column-oriented database; and DBMS-X, which shall remain unnamed, although you might be able to figure it out. Now, we haven't learned what a column-oriented database versus a row-oriented database is (we may have a guest lecture later that describes that in more detail), but for right now, just think of these as two different relational databases with different techniques under the hood. Okay, and there are two facets to the analysis: one was qualitative, a discussion around the programming model and the ease of setup and so on, and the other was quantitative, performance experiments for particular types of queries. Okay, so the first task they considered was what they call a grep task: find a three-byte pattern in a hundred-byte record. And the dataset was a very, very large set of hundred-byte records: 10 billion records, totaling one terabyte, spread across either 25, 50, or 100 nodes. This task was performed in the original MapReduce paper in 2004, which makes it a good candidate for a benchmark. Okay, so you're just trying to find the matching records. This is much like the genetic sequence DNA search task that we described as a motivating example when we were describing scalability. Okay, fine. So what were the results? Just to load the data in, this is what the story looked like. One thing to know is that the authors included the designers of the Vertica system, and so most of these results are going to show Vertica doing quite well, for a variety of reasons that we're not going to talk too much about. We're mostly going to be thinking about DBMS-X, a conventional relational database, versus Hadoop. Okay. So here, loading is fast on Hadoop, while loading is slow on the relational database; and again, it was fast on Vertica as well. So why is it faster on Hadoop? Well, there's not much to the loading, right? You have to put the data into HDFS, so it needs to be partitioned, but that's about it. When you put things into a database, it's actually recasting the data from its raw form into internal structures in the database, and that takes time, okay? And the process can be even worse if you're building indexes over the data, because every time you insert data into the index, it needs to maintain that data structure, okay? So load times are known to be bad. The takeaway: remember that load times are typically bad in relational databases relative to Hadoop, because the database has to do more work. Now, once the data is in the database, you actually get some benefit from that work, and we'll see that in a second even in these results; we know that it conforms to a schema, for example. In Hadoop, it's just a pile of bits; we don't know anything until we actually run a MapReduce task on it, okay? And so how much faster? Well, in their experiments, on 25 machines we're up here at around 25,000 (these are all seconds, by the way), so roughly 7,500 seconds versus 25,000, and a little bit less as we go up to more servers, okay?
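For reference, the grep task is about the simplest MapReduce job there is. A sketch, where the pattern, the in-memory records, and the identity reducer are placeholders, not the benchmark's actual code:

    # Map: emit any 100-byte record containing the 3-byte pattern.
    def map_fn(offset, record):
        if b"xyz" in record:
            yield (offset, record)

    # There's no real reduce work; an identity reducer passes matches through.
    def reduce_fn(offset, records):
        for record in records:
            yield (offset, record)

    records = [b"a" * 100, b"ab" + b"xyz" + b"c" * 95]
    for offset, record in enumerate(records):
        for key, match in map_fn(offset, record):
            print(key)  # prints 1: only the second record matches

Every record has to be touched no matter what, which is why this task is a pure scan for both kinds of systems.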
Now, actually running the grep task to find things, this is what we see. Again, maybe ignore Vertica for now, because I haven't explained what's different about Vertica that allows it to be so fast; just think about a relational database in terms of what we do understand. Hadoop is slower here, even though both systems are doing a full scan of the data. The grep task is not amenable to any sort of indexing (you actually have to touch every record), so there's no obvious reason why the database should be slower or faster. But the database partially gets a win out of its structured internal representation of the data: it doesn't have to reparse the raw data from disk like Hadoop does. So let me take back saying there's no fundamental reason; there is one. The data is already in a packed, internal, binary representation, which we paid for in the loading phase, and now we get the benefit here in the query phase, even before we talk about indexes. Okay. Now, a selection task, where you don't necessarily have to scan every record: that is amenable to indexing, as we discussed in the scalability segment. Well, here the story is even more extreme, right? The Hadoop results are just way, way higher than both the database results and, in particular, the Vertica results. And the reason is that you can build an index on the pageRank attribute and zoom in directly to the records you're interested in. Okay. So those are search and retrieval tasks, arguably not exactly what Hadoop was designed for. Hadoop was designed more for analytical tasks. So now consider these. Here the dataset is 600,000 HTML documents, which works out to about six gigabytes of data per node, along with another dataset of 155 million user-visit records and 18 million rankings records. So this is kind of a web data processing task. A simple aggregate task here is to add up all the ad revenue corresponding to a particular subdomain: they apply a substring function to the source IP to pull out the first few characters, essentially the subnet prefix of the IP address, group by that, and just add up the ad revenue. Okay, so one thing to point out is that this is very nicely and simply expressed as a SQL query; you don't have to write a bunch of Java code and MapReduce to express it in this particular case. Okay. And so here are the results. Again, you see a row-oriented database beating Hadoop. The reason here is maybe not quite so easy to explain, but essentially the internal representation in the database again gets the credit: there's no parsing that has to happen. Okay. On the same schema, there's a join task, which is to find the source IP that generated the most ad revenue, along with its average page rank. This is kind of a complicated thing involving multiple passes over the data: compute the average page rank, find the maximum ad revenue and its corresponding source IP, and then compute that IP's average page rank. The implementations here are a fairly complex SQL statement involving the use of temporary tables, and in MapReduce, it has to be three separate MapReduce jobs chained together. Okay.
And so for the complicated SQL, we won't go through this in too much detail, but just notice that there's a join and then there's a GROUP BY. We looked at how to break down some complicated SQL before, and this is no different: you see two tables, which should make you think to yourself "join", and then you see a GROUP BY. So those are really the two tasks going on. And then the second step is a big sort (you see the ORDER BY) to find just the topmost record. Fine: a join and a GROUP BY. And here the results are also pretty impressive, and the reason is again the indexing, right? This join can be done very, very quickly, because there are different ways to do a join. The one we described for MapReduce is what you do when you have no extra information: all you've got are two big relations, and you have to scan them both in parallel, shuffle them across the network with respect to the join key, and then perform the join. But if one of them is indexed on that join attribute, you have other plans available to you, and the database is automatically going to figure out the right one, thanks to the magic of relational algebra. And so that's what's going on here, and so both Vertica and the relational database can do a lot better, okay? Now, so that's fine. That paints the picture that maybe relational databases are great and, boy, this MapReduce framework is all wet, and why would anyone use it? Well, we talked about fault tolerance. But a couple of other things. There are other ways to avoid sequential scans that you can implement directly in Hadoop. For example, if you have a large relation and a small relation (small in the sense that it fits on a single node and doesn't need to be partitioned, which happens a fair amount), you can actually broadcast the small one: make a copy of it and send it to every machine in the cluster, or at least every machine that has a piece of the other relation you're joining against. So if you're joining R and S, and S is small and R is big, we just copy S to every partition of R, and now you can do the join locally without this shuffle phase. Okay, so the paper didn't take advantage of that mechanism. Moreover, especially in modern systems (this paper was in 2009, which is now quite a bit old), there are ways to provide indexing capabilities in a Hadoop stack, so you're not dead in the water just because vanilla MapReduce doesn't give you indexing. So the positive view, if you look warmly on this work, is: great, relational databases have all these benefits, and Hadoop can't really compete on some of these even very basic queries. Another way of looking at it is: these tricks that we already know work really well, like indexing, do indeed work, so all we've got to do is add them to Hadoop and we'll get the same kind of benefits. Okay, so what's interesting here is to read the response from Google when this paper came out, which was a discussion published in CACM. One of their points was that the largest known database installations at the time were both at eBay: a Greenplum system on about a hundred nodes, and a Teradata system also on about a hundred nodes. And the largest MapReduce installations at the time were way, way larger, right? Nearly 4,000 nodes at Yahoo and 600-plus at Facebook. And again, this was years ago.
So these numbers are much higher now, actually, in both cases, but I think the overall point is still the same: the size of even a typical Hadoop cluster is pretty enormous. Okay. To conclude the comparison (we've said this a couple of times, but just to wrap it up one more time), what MapReduce can learn from databases, in the words of the authors of this paper, is that declarative languages are a good thing and schemas are important. And what databases can learn from MapReduce is this query-level fault tolerance; support for what I'm calling in situ data, which is data as it lies, right, don't require that the data be transformed and loaded before you can work with it; and then maybe: embrace open source. Because, again, if there had been an open-source parallel database available, you might not have seen the same popularity for MapReduce. Okay. Other systems that have been considered in the same kind of framework after the fact include HadoopDB, which became Hadapt, which I mentioned is now a startup. This is the Hadoop file system with a relational database on the individual nodes (originally Postgres, though it's actually not Postgres anymore), in order to get some of the indexing and at least some of the node-local benefits of relational query optimization. And then Hive also came out since then. Fine. So that's the end of both MapReduce itself and MapReduce compared to relational databases, and in the next segment, we'll talk about NoSQL systems, which are solving a slightly different problem. Okay, so for the next few segments, I want to talk about NoSQL. These systems are typically associated with building very large, scalable web applications, as opposed to analyzing data, which is really the focus of this course. However, I think it's important to cover this topic for a few reasons. One is that, as a data scientist, you'll be manipulating data that is increasingly found in one of these variants of a NoSQL system. But also, the systems and the terminology in this space are really influencing people's thinking about how to deal with large-scale data, and so being cognizant of the landscape here, and of the major trends in its history, allows you to make informed decisions. As a data scientist, you may be asked to make recommendations about what kind of platform to go with to do your analysis, and so understanding the pros and cons of some of these systems, and how they work, can be pretty important, okay? And then maybe third, the same concepts that we've been discussing in other segments (relational algebra, logical data independence, simple scalable analytics going a long way, indexing) come up in this space as well, so you have another application of these concepts. And then finally, the data scientist may be called upon to actually build some of these large-scale systems as part of their work. That's not unheard of. I mentioned in the first few segments that building data products was an important part of data science, and one manifestation of data products is these large-scale web applications, okay? And so NoSQL systems may or may not be a part of that. Fine, so let's get started. All right, so where we are so far: we've said that data science maybe has these three steps, and the first is data preparation at large scale; we talked about manipulation and munging and data jujitsu and so on.
That's all step one. Then step two is analytics, actually running the model, and we're going to get to that next week. And third is interpreting and communicating the results; visualization will be a big part of the way we talk about communication. Okay, so we're still in this data preparation phase, and we're spending a bit more than a third of the course on it, for the reason I gave in the first few segments: this is really the part that keeps people up at night. So I want to make sure that you're armed with how to use databases to do this data munging, and how to use MapReduce to do this data munging. I'll also make the point that a lot of the time, even the analytics itself can be pushed into these systems, which you saw, hopefully, in assignment two, the database assignment, okay? So some of the key ideas we took away from databases: this concept of the relational algebra comes up even outside of databases, it's not only found in SQL systems, and we'll see that again; this notion of physical and logical data independence comes up over and over again; and maybe indexing. And then we talked about MapReduce, gave a lot of examples of basic operations in MapReduce, and started an assignment involving writing your own MapReduce programs, at least at the programming-model level. And here we saw that part of the advantage of the system itself (first at Google, and then in the open-source implementation, Hadoop) was fault tolerance at scale. Another advantage was that you didn't have to load the data: you could just work with it as is, unlike databases, where you have to impose some sort of schema on the data in order to even touch it for the first time. And that first step can be a doozy, okay? Extending this point, you get direct programming on in-situ data: whatever comes at you, if you can write a program that can process it, you can probably write a MapReduce program that can process it at scale, right? And that's a very, very powerful thing. Maybe another way to put it is "single developer", right? You're kind of up and running within the hour with MapReduce, and that was never really a property that databases had; it was always a significant project to get a database installed and running. Fine. So that's the background, but what we haven't talked about is these NoSQL systems. And I want to use this table as a way to organize the roadmap for this discussion. What I've done here is tried to list out, by feature, a bunch of relevant systems in this space, and right now they're roughly sorted by time. These features are admittedly selected by me for what I think the important ones are, but I don't think they'd be too controversial, and I don't think there's anything obviously missing here either. Okay, so going through these briefly. A major one here is whether the system scales to thousands of machines. Then there's lookup by a primary index. What I mean by that is lookup by some key value, right? You can look up a record by its identifier. Okay. And then another feature that a system may or may not have is lookup by secondary indexes. What I mean here is that you can look up by some attribute that is not that key. Databases have this, for example: you can build an index on any attribute you want, and the optimizer will take advantage of it.
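As a toy picture of the difference between those two kinds of lookup (a Python dict standing in for each index, with made-up records, nothing vendor-specific):

    # Primary index: record id -> record.
    primary = {
        101: {"name": "sue", "city": "seattle"},
        102: {"name": "joe", "city": "portland"},
        103: {"name": "kai", "city": "seattle"},
    }

    # Secondary index: a non-key attribute -> list of record ids.
    by_city = {}
    for rid, rec in primary.items():
        by_city.setdefault(rec["city"], []).append(rid)

    print(primary[102])                              # lookup by primary key
    print([primary[r] for r in by_city["seattle"]])  # lookup by a non-key attribute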
The third one, and this is the one we'll spend a lot of time on because it's pretty fundamental to the motivation for why NoSQL systems earn their own moniker, for why they're different, is transactions. Okay, and you can see that there's some complexity here that I'll try to explain as we go through this: it's not just a yes or a no; it kind of depends, case by case. Okay. And then this field is whether or not the system supports, essentially, joins, but I've generalized that to analytics, and I'm perhaps guilty of making those two things almost synonymous. If you can do joins, then there's a whole space of different kinds of analytics you can do, and if you can't do joins, then you're leaving all that work up to the client, and all you can do is retrieve data. So that's really a key indicator of how much computation you can push into the system, and how much you have to bring the data back out to the client. Okay. And then integrity constraints. I debated calling this column schema rather than integrity constraints, but as I think we'll see, arguably some of these systems do have a schema but may or may not actually enforce it, and so, just to avoid confusion, I'm going to call this integrity constraints. Hard schemas, if you will. Okay. And then views, which, if you remember, I'm declaring to be synonymous with logical data independence. And then finally: is there some sort of declarative language, or algebraic way of programming against this thing, or is it really just a low-level API of simple operations? Okay. All right, so a couple of caveats here. One is that I haven't included any parallel databases on this list at all, although there's absolutely no reason why you couldn't include them; they would tend to have a lot of check marks across their row, but since the focus is NoSQL systems, I'm leaving them out. The other caveat is that individual cells in this table may be debatable depending on how you interpret the column. So these aren't meant to be hard and fast rules, but again, I don't think they're wildly controversial either. Okay. So one of the first stories you can tell by staring at this table is that relational databases have been around for quite a while and have check marks sort of everywhere: they have all these features, except they weren't ever really shown to scale to lots and lots of computers, right? Everything was on the order of tens of machines. Okay, so why don't they scale? Well, we talked in the MapReduce segment about read performance, and we analyzed some experimental results from that 2009 paper comparing the strengths of MapReduce versus databases. So maybe on read performance there's an argument to be made that they did scale, but one area where they certainly didn't, or at least certainly weren't shown to scale to this level, was updates. Okay, so not just the read workloads and the analytic workloads, but the transaction processing workloads. All right. Okay, so we start with the same schematic we were looking at when we were talking about MapReduce and scalability, where we take a big dataset, break it into chunks, and send those chunks to different machines. Okay.
And here we are replicating this chunk to three different machines, which is the same thing we did in the Hadoop file system for fault tolerance purposes: if this machine dies, we still have two copies of the data to draw from, and we do this with every chunk. All right. But there are two requirements we need to speak to here. We need to ensure high availability, so that when something goes wrong, the data is still available. And we also want to support updates in this context, which is different from what we were talking about before: instead of just read performance, or fault tolerance in the context of reads, we also want to make changes to this data and have them propagate to the other replicas, and in some cases to other consumers of that change. Right, there might be other blocks of data that refer to the same information, and I'll give you an example on the next slide. Okay. So imagine a social networking application where people are updating their statuses and their friends get to see the status updates. Okay. So the write operation here is Sue updating her own status. And the question we ask about her friends is: what happens? Who sees the new status? Who sees the old one? How does this status change propagate? The answer to this question from a database perspective was: well, look, everyone must see the new change or no one does, right? Either the transaction commits and all copies of the data everywhere are synchronized simultaneously, and further, anybody attempting to read the value in an intermediate state either reads only the old value or has to wait until the transaction commits, right? Which could be an arbitrarily long time; deadlocks can happen, which is why I said arbitrarily. Fine, so that's the answer given by databases. Everything is synchronous, everything must be updated; it's either all or nothing. And the NoSQL systems made this observation: well, look, for really large applications, we simply can't afford to wait arbitrarily long for this to happen, right? You need status updates to be able to commit and respond so the user can go on to do other things, right? They can't sit looking at an hourglass while the synchronization is still occurring. And then, further, the observation is: well, maybe it doesn't matter anyway. Is it really that important if Sue's friend Joe sees the new status while Kai still sees the old status? Maybe who cares, right? As long as Kai eventually sees the new status, maybe that's good enough, okay? And so these observations suggested moving to a different area of the design space with respect to scalability, availability, and consistency: application-level consistency. That space of systems started to be associated with a kind of anti-database stance (it took a very different approach than databases did), and so the term NoSQL came into play. It's actually unfortunate that the name that stuck was NoSQL, because it doesn't have all that much to do with SQL, right? It has more to do with the transaction processing side of databases, which is not really about SQL at all. The model of transactions, the sequences of reads and writes, has nothing to do with the query language. But hey, that's what stuck.
Now, I don't mean to say that the term NoSQL only refers to these transaction models; it also suggests a weaker data model and so on, and we'll talk a little about that. But I want this point to come across, because it's one of the key ideas, okay? So how did databases solve this problem, and why does it take so long? Well, there's a protocol called two-phase commit that's fairly standard in these situations for synchronous processing, and the motivation for it goes like this. You have a bunch of replicas, or other kinds of subordinates: anybody that needs to see the change. You make your status update, and the servers holding your friends' data need to be told of the change. If you just go ahead and tell them, "look, I made this change, update your internal state to reflect Sue's new status," then some of them can report back success while one of them fails. Now you're in trouble, because that one has the old value: either you didn't hear back from the server at all, or it responded with a failure ("something's wrong with my disk, I can't do it"). These two (I'll put check marks on them) have already successfully applied the transaction, so you're in an inconsistent state: subordinate three has the old value and the others have the new one, okay? The way you solve this is two-phase commit. In the first phase, the coordinator sends a prepare-to-commit message, and the subordinates take action to guarantee they can commit that transaction when asked, no matter what. Typically this means writing the information related to the transaction to a log, so that even if the power goes out, when they wake back up they can pull it from the log. The subordinates then reply with "yes, I'm ready to commit." In phase two, if all the subordinates said they're ready, you go ahead and send the commit message; if instead anyone failed, you send back an abort message and the individual subordinates can clean up. Okay, and here's the schematic. In step one, the coordinator says prepare; the subordinates write ahead to the log, saying "I'm about to commit this transaction," and respond with "yes, I'm ready to do so." The coordinator comes back with commit, and then finally all the work is done. I'm not going to show the schematic for what happens in a failure, but essentially the coordinator needs to watch out for one and send back an abort if something has gone wrong. Okay, so there are a couple of problems with this. One is that there are dependencies on the coordinator: if the coordinator fails at the wrong time, things can go kind of screwy. Now, a fully distributed protocol for ensuring mutual commitment of transactions, or other kinds of operations, can be achieved, and one of the most successful and popular methods is an algorithm called Paxos, which we're not going to cover in detail, but you're going to see that term if you look at the readings for the NoSQL systems. So: think two-phase commit for a database on a local cluster; think Paxos for a distributed, peer-to-peer kind of protocol.
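Just to make the two-phase part concrete, here's a minimal single-process sketch in Python. This is my own illustration, not code from any real system; a real implementation would also force the log to disk, handle timeouts, and deal with coordinator recovery.

    class Subordinate:
        def __init__(self, name):
            self.name = name
            self.log = []            # stands in for the write-ahead log on disk
            self.value = "old status"

        def prepare(self, txn):
            # Phase 1: record enough state that we can commit later, no
            # matter what. Returning False would be a "no" vote.
            self.log.append(("prepare", txn))
            return True

        def commit(self, txn):
            self.log.append(("commit", txn))
            self.value = txn         # Phase 2: actually apply the change

        def abort(self, txn):
            self.log.append(("abort", txn))   # clean up the prepared state

    def two_phase_commit(txn, subordinates):
        votes = [s.prepare(txn) for s in subordinates]   # phase 1: collect votes
        if all(votes):
            for s in subordinates:
                s.commit(txn)                            # phase 2: everyone commits
            return "committed"
        for s in subordinates:
            s.abort(txn)          # any "no" vote (or timeout) aborts everywhere
        return "aborted"

    subs = [Subordinate(n) for n in ("s1", "s2", "s3")]
    print(two_phase_commit("Sue: at the beach", subs))   # prints: committed

The key design point is that a subordinate that votes yes in phase one has promised it can commit later no matter what happens; that's what the log write is for.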
And just briefly, what Paxos is essentially doing is running a voting scheme: the individual servers vote to determine among themselves whether or not they're supposed to commit the transaction. The details can get a little subtle, but overall it's pretty simple given the difficulty of the task involved. Okay, so fine, that's one problem. The other problem, which Paxos shares, is that all this can take a while, right? If subordinates don't respond promptly, you might be waiting around. If things fail multiple times, you need to abort and retry transactions at the application layer. Things can go slow; it doesn't necessarily scale, and when there are thousands or millions of subordinates that need to do this, you're kind of dead in the water. So, other protocols that I'm not going to talk about in much detail, but that you will see mentioned in some of the papers, include multiversion concurrency control, where each write creates a new version of the data item, and the legality of a read is determined by checking the timestamp of the reading transaction against the timestamp of the version you're trying to read. If the item has been updated since the time you're supposed to be reading it, then prior to MVCC all you could do was abort the read and say, look, you're looking at dirty data, you're done. But with multiversion concurrency control you can keep multiple versions around and redirect the read to the prior version that is correct, thereby avoiding aborting certain transactions. Fine. Now, that mechanism still depends on a coordinator role to administer timestamps. A fully distributed scheme, where the decision to commit or abort a transaction is made through a voting scheme among peers, is Paxos. Paxos is very successful and very widely applied, and you'll see it mentioned in some of the NoSQL papers on the reading list if you take the time to read them. It relieves the dependency on a central coordinator, but it's still synchronous, it still has the potential for deadlock, and it can take some amount of time to reach consensus depending on what's going on and what kinds of failures are happening. So it's difficult to guarantee very high performance, very low latency response times. All right. Now, the term eventual consistency was originally coined not so much in the context of its utility in allowing systems to scale to very large levels, but in an argument that the only players in a distributed system that could make the appropriate decision about how to handle conflicts were the applications themselves. It was a version of the end-to-end argument that you may or may not have come across in the context of networking. The term comes from a 1995 paper by Doug Terry, who says: we believe that applications must be aware that they may read weakly consistent data, and that their write operations may conflict with those of other users and applications; moreover, applications must be involved in the detection and resolution of conflicts, since these naturally depend on the semantics of the application. We'll make the argument in a couple of segments that I'm not sure I totally agree with these assertions, and that it's actually better for the system to take care of this when it can.
But what I wanted to do is let you know that this is where the term comes from, as opposed to the NoSQL systems of the last ten years or so, which is where it really grew in popularity. So what does it mean? Well, it means that in the absence of updates, all replicas will eventually converge to identical copies. As long as things don't continuously change, as changes settle down, everyone will eventually see the same value. All your friends will see your new status; they won't be permanently stuck looking at the old one. But what the application sees in the meantime, which status any particular friend might be looking at, is sensitive to the internal details of whatever application you're building, and is therefore difficult to predict. For this reason, it's a little difficult to reason precisely or formally about what eventual consistency means, because it is so dependent on particular implementation details that are themselves difficult to formalize. Okay, and in general, contrast this with what we've been talking about with relational databases and protocols like Paxos, which guarantee strong consistency but may have deadlocks. You can show that no such system can both be free of deadlocks and guarantee consistency, and so relational databases and Paxos give up on this liveness property, meaning they may allow deadlocks, in favor of strong consistency. Now, you can make the cases where deadlocks occur rare through various design decisions, but they can still happen, okay, fine. Eventually consistent models say: we can't afford the cost of waiting for these protocols to run, and moreover, they may not be necessary in certain application contexts. All right. So where we are now is we're looking at this transactions column, which I've already marked up a little, and what these entries mean is the scope over which strongly consistent transactions are supported. A scope of a single record means I can update multiple fields in one record, and either all of the changes will occur or none of them will, okay. By the way, I've filtered this list to include only NoSQL systems. Relational databases support transactions across arbitrary records: you can update a record over here and a record over there, call that one transaction, and the system will only ever expose both of those changes or neither of them. That's what's not supported in these NoSQL systems. Within an individual record, it is supported. In some of these systems, nothing is supported; there are no guarantees at all, really. And where you see EC, that means eventually consistent: not strongly consistent, not a transaction, but there are eventual consistency guarantees at the record level, and that's what those systems do guarantee. Then this system Megastore, which is built on Bigtable (it's also a Google system), defines a notion of entity groups: a set of related records for which transactions are strongly consistent within the group. So that's a little better than a single record being the only guarantee you can give, and a little less than arbitrary records anywhere in the database. It's predefined entity groups that allow transactions; a compromise.
Fine, and then this most recent system from Google, Spanner, offers true strong consistency across all records, and we'll talk about why they made that choice in a little bit. Okay, so another concept I want you to be familiar with is the so-called CAP theorem, from Eric Brewer in 2000, with a follow-up paper by Gilbert and Lynch in 2002, which defines three notions: consistency, availability, and partition tolerance. The way this is often described is that you have to choose two of the three, but I don't really like thinking of it that way, and Eric Brewer has also said that maybe that's not the right way to think about it. The reason is that it's not clear what it would mean to choose consistency and availability at the expense of partition tolerance. So what is partitioning? Well, if you've got a big distributed system with hundreds of servers all communicating with each other, and some segment of them loses communication with the others, can the two segments still make forward progress in the application independently and sync up later? Or does everything, or at least certain nodes, have to stop and wait until communication is reestablished? For example, if you have a master node that controls everything, and some worker nodes lose contact with the master, there are many designs in which those workers can't make any forward progress until they reestablish the connection: they can do no useful work because they're waiting on that next message from the master telling them what to do. In those cases you've given up availability, right? Those nodes go down; they're no longer accessible or doing useful work in the face of a network partition, okay? On the other hand, if you say, sure, we'll continue doing useful work independently, then it's not difficult to show that you can arrive at an inconsistent state: updates are coming into this partition and updates are coming into that partition, and sometime down the road, when communication is reestablished, you find out, oops, your replica has one value and my replica has another. Which one's right? Well, we're going to have to sort it out, but meanwhile we've already exposed those values to the application, so in some sense we're demonstrably inconsistent, okay? The point is that you can't get all three: you've either sacrificed availability, or you've sacrificed consistency by allowing things to continue working. Conventional databases essentially assume there is no partitioning, and again, this is a function of them operating on only tens of nodes at a time, right? They didn't go to the thousand-node scale or planet-wide distributed systems, so you could reasonably assume you didn't need to worry about queries coming into one half of your nodes that can't talk to the other half; they're all sitting in a cluster in your data center, or really in your server room, and so that wasn't an issue they were thinking about too much, okay? The NoSQL systems do need to worry about this. They are very large, and they are very distributed.
There are all kinds of failures, even Byzantine ones, happening all the time because of the sheer scale, and therefore these systems choose to sacrifice consistency instead of availability, okay? Graphically, you can look at this in a sort of triangle form and place different systems along the edges: relational databases assume consistency and availability but assume partitions can never happen, while other systems tolerate partitions but give up on availability, ensuring that certain kinds of transactions are going to be consistent, okay? And then other systems say, well, we're going to give up on consistency, but you can always do useful work, okay? So fine, and really, the scope of the transaction that I put in that table is critical here too: nothing except Spanner over here even tries to provide global transactions the way relational databases do, okay? So fine, I'll pick up here in the next segment. So, Rick Cattell wrote a nice paper in 2010 about scalable SQL and NoSQL data stores, where he laid out a taxonomy of these systems and placed popular instances into it. This slide corresponds to his grouping, where each color is one of the groups he defined: he lumped them into key-value stores, document stores, and extensible record stores. What he meant by a document is something like an XML document or the JSON object we looked at in the Twitter assignment, where you can have arbitrary nesting, and it's also extensible: you can add new things to it whenever you want, with no top-down schema being enforced, okay? An extensible record you can think of as much like a database record, except that new attributes can be added. So there is some notion of a schema that's used for various purposes, and in particular there are groups of attributes that are manipulated together, these column families, but you can stick new attributes onto an individual row, which you're not really allowed to do in a relational database. And finally, a key-value object (I'm using the term object here) is a set of key-value pairs. The difference here is that there's typically no schema of any kind, so you don't care what keys there are; they could be any keys at all. And there's no exposed nesting. What I mean by that is that a value can be anything you want: you might have some kind of complex object in the value, but it's not going to be visible to the system. It's just a blob, a black-box object that the system doesn't know anything about. So only one layer of nesting is visible to the system, unlike a document store, where an object might have multiple layers of nesting exposed to the system. Okay. And so his characterization of NoSQL features, admitting that there certainly isn't a formal definition of NoSQL, is that the term tends to be applied to systems with these features. First, the ability to scale simple-operation throughput across many, many servers. By simple operations they mean key lookups, or maybe attribute lookups, or reads and writes of just one or a few records: needle-in-the-haystack kinds of operations, as opposed to the big analytic queries we've been talking about with MapReduce and with databases. Okay.
The second criterion is the ability to replicate and partition data over many servers automatically, so you can break a single large dataset into multiple pieces without having to manage it yourself. Here you might see the terms sharding and horizontal partitioning. The difference between the two, if there is any, is not particularly important, so you can think of them as synonyms: whenever you see sharding, think horizontal partitioning of a database table. You'll see horizontal partitioning used more in the database community and sharding used more in the NoSQL community. Okay. Then, a simple API, meaning no query language; this corresponds to the first bullet, the simple operations. Then, critically, a weaker concurrency model than ACID transactions. We'll talk a bit about ACID on the next slide; I'm not going to go into a lot of detail here, since there are 40 years of research on this topic, too much to cover in this course, especially when we're mostly focused on reading and analyzing data as opposed to concurrency control. But we will spend some time on a few concurrency control techniques in a minute. Okay. Then, efficient use of distributed indexes and RAM for data storage, so the emphasis is on minimizing latency, not just throughput. And then, typically, the ability to add new attributes to data records in various ways, as we talked about on the previous slide; the lack of a schema is what you can think of here. All right: so, no schema; no transactions (and we'll go into more detail there); no query language, hence no SQL, right? And high scale. Okay. So, ACID. He also talks about this term BASE, which never quite caught on with me; I wouldn't typically use it, and I'm not sure I recommend you do either. Certainly ACID is much more permanent in the vernacular than BASE is, okay. So ACID is an acronym standing for four concepts: atomicity, consistency, isolation, and durability. Just briefly, the context here is that we're modifying records in, let's say, a database, and we can be modifying lots of different records across different tables, anything we want; the point is that they're all lumped into one transaction, and that's the context for each of these concepts. Atomicity means the entire transaction either succeeds or fails: you're not allowed to have partial transactions succeed. Consistency is the slipperiest one in my mind, and this quote down here maybe captures that: any data written to the database must be valid according to all defined rules. The question is, where do these defined rules come from? Sometimes they're actual integrity constraints in the database; other times they're business logic rules, perhaps enforced by the application, or just assumed by it. So it's a little difficult to, say, prove that a system achieves application-level consistency, but that's the goal. The point is that if the database has only certain allowed states, the system should not allow transactions to put you into an invalid one: you should always go from valid state to valid state. Isolation means that while the transaction is occurring, other readers and writers can't sniff partially completed values.
Okay: you can't read partially completed data items before the transaction is complete; other readers only ever see final states. This is the property most often relaxed, in various ways, partly because it's very expensive and partly because it's usually not all that critical. And durability just means that if you report back that the transaction succeeded, it needs to have actually succeeded, meaning it was written out to some kind of non-volatile storage, so that if the power goes out and the machine crashes, you don't say, "hey, whoops, that transaction I committed yesterday didn't take; you need to do it again." That's not allowed. So fine, these all make some sense, with a little slipperiness around consistency, as I mentioned. And then the pun here is that someone tried to force the acronym BASE onto this, where the idea is that the system is Basically Available, with Soft state, and Eventually consistent. This isn't Rick Cattell's coinage; it came from elsewhere. We talked about eventual consistency, at least an overview, in the previous segment. Fine, that's all I'm going to say about that. So something else I like about this paper is that he says, look, the major-impact systems here are these three: memcached, Dynamo from Amazon, and Bigtable from Google. And the reason he calls these the major-impact systems is that you can trace the lineage and show that the other systems are basically taking ideas from one of these three early systems. So memcached is very, very simple, and we'll talk in a minute about one particular technique it made popular. But it's essentially just: hey, let's load everything into memory, scaled out across many, many machines, and then we'll be able to serve read requests directly from memory without having to go query the database. What's also made it very, very popular is that you can install it on top of your database, whether it scales out or not, and it sort of just works: it makes things faster for read-heavy workloads, and that's a nice thing. It's an older system, on the order of 2003, but it's still very widely used and very popular, and there have been all sorts of extensions to it; we'll talk about perhaps the most basic version. Amazon's Dynamo paper, which has somewhat more recently been released as a cloud service called DynamoDB, didn't invent the concept of eventual consistency, but it did show that if you relax the consistency notion, you can scale way, way out. Okay. So data you fetch is not guaranteed to be up to date, but updates are guaranteed to eventually propagate everywhere they need to be, and we gave an example of why this was a good idea in the last segment. And then Google's Bigtable, which we'll spend some time on, demonstrated that record-oriented storage could scale to thousands and thousands of machines, something databases had not shown. Okay, so let's talk about each of these systems in turn. Memcached, as he says, is a main-memory caching service: no persistence, and in the basic version no replication, meaning there's only one copy of every cached value. So if a machine goes down, the cached values on it are just gone. That's okay, because it's a cache.
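The usage pattern that makes this work is usually called cache-aside. Here's a tiny sketch of the idea in Python; the helper names are made up, and a plain dict stands in for the memcached cluster (a real client would be a library such as pylibmc).

    class FakeDatabase:
        def lookup(self, user_id):
            # Pretend this is a slow query against the real database.
            return {"id": user_id, "name": "user %s" % user_id}

    cache = {}   # stand-in for the memcached cluster

    def get_user(user_id, db):
        key = "user:%s" % user_id
        value = cache.get(key)           # 1. try memory first
        if value is None:
            value = db.lookup(user_id)   # 2. on a miss, fall back to the database
            cache[key] = value           # 3. populate the cache for next time
        return value

    db = FakeDatabase()
    get_user(7, db)   # hits the database
    get_user(7, db)   # served straight from the cache

If a cache machine dies, the worst case is some extra database reads while the cache warms back up, which is why going without replication is tolerable here.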
It's not assumed to be the golden copy of anything. Okay. That said, there have been many extensions that provide these various features, including Membrain and Membase, so it's a pretty mature system and still in wide use. An important concept it adopted in this context is consistent hashing, so I want to explain a little about what consistent hashing is; that's one takeaway from this lecture. Okay. First, for those of you without much background in programming: what is regular hashing? Hashing is a very, very general concept, fundamental to all of programming, but in the context of what we're doing here, we're trying to assign data keys to a bunch of different servers. Okay. The simplest way you might do this is essentially round-robin: the first key goes to the first server, the second key goes to the second server, and you keep going until you run out of servers and wrap back around to the first one. That's implemented by the modulus function, server = k mod n, so each data key is placed on one of these servers. Fine, that's how hashing works. What's wrong with that? Well, what happens if I want to add more servers to the mix, say, double the number of servers? Every existing data key's location now needs to be re-evaluated by computing k mod 2n instead of k mod n, which means essentially every single data item is going to be remapped. So every time you want to add a server, you end up having to move all the data that's already in the system, and you're dead in the water. Okay. So what you want is some notion of consistent hashing, where consistent means that when I place something somewhere and then add more servers, it will typically stay right where it is. And there's a pretty good trick for doing this that's pretty simple to understand. Okay, here's how it works. The first key idea is that you map the server IDs into the same space as the key values themselves. So we apply a function, which I'll leave unspecified, that maps server one to some point on this circle, server two to another point, and server three to a third point. What that does is divide the space into three sections. Now each key that comes along, I also map onto the circle: key one lands here, key two there, then keys three, four, five, six, seven, and so on. And now each server is responsible for all the data keys in the region of the circle leading up to it: this server is responsible for all of these keys, that server for all the keys in that region, and so on. Okay. And what's nice about this is what happens when I add a new server, say server four. Say it lands right here. You just apply the same rule: it becomes responsible for every key in its region, which means these two keys need to be moved from server three to server four, but you only have to move that one section of data. So adding a server moves at most about k/n of the data items, rather than all of them.
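Here's a small Python sketch of that circle trick, again my own illustration, just to show the mechanism fits in a few lines (real systems also place several "virtual nodes" per server to even out the arc sizes).

    import bisect, hashlib

    RING_SIZE = 2 ** 32

    def position(name):
        # Map a server ID or a data key onto the same circular space.
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE

    class Ring:
        def __init__(self, servers):
            self.ring = sorted((position(s), s) for s in servers)

        def server_for(self, key):
            # Walk clockwise to the first server at or past the key's
            # position, wrapping around to the start if necessary.
            positions = [p for p, _ in self.ring]
            i = bisect.bisect_right(positions, position(key)) % len(self.ring)
            return self.ring[i][1]

        def add_server(self, server):
            bisect.insort(self.ring, (position(server), server))

    ring = Ring(["server1", "server2", "server3"])
    keys = ["key%d" % i for i in range(1000)]
    before = {k: ring.server_for(k) for k in keys}
    ring.add_server("server4")
    moved = sum(ring.server_for(k) != before[k] for k in keys)
    print(moved)   # only the keys in server4's new arc move, not all of them

Running this, only the fraction of keys falling in the new server's arc change homes, versus essentially all of them under plain modulo hashing.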
All right, so this is a nice trick, and there are all kinds of extensions, for instance for supporting replicas. If you need to put data items in more than one place, you just hash each item to multiple positions on the ring and store a copy at each, and you're done. Okay, so how do we serve requests in this setup? Well, imagine the key space is divided across servers in the way we just described, and a request comes in to some server, one that may be elected leader among the servers, or assigned top-down by the system, or even chosen randomly. The naive approach would be for this server to check whether it has the key being requested and, if not, forward the request on to the next server around the ring. But that's no good, because there could be many, many servers, and every hop would add latency to the read. A better way is for each server to memorize the locations of certain other servers on the ring, like this: it knows where it is itself at position A, and it knows the servers at positions A+2, A+4, A+8, A+16, and so on, along with the key range being managed by each of those servers. Then it can forward the request to the known server closest to the key range being requested, and the lookup takes only a logarithmic number of hops. And even with lots of servers, keeping this information straight as new servers come in is manageable, because each server only has to keep track of a logarithmic number of other servers; everything ends up being logarithmic to maintain. Okay. So, Dynamo, from Amazon in 2007, which was a paper and, again, a few years later was released as a cloud service called DynamoDB. Here we're looking at a system that scales to thousands of nodes; you can look things up by primary index and basically nothing else. It's just a key-value store, like memcached. Okay, so what are some of the tricks it introduced? Well, some key features (and here I'm talking in terms of DynamoDB, the implementation you can now go use and pay for): one of the neat things is that it offers a service-level agreement on performance. They promise to respond within 300 milliseconds for 99.9% of requests. And the reason they state this at the 99.9th percentile, as opposed to some notion of the average or the median, is that an average would artificially penalize the people using the system most heavily; they would get a disproportionate number of the slow requests, and it would be easy to satisfy an average by focusing only on the lightweight users, for example. Okay. So Dynamo, the system, is a distributed hash table; that's what DHT stands for. Each value is stored at multiple locations for replication purposes, up to a replication factor of N: at locations K, K+1, and so on up to K+N-1. They achieve eventual consistency through vector clocks, which I'll describe in the next couple of slides, and reconciliation of potential conflicts between readers and writers happens at read time, which is another interesting feature of Dynamo. Okay, and writes never fail; the reason they cite in the paper is poor customer experience, right?
That is, if you're building a web application where, say, you update your status on some social networking site, and it comes back with an error message that says "sorry, couldn't commit; someone else was editing the same status from somewhere else," their claim is that that's more disruptive than occasionally getting a stale read, which seems reasonable to me. All right. So, conflict resolution: for many applications, the most recent write may simply be the one that wins, or you can have the application control it; in some cases you may even go back and ask the user to resolve the conflict manually. Okay, so the goal of vector clocks is to detect conflicts in a concurrent read-write scenario, not necessarily to do anything about them automatically. In this scheme, every data item is associated with a list of (server, timestamp) pairs that indicates its version history. In this example, some value D was read by a client, and D1 was written back through the server called SX. What SX does is append this fact to the vector clock: at timestamp one, server SX created a change. Then some client reads D1 and writes back D2, and you might expect to append both entries to the vector clock, but since D2 descends from D1 and was handled by the same server, you can garbage-collect the older part of the clock. The same server with a higher timestamp means the entry with the older timestamp is no longer needed, since there are no other conflicts in play. But now, independently, two different clients read D2 and write back different values: one writes back D3 and one writes back D4, and these two requests were handled by different servers, SY and SZ. Since they're different servers, both facts get recorded in the vector clock. Now, the context, which is what Dynamo calls this vector clock, will reflect this fact when the next read comes in, and the reader will see that, wait, there's a conflict here, because we have parallel histories from two different servers. You can either ask the client what to do, or apply some heuristic where the later write wins. And these timestamps may not be logical counters; they could be actual clock timestamps, in which case ties may force you to just make an arbitrary decision, pick one, and go, okay. So that's how vector clocks work. Just to run through what we saw: a client writes D1 through server SX. Another client reads D1 and writes back D2, also handled by SX, and D1's clock entry is garbage-collected. Then separate clients read D2 and write back D3 and D4 through two different servers, SY and SZ. And then another client reads both D3 and D4, and the system reports that there's a conflict to be handled. Okay, so let's practice with these. Here are pairs of vector clocks, and you figure out whether they represent a conflict or not. In this first case, we have one version with (SX, 3) and another with (SX, 3), and then each one has a different server with different timestamps. So is there a conflict? Well, yes, there is, because on one version path SY made a series of changes, and on another version path SZ made a series of changes, and they didn't talk to each other: neither reflects the other's changes. So yes, there is.
Then in this one, the second clock is the same server at a later timestamp, so is there a conflict? Well, no, because the changes weren't handled by different servers: the later version simply subsumes the earlier one, and we're okay. In the next one we have (SX, 3) and (SX, 3), then (SY, 6) and (SY, 6), so they agree so far, and then one of them has an extra change, (SZ, 2). So no, there's no conflict here, because they agree everywhere except that one has an extra change on top, so that one wins. Okay. The next one: (SX, 3) and (SX, 3); then one has (SY, 10), and the other has (SY, 6) plus some later change at SZ. Is there a conflict? Well, yes. The first is later in time on SY, 10 versus 6, and if that were all there was, if it weren't for the SZ entry, we'd be okay: we'd just pick the first one because it's later. But because of that SZ entry, we now have some changes at SZ and some changes at SY that both happened after the latest point at which the two versions agreed, and we don't know how to resolve that. So yes, there is a conflict. And then similarly here: (SX, 3) and (SX, 3); one has (SY, 10), and the other, instead of 6, has (SY, 20), which is later than the 10, plus a change at SZ. Is there a conflict? No, because the second one is strictly later than the first: on every server they share, it has later timestamps, so it strictly subsumes the first. So no, there is no conflict. Okay. So last time we talked about consistent hashing, and in this segment we talked about vector clocks. These are two little gadgets to be familiar with, because they come up time and again in these NoSQL systems, and in other systems in general, okay. All right, so Dynamo also describes a way to parameterize the level of consistency, and this comes up occasionally in papers, so I just want to make sure you're exposed to it. The idea is that you have two parameters, R and W: R is the minimum number of nodes that need to participate in a successful read, and W is the minimum number of nodes that need to participate in a successful write. So this is how many replicas you write to, and how many replicas need to respond from the pool, before you know you have all the information. Because if everyone's updating everything all the time, you might have 20 different servers that each have a different version of the data you're trying to read, and the question is how many of them you need to sample before you feel you have the right one. For a replication factor of N, if R + W > N, the read set and the write set must overlap, so you can claim consistency. But often you want to set R + W <= N in order to achieve lower latency, so that you don't have to contact too many servers to satisfy a read or write request, okay. So if you see that notation, this is what it means, but I'm not going to describe much more about it, because I think the formalism falls over a little bit under scrutiny. Still, that's what's being discussed when you see R and W on, say, blog posts.
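Since vector clocks come up so often, here's a little Python sketch of the conflict test we just practiced. This is my own illustration of the rule, not Dynamo's code: a clock is just a dict from server ID to the latest timestamp at which that server handled a write.

    def descends(a, b):
        # Version with clock `a` subsumes version with clock `b` if `a` has
        # seen everything `b` has (a missing server counts as timestamp 0).
        return all(a.get(server, 0) >= ts for server, ts in b.items())

    def conflict(a, b):
        # Two versions conflict when neither descends from the other.
        return not descends(a, b) and not descends(b, a)

    # Roughly the practice examples from a moment ago:
    print(conflict({"SX": 3, "SY": 6}, {"SX": 3, "SZ": 6}))               # True
    print(conflict({"SX": 3}, {"SX": 5}))                                 # False
    print(conflict({"SX": 3, "SY": 6}, {"SX": 3, "SY": 6, "SZ": 2}))      # False
    print(conflict({"SX": 3, "SY": 10}, {"SX": 3, "SY": 6, "SZ": 2}))     # True
    print(conflict({"SX": 3, "SY": 10}, {"SX": 3, "SY": 20, "SZ": 2}))    # False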
Okay, so the next system we're going to look at is CouchDB, which began in 2005 but is still undergoing a lot of updates and is still pretty popular today. This is an example of one of the document-oriented stores that Rick Cattell talked about. Looking at the features we've been tracking: we've got scale, and we've got a primary index. Now we also start to see support for secondary indexes, meaning you can look things up not just by the key but by other kinds of values in the document, and we'll see how that works, okay. Transactions are also a little better: what I mean by "record" here is that you can change multiple values within a single document object, which is a set of key-value pairs, as opposed to just a single key-value pair, okay. And there's actually some support for analytics, and we'll see how that works too. This all goes through CouchDB's concept of views, where you can run little MapReduce scripts to compute new derived values from the existing documents. All right, and then the other notion we have here is views themselves, which is somewhat unique; you can see this column is fairly sparsely populated. I hit the concept of views pretty hard when we talked about relational databases, and argued that it's fundamental not only to the relational model but an important concept in general, because it gives you this notion of logical data independence. So whenever you see views, that's a good thing, and CouchDB has them as well. All right, so the data model here is the document, which, as we said, is a set of key-value pairs. In this application, these are perhaps blog posts: they have a subject key whose value is a text string, an author, a date on which they were posted, a set of tags (and this is part of what makes it a document model; we said values could be nested, so it's fine to have a list of objects here), and then a body, which is another string. You'll notice this is actually JSON-compliant. We looked at JSON in the Twitter assignment, and this is another occurrence of it; one of the reasons I wanted to make sure you were working with JSON earlier is that it really does keep showing up. In CouchDB, all data is represented as JSON, and everything that comes back from a request is represented as JSON as well. So how do updates work in this context? Well, as we mentioned, CouchDB is fully transactional within a single document, full consistency within a document, meaning I can logically grab hold of the document, lock it, and make whatever updates I want. Now, it doesn't actually take a lock; it uses what's called optimistic concurrency control, meaning it optimistically assumes conflicts aren't going to happen. So I check out the document, make updates anywhere I want throughout it, and try to commit the change; if someone else has committed their own changes in the meantime, my commit will fail. But we assume that doesn't happen often, so it's okay to just fail in those cases, and what you do then is check out the new version and reapply whatever changes you want. Fine, but there's no multi-document transaction, and I'll say what I mean by that in a moment. First, here's roughly what that optimistic update loop looks like in practice.
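This is a sketch against CouchDB's HTTP interface using the Python requests library; the database name, document ID, and field are made up, but the mechanism is how CouchDB actually detects the race: every document carries a _rev revision token, and a write that presents a stale _rev is rejected with 409 Conflict.

    import requests   # raw HTTP; the database and document names here are made up

    def update_status(base_url, doc_id, new_status):
        while True:
            doc = requests.get("%s/%s" % (base_url, doc_id)).json()  # includes _rev
            doc["status"] = new_status
            resp = requests.put("%s/%s" % (base_url, doc_id), json=doc)
            if resp.status_code != 409:   # 409 Conflict = someone committed first
                return resp.json()
            # Lost the race: loop around, re-read the document (with its
            # new _rev), and retry the write.

    update_status("http://localhost:5984/users", "sue", "at the beach")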
What you can't say, for example, is: when I update my status, update the document describing my current state of the world and also update all my friends' walls, the pages that reflect my new status, as one atomic action. You can't guarantee that happens synchronously in CouchDB. But in this particular application, and in many others, maybe that's okay. You can do it in two separate steps: you update your status and then you update theirs, and there'll be a period of time when their view of your current status is out of date, but maybe that's all right. Fine. All right, so this notion of views works like this. A view specification, which is itself a CouchDB document, a set of key-value pairs, but a special one, looks like this. There's some metadata information here, and then there's this key "views", which is a dictionary of things. This particular specification has three views in it: one called all, one called by-lastname, and one called total-purchases. And each of these views is going to implement a set of key-value pairs; it's going to implement a dictionary. So how are these implemented? Well, speaking of recurring concepts: views recur here from our discussion of logical data independence in relational databases, and MapReduce recurs here too, outside the context of literally Hadoop, okay. The map and reduce functions are actually written in JavaScript (again, everything in CouchDB is JavaScript), and they look like this. The all view has just a map function, no reduce function, and in JavaScript you can write anonymous functions in this sort of syntax. So this function doesn't have a name; it takes a single argument called doc, and the body says: if doc.type equals customer, then emit a key-value pair where the key is null and the value is the whole doc. We don't really care about the key in this case, we just want the document, so this view is all customers. Fine. For the by-lastname view, you can imagine the key is going to be the last name. So here we have another anonymous function: if doc.type equals customer, emit the last name as the key followed by the entire document as the value. This allows clients to search efficiently by last name, because views in CouchDB are materialized eagerly, computed and stored in distributed B-tree index structures to support lookups. That's how they implement the secondary indexes in that column of our table, okay: now we can look up by last name as well as by document ID. And you can go a little further than simple key-value pairs; you can even do a little computation. The total-purchases view's map function again takes in a document and says: if doc.type equals purchase (so I'm not talking about customers anymore, just purchases), then emit doc.customer as the key and doc.amount as the value. But then we also define a reduce function, and the reduce function adds up all the amount values. So now, given a customer, this view returns the total amount of all that customer's purchases, okay. And CouchDB maintains all these views as things change, with eventually consistent semantics. So we have a lot of things going on with this one concept, and here's roughly what such a view specification looks like, written out.
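This is a sketch of the specification as the Python dict you'd store in CouchDB as a design document. The view names match the slide; the map and reduce functions themselves are JavaScript strings, and field names like doc.lastname are assumptions on my part.

    design_doc = {
        "_id": "_design/customers",
        "views": {
            "all": {
                "map": "function(doc) { if (doc.type == 'customer') emit(null, doc); }"
            },
            "by_lastname": {
                "map": "function(doc) { if (doc.type == 'customer') emit(doc.lastname, doc); }"
            },
            "total_purchases": {
                "map": "function(doc) { if (doc.type == 'purchase') emit(doc.customer, doc.amount); }",
                "reduce": "function(keys, values) { return sum(values); }"
            },
        },
    }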
So with this one concept we get secondary indexes, logical data independence, and MapReduce computation. Okay, so I had checked the box in the table saying CouchDB could do joins and analytics, and that's not quite true; it's somewhat limited. So let me show you an example of how they do joins, sort of. They have this concept of view collation, and what you do is write a map function in a view that looks like this. Here we're trying to group together all the comments associated with a post, so it's effectively a join between the comments and the blog posts. This map function is the same kind of thing we did in the MapReduce assignment to do a join: you pretend the whole collection of documents, regardless of type, regardless of source relation, is one big set of objects, and in your map function you sort out which is which and make sure to hash on the same key. Here that key is the post's ID: for a post document it's the document's own ID, and for a comment document, doc.post refers to some post ID, so the post with ID 1 and all the comments on post 1 will go together. But they do this funny thing: the key in the key-value pair is actually the post ID followed by the number 0 in the post's case and the number 1 in the comment's case. What's going on is that everything in CouchDB is sorted, so things end up sorted by post ID and then by this extra bit, which means the post itself comes first and then all its comments come second. Now, you'd think you could just write a reduce function to compute the join you actually want, but it turns out reduce functions don't quite work that way, and there are scalability issues. So what people recommend is this view collation trick: you still spit out the data in sorted order, but then a client application can query it by range, saying the start key is just the post ID with no second component at all, and the end key is the post ID with the number 2. That range captures the post itself (whose second key component is 0) followed by all its comments (with second component 1), so you get everything in one big list, and now you can process it in order, say to show it in the application. You've essentially constructed the join by cheating with the sort order. Okay, so fine. Perhaps you could argue that CouchDB has some joins and analytics, but it's a bit glib to say so. Okay, so the third influential system Rick Cattell mentions in the paper is Bigtable, from Google, a paper from 2006. Here we're looking at primary index lookup and secondary index lookup; transactions are again at the scope of an individual record. Joins and analytics are not supported by Bigtable directly, but HBase, the open-source implementation of it, and Google's own implementation as well, were designed to be compatible with MapReduce, so you can run MapReduce over the same data that's stored in Bigtable; the two are complementary. There's some notion of integrity constraints, or schema, here, and we'll talk a little about how that's implemented. There are no views that I can see, and no language-level or algebraic way of manipulating these things; it's all NoSQL-style micro-operations against individual records and cells.
Okay, so this is a paper from OSDI in 2006, with some overlap in authors with the MapReduce paper, and it was designed from the start to be complementary to MapReduce. If you can remember, what were the main things missing from MapReduce? In particular, you couldn't look things up by index; you couldn't get low-latency accesses. So, for example, given a big dataset, if you wanted to find all the records in some other dataset that correspond to it, some sort of join, the best you could do was touch every single record; there was no way to zoom in on just the ones you wanted. Okay, and so Bigtable provides that fast key-based lookup, while you can still process the overall data as a big set of key-value records with MapReduce. Fine. The data model here is "a sparse, distributed, persistent, multi-dimensional sorted map," and what they mean is that you can access any cell in a Bigtable by giving a row ID, a column name, and a timestamp. The timestamp isn't really described in that English phrase; it's for versioning, so after updates you keep track of past versions of the same cell. Okay, and if you provide these three parameters, Bigtable will return you a string quickly. The data is all sorted lexicographically by the row key, this row ID, which is the primary key in relational language, or just the key in a NoSQL framework. Take the key space, say the integers, and contiguous subranges of this set of keys are assigned to tablets. Okay, so this is a little different way of dividing up the data than we've seen in the past, in at least one system. When we talked about the parallel database model, we happened to use the example of Teradata, and how did they break up data? By hashing: every individual record is sent to a server according to a hash function, which you can generally just think of as round-robin. The point is that two keys adjacent in the key space, say a timestamp of 5:00 p.m. and one of 5:01, need not land on the same server in Teradata's model. Here, they do. So what are the pros and cons? Well, if you're typically going to access a whole range of keys at once, it's pretty nice that when you fetch one, you get the others essentially for free, because you're pulling them all back together. However, if one particular key range is much more popular than the others (using the time example again, the most recent data is perhaps the most popular), then all the requests go to that one key range, and you've got a bunch of idle servers hosting the tablets corresponding to older times while all the requests hit one tablet. For that reason, Teradata chooses to hash everything, so that on every request all the servers may have to participate, but that's good for scalability. Okay, so pros and cons. All right, so the tablet here is the unit of distribution and load balancing, and the system will move tablets between servers as things start to get unbalanced.
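Here's a toy sketch of that range-versus-hash contrast, with made-up split points and server names, just to show why neighboring keys stay together under range partitioning.

    import bisect

    split_points = ["g", "n", "t"]    # hypothetical tablet boundaries: 4 tablets
    tablet_servers = ["srv-A", "srv-B", "srv-C", "srv-D"]

    def server_for_row(row_key):
        # Range partitioning: find which tablet's key range the row falls into.
        return tablet_servers[bisect.bisect_right(split_points, row_key)]

    def server_for_row_hashed(row_key):
        # Hash partitioning, Teradata-style: adjacent keys scatter across servers.
        return tablet_servers[hash(row_key) % len(tablet_servers)]

    print(server_for_row("com.cnn.www/a"), server_for_row("com.cnn.www/b"))
    # the same server both times: neighboring row keys share a tablet,
    # so a range scan touches one machine instead of all of them

The flip side, as I said, is that a hot key range means a hot tablet, and that's where the load balancing comes in.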
For example, if your key range is January through May and a whole lot of data is coming in for March, the system will split that March tablet into multiple tablets and start moving tablets around between servers to balance things out. Okay. Within a single table, you can have groups of columns called column families, and the column names carry the family right in them as a qualifier. The family is the basic unit of access control, so you can grant permissions on a group of columns; it's also the unit of memory accounting, in that families are allocated as a unit in memory, and of disk accounting, in that they're moved around on disk as a unit as well. Okay, and then there's this point that typically all columns in a family are the same type, which I find a little unusual. Talking about the family as the basic unit of access control suggests a logical grouping: things that go together for permissions, like a social security number and an employee ID, which may or may not be the same type. So there's a logical-grouping requirement they seem to be trying to meet, but then the columns also have to be the same type, and that's for a very technical reason, specifically compression: a whole bunch of integers is easier to compress than a mix of integers and strings. So I think they're trying to kill too many birds with one stone here. All right, and then each cell can be versioned, which is the third part of that lookup key: row ID, column name, timestamp. Each new version increments the timestamp, and you can enact different policies, like keeping only the latest N versions, or only the versions since a given timestamp. As for how tablets are managed: a master assigns tablets to tablet servers, and each tablet server handles reads and writes for the tablets it controls, okay? Clients communicate directly with the tablet server, as opposed to going through the master every time, which helps with scalability, okay? And when a tablet starts to get too big, the system splits it and load-balances. All right, the metadata keeping track of where tablets are located is itself organized in tablets: there's a root tablet in which each record describes a group of location records in a metadata tablet, and each of those metadata tablets gives the location of a particular user tablet, okay? So those values keep track, hierarchically, of where everything is at any one time. And Chubby, which they mention in the paper, is a distributed lock service for controlling access to these structures; I'm not going to talk much about it. Okay, so how are reads and writes handled in this system? Well, there's a table in memory, the memtable, that stores the sequence of updates as they occur, okay? A write operation adds a record to this memory-resident table, but it's also written to a log for fault tolerance purposes, so if the server goes down, it can read the tablet log and reconstruct what was going on. Read operations are served by reading the SSTable files, the actual data on disk, and then also applying the updates from the memtable on the fly. So a read effectively says: here's the stored value, and here's the stream of updates I need to apply to it to get the true, current value.
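Here's a toy sketch of that write and read path, with plain Python dicts standing in for the log, the memtable, and the SSTables, which in reality are sorted, indexed, immutable files.

    log = []        # stand-in for the on-disk commit log
    memtable = {}   # in memory, the only writable structure
    sstables = [{"row1": "a"}, {"row1": "b", "row2": "x"}]  # immutable, oldest first

    def write(key, value):
        log.append((key, value))   # logged first, for recovery if the server dies
        memtable[key] = value

    def read(key):
        # Freshest source wins: the memtable, then SSTables newest to oldest.
        if key in memtable:
            return memtable[key]
        for table in reversed(sstables):
            if key in table:
                return table[key]
        return None

    write("row1", "c")
    print(read("row1"))   # 'c': the memtable update shadows older SSTable values
    print(read("row2"))   # 'x': falls through to the on-disk data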
So this is fine, but what happens when the memtable gets bigger and bigger and bigger? Well, there are two types of bookkeeping events. One is a minor compaction: when the memtable gets big, it gets written out as a new SSTable file, with the changes merged. A major compaction takes all the SSTables and rewrites them into one big one, which may be split into multiple files, and also cleans up any deletes that have occurred. Deletes are just appended as instructions and don't actually remove anything right away; they're garbage-collected at this point, okay. This way, read throughput stays pretty high, and the upkeep goes on in the background. All right, so there's a host of other tricks here too. They can do various forms of compression, which can be specified by the clients in some different ways. They use Bloom filters to speed up existence tests: if I give you a row ID, a column name, and a timestamp and say "find me this value," what these Bloom filters let you do is very quickly determine that the key does not exist in the system, which avoids disk accesses during reads. These Bloom filter data structures are pretty cool, and I'm going to walk through them in this course in a couple of weeks, okay. Then there are locality groups you can define, which are another layer of organization on top of families: groups of column families that tend to be accessed together, fine. And then one more trick is that the SSTables, these disk chunks, are actually immutable. They never get written in place; the only time they're rewritten is when these major compactions happen and the whole thing is reorganized. That means the only writable data structure is the memtable, and so the concurrency control needed to keep everything straight remains pretty simple. Okay. So, Google's Bigtable again had a lot of influence, and one of the major outcomes, just as with MapReduce, is that it was taken up and implemented as an open-source project called HBase. Where Bigtable was compatible with MapReduce, HBase is compatible with Hadoop. In one slide there's not much difference here, but I just want to mention the terminology so you've seen it: there's the table at the top level, then region, store, and then MemStore and StoreFile. So the exact names of these structures are a little different, and they did insert one more layer of abstraction, the region. Fine. And how this stays compatible with MapReduce is that each of your map functions processes a single tablet, so it's one-to-one with the blocks of data we talked about when we talked about MapReduce. And then let me pose a question, somewhat open-ended. There's this notion of speculative execution in MapReduce that we talked about, whereby for fault tolerance reasons the system might kick off the same map task twice on two different replicas of the data. The reason is that if one fails, well, you have the other one, right? So you don't have to start over from scratch.
Okay, so Google Bigtable again had a lot of influence, and one of the major outcomes is that, just like MapReduce, it was taken up and implemented as an open source project, called HBase. Where Bigtable was compatible with MapReduce, HBase is compatible with Hadoop. In one slide there's not much difference here, but I just want to mention the terminology so you've seen it: there's the table at the top level, then region, store, and then MemStore and StoreFile. So the exact names of these structures are a little bit different, and they inserted one more layer of abstraction, which is the region, fine. The way this stays compatible with MapReduce is that each one of your map functions processes a single region, so it's roughly one-to-one with the blocks of data that we talked about when we talked about MapReduce.

And then I just want to ask a question, sort of open-ended. There's this notion of speculative execution in MapReduce that we talked about, whereby for fault tolerance reasons the system might kick off the same map task twice on two different replicas of the data. The reason is that if one fails, well, you have the other one, right? You don't have to start over from scratch. But in this environment, where you're now working on data that is being actively updated, and updates are exactly what HBase and Bigtable are designed to support, it's not quite as clear to me what's gonna happen. Because of eventual consistency, it's possible that two replicas of a tablet will not always agree instantly: they'll eventually agree on any given record, but different records within the tablet may disagree at any moment. So this is just an example of how, when you mix these two systems and there are no system-wide transactional guarantees, or really any system-wide properties, you can run into trouble. And I think a general theme with these NoSQL systems, including the ones designed by Google, is that you're offloading some of that responsibility onto the application, onto the programmer, to sort it out and make sure it's okay. Okay, and we're gonna come back to that in just a couple of minutes.

All right. So after Bigtable, several years later, there's a paper by a bunch of folks at Google about a system called Megastore that I'm not gonna spend a lot of time on. Basically, they found the point I just made, that these loose consistency models can complicate application programming, and what they wanted to do is provide a little more system support for certain kinds of safe updates. So here, instead of transactions being safe only within an individual record, as they are in Bigtable, they've extended that with this notion of entity groups. An entity group is a set of records that tend to go together, that tend to be accessed together; maybe, again, a blog post and all of its comments, for example. Each one of these is a record in a flexible record store, so it's okay for them to have different schemas, but they all tend to go together. Okay, and so what they've done is extend transaction support over an entire entity group, a set of related records. So they still get scalability by not requiring full system-wide, global synchrony, but they get away from the problem that very frequently I might need to update one record and all of its child records at the same time, and I can't do that in any kind of safe way.

Okay, so fine. Fast forward one more year, and there's a 2012 paper on a system called Spanner. I just want to mention these quotes and then we'll talk a little bit about the system. This one is still being explored by the community; the system itself isn't actually available for use, but the paper and its ideas are being studied. You do see a couple of open source implementations of the ideas in Spanner, but they're not nearly as popular as the open source implementations of the other Google systems. Okay, so the paper says: even though many projects happily use Bigtable, we've also consistently received complaints from users that Bigtable can be difficult to use for certain kinds of applications, those that have complex, evolving schemas, or those that want strong consistency in the presence of wide-area replication. Okay.
And then they go on to say: we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. And to this the database community could have said, well, sure, well, duh, right? That's exactly the point. System-supplied support for transactions is always a win, because it's difficult, error-prone, and expensive to try to do this at the application level. And more importantly, it's fundamentally wrong in some sense to do it at the application level, because the application doesn't have global knowledge of what's going on. Right? Only the system does.

So fine. Although Spanner is scalable in the number of nodes, here's the final quote: the node-local data structures have relatively poor performance on complex SQL queries because they were designed for simple key-value accesses, and algorithms and data structures from the database literature could improve single-node performance a great deal. Again, this is somewhat the Google-style approach to the problem: reboot everything, rebuild it all from scratch, and then cherry-pick and bring things back in. This has been working pretty well for them, and they've had fantastic impact on the community, but there's a lot out there in the database literature and in database systems that could have been used from the start. In fact, starting from the beginning and saying we're going to build a big Google-style parallel database may have been a better choice than getting completely away from databases and then coming back incrementally, only to find yourself back at a SQL system.

Now, I sort of skipped over what Spanner actually is. It's a planet-scale database system, and there is a SQL-like language. Actually, let me flip back a few slides to our comparison table; here it is. So: really big scale. Besides the primary key, you can access data by other attributes. There are transactions, and in fact they're global this time, real ACID transactions. It's not clear to me whether joins are supported; I suspect they are, because they keep talking about SQL, but I couldn't find an example one way or the other. There is some notion of schema, and they do protect against data that doesn't conform to the schema. There is some notion of logical data independence, although they don't talk about it much. There is a SQL-like declarative language on top. I didn't see much evidence that they're doing a whole lot of fancy optimization, and I just showed you the quote where they say their performance is poor on complex analytic queries, but that's something that could probably come along fairly quickly. Okay, so that's Spanner at a high level.

Let me give you a couple more details about what this system does. The data model here has this notion of directories, which are sets of contiguous keys with a shared prefix. You can think of a directory kind of like a tablet in Bigtable, but now there's this notion of multiple logical tables being interleaved. And if you're not used to staring at this syntax, don't worry too much, but for those of you who think in terms of DDL in a relational database, they have a create-table language that looks like this.
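Roughly, paraphrasing the example from the Spanner paper (the exact syntax may differ), with a small Python toy underneath to show what the interleaving does to the physical key order:

```python
# Paraphrase of the paper's schema example (not exact syntax):
#
#   CREATE TABLE Users { uid INT64 NOT NULL, email STRING }
#     PRIMARY KEY (uid), DIRECTORY;
#
#   CREATE TABLE Albums { uid INT64 NOT NULL, aid INT64 NOT NULL, name STRING }
#     PRIMARY KEY (uid, aid), INTERLEAVE IN PARENT Users;
#
# The effect is that album rows sort directly under their parent user row,
# because they share the uid key prefix. A toy illustration:

rows = {
    ("user", 1):              {"email": "ann@example.com"},
    ("user", 1, "album", 10): {"name": "vacation"},
    ("user", 1, "album", 11): {"name": "pets"},
    ("user", 2):              {"email": "bob@example.com"},
    ("user", 2, "album", 20): {"name": "food"},
}

# A sorted scan visits each user immediately followed by all of its albums,
# which is the locality benefit being described here.
for key in sorted(rows):
    print(key, rows[key])
```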
You create a table users with two columns and give it this keyword DIRECTORY, and then you create a table albums with some columns and this keyword INTERLEAVE IN PARENT users. What you end up with is a layout where there's a user with all of its albums, then the next user with all of its albums. And so you can see that all of these different systems we've been talking about are experimenting with ways of getting nested, hierarchical data structures that look a lot like what we saw way back in the 60s, right? And the motivation is the same as it was then: it's really, really fast. When you want to pull up a user and then immediately pull up all of its albums, it's really fast to access them laid out this way, right?

But I'd speculate that the reason the relational approach eventually replaced those hierarchical systems, and the reason it can and will happen here as well, is that performance is not the number one priority a lot of the time; it's minimizing the amount of developer headaches. So it remains to be seen, but I think this incremental walk toward a big new scalable relational database is underway. Now again, that doesn't mean I'm saying use all the old databases. They really were designed for a different workload, and there really is no evidence that they scale to some of these levels. But that doesn't mean you throw out everything we learned, okay? But that's more me editorializing.

So fine, how this works is there's a universe master at the very, very top, and this is just a singleton: there's only one per deployment, and they imagine there might be only a few deployments anywhere. They have sort of one for test, one for production/test, and one for production, and that's it, so many different applications will use the same deployment of Spanner. The universe master mostly just holds status information about the zones; it doesn't interact with clients at all. Then there's a placement driver that's responsible for moving these directories, these sets of records, around for load-balancing purposes, and this happens on the scale of every few minutes. All right, and then within a zone there's a zone master that assigns data to spanservers, there are location proxies that know where everything is and route requests to the appropriate spanserver, and then the spanservers themselves serve data. In here it's starting to look a little bit more like Bigtable; a zone is essentially an individual Bigtable deployment, okay?

So inside a spanserver is where the big difference is: this is where they're gonna try to support fully consistent transactions. Across these groups of replicas they support two-phase commit, and this is only needed when a transaction accesses data that is not contained in one particular group, okay? Otherwise it just skips over this logic and doesn't cost anything, okay? Sorry, let me get the terminology straight: two-phase commit runs across groups, and when a transaction is contained in one single group, you drop down a level and run the Paxos algorithm, which I didn't talk about in detail but mentioned exists, to sort out the reads and writes among that group's replicas, okay?
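If it helps, here's a toy sketch of that decision in Python. The function names are entirely invented, nothing like Spanner's real interfaces; it just shows where the fast path and the slow path diverge.

```python
# Toy sketch of the commit path just described: Paxos orders writes among
# the replicas inside one group, and two-phase commit is layered on top
# only when a transaction spans multiple groups.

def paxos_commit(group, txn):
    print(f"group {group}: replicate and apply {txn} via Paxos")

def paxos_prepare(group, txn):
    print(f"group {group}: prepare {txn}, vote yes")
    return True

def paxos_finalize(group, txn, commit):
    print(f"group {group}: {'commit' if commit else 'abort'} {txn}")

def commit(txn, groups_touched):
    if len(groups_touched) == 1:
        # Single-group transaction: no cross-group coordination needed.
        paxos_commit(groups_touched[0], txn)
    else:
        # Multi-group transaction: two-phase commit across groups, with each
        # participant group internally replicating its vote via Paxos.
        prepared = all(paxos_prepare(g, txn) for g in groups_touched)
        for g in groups_touched:
            paxos_finalize(g, txn, commit=prepared)

commit("T1", ["groupA"])            # fast path: Paxos only
commit("T2", ["groupA", "groupB"])  # slow path: 2PC over Paxos groups
```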
Okay, and the other piece I'll mention here is that this term Colossus is new: it's the successor to the Google File System. And the Google File System, which underlies MapReduce, is the original system that the open source HDFS, which underlies Hadoop, was modeled on. GFS is to MapReduce as HDFS is to Hadoop; when I'm throwing these acronyms around, that's how to keep them straight, okay?

So that's all I wanna say about Spanner in particular. Let's take a step back and look at all the different systems Google has for a second. MapReduce was a paper in 2004 that had a ton of impact; Bigtable had a ton of impact; and then later there's Megastore, there's a system called Tenzing that we didn't talk about, which is a SQL system on top of MapReduce, much like Hive if you're familiar with that or remember when we mentioned it, and then Spanner very recently. And so you can organize these things into a timeline just to get a sense of it, and because these systems have had so much influence, I want you to be aware of what they are and how they fit together, so it doesn't just sound like a big jumble of terms.

So MapReduce was one of the earliest ones. There's actually another early one called Sawzall that didn't get a ton of traction, but it was a nice paper. Then Bigtable came a couple of years later, and I drew a dotted line there representing that they're compatible, designed to go together: MapReduce was for analytics, Bigtable for the sort of micro-operations. And then both of these, a few years later, had an open source implementation, in Hadoop and HBase respectively. Okay. Fast forward a few more years and you've got Megastore and Spanner coming very quickly, one right after the other, and this heavy blue line represents that the influence is pretty clearly direct. In fact, I would suspect there's a lot of code being borrowed; Megastore makes plenty of references to Bigtable, Spanner makes references to both Megastore and Bigtable, and the papers have many, many of the same co-authors. Okay. And then Tenzing depends directly on MapReduce; it provides a SQL layer on top of MapReduce.

And then there are some other systems here. One is called Dremel, which originally handled just aggregate queries, but at extremely low latency. So this is in the analytics camp, because you're asking these aggregate questions as opposed to doing micro-updates, but it's extremely low latency, unlike MapReduce, which is more of a batch system. And so this is a great fit, and it's a very nice system. In fact, since then they've added joins, not just aggregates, and more importantly this was exposed as a service that you can use directly over the web, even in your browser, called BigQuery. That's an important one to watch: it's one of the few systems available as a cloud service that scales to very, very large data and supports analytics. Okay.

And then another one that we won't talk about yet, but will come back to, is Pregel. This adds the one secret ingredient that is near and dear to my heart, which is iteration. What I mean by that is, when you do analytics with MapReduce, you take step one and then step two and then step three, and you stop.
But for many kinds of tasks, especially in data science, many of these analytics and machine learning tasks, you have to do something again and again and again, until some kind of convergence condition is reached. And Pregel, one of our own systems, and a few other systems are the ones that noticed this limitation of MapReduce and extended it. People do this with MapReduce too, but in fairly ad hoc ways. Okay. And so we'll come back and talk about that, but: analytics, and low-latency micro-updates, are two big classes of systems, and analytics with iteration is perhaps a third class that we'll talk about.

Okay, so going back to the grid, we can highlight the systems that were based on MapReduce itself: the MapReduce paper in 2004, and these language layers on top, Pig and Hive, in 2008, where Hive is SQL and Pig is a relational-algebra-looking language that we'll talk about in some detail in the next few segments; and Tenzing, which is also SQL, and Impala, which is also SQL, where Tenzing is from Google and Impala is from a company called Cloudera that's a pretty eager evangelist of MapReduce and Hadoop-based technologies in general. Okay, so one trend I think you see is that these declarative languages on top of the parallel processing primitive of MapReduce are really here to stay, right? Even people who were originally against these kinds of languages are certainly building them now. Now, it's also fair to say that enterprises in general have made a pretty significant investment in SQL expertise, so even if they're attracted to the advantages Hadoop might bring, they're pretty much demanding SQL. So this may partly be a response to the inertia from having invested in SQL in the past; I think that's certainly true. However, it's also true that the desire for declarative languages is reasonably well founded, for reasons we've already talked about.

Okay, so you can put these systems on a timeline, and the only point I wanna make about it is that there's a bit of a gap between the paper in 2004 and these systems in 2008. But as soon as you had Hadoop, the system itself, developed at Yahoo and released as an Apache open source project, you immediately see an ecosystem start to emerge of extensions to it, in particular these languages on top. So I think the need for a high-level interface is evidenced by how quickly they came around once Hadoop was out. And it didn't stop with these; later systems came a few years after. Actually, not even on this page: if you include research projects based on extensions to MapReduce, there are potentially hundreds. These are some of the most popular ones.

All right, so another subset of this grid you can look at is just the NoSQL systems. Now, the whole last few segments have ostensibly been about NoSQL, but I've also included these analytics systems here, the MapReduce-based systems and a few others, for example Dremel, Spark, and Shark. Dremel is a system from Google that is the back end of a query-as-a-service system called Google BigQuery, which is pretty nice; I'd recommend taking a look at it. You can upload data, and it doesn't matter how big it is, and you can query it at very low latency. Spark and Shark come from the AMPLab at Berkeley and are part of the Berkeley Data Analytics Stack, BDAS, pronounced "badass."
And Spark is a language layer on top of a parallel processing system, not MapReduce itself, and Shark is a SQL layer on top of that. Okay. A couple of the distinguishing features of Spark: it loads everything into memory and processes it there when possible, writing things out to disk only for fault tolerance reasons, so much less often than MapReduce; and it also supports iterative processing, which is pretty important and which we're gonna come back to a little later in the course. There's a small sketch of both of these features at the end of this segment. And Shark, again, is just SQL on top of this.

All right. So within these NoSQL systems, there's not much I wanna say about this diagram, except to point out that there's been a sort of Cambrian explosion. First you had memcached, which again is just a caching layer in front of the real system, and the real system was a bunch of MySQL databases that weren't really working all that well for the requirements they were being used for. But you could bring things into memory and keep them there, looking them up by name, and it was just a performance enhancement, a nearly free performance enhancement if you invested in this system. The more radical approach of throwing out everything you have and replacing it with a NoSQL system came a little later. And so the only couple of points I wanna make are that there's been this kind of Cambrian explosion of different systems around this time, and that this space in here is nowhere near as empty as it looks. I picked out a few systems here, but really the emergence of new systems in this space hasn't slowed down much at all.
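Here's the sketch promised above: a few lines of PySpark showing a dataset cached in memory and reused across iterations until it settles. The data and the update rule are made up for illustration, and it assumes you have pyspark installed, running in local mode.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

# Load once and cache: later passes reuse the in-memory copy instead of
# re-reading from disk the way a chain of MapReduce jobs would.
points = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0]).cache()

estimate = 0.0
for _ in range(20):
    # Each pass is a distributed computation over the cached dataset;
    # the toy update rule halves the distance to the data's mean.
    estimate = points.map(lambda x: (x + estimate) / 2.0).mean()

print("converged estimate:", estimate)   # settles at the mean, 3.0
sc.stop()
```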