I've worked on a lot of different things at 10gen over the last two years. One of them has been our integration with Hadoop. Thinking back to the round table, the question came up about integrating systems, and things like our Mongo integration with Hadoop have come about because of the strengths that Mongo has and the strengths that Hadoop has; in some cases it makes sense to tie these things together. I've also worked on our Scala driver and done some work on the Java driver as well. What I want to talk about, partly because I inherited this proposal from a colleague who couldn't make it and jumped in with what he had submitted, is the promise of big data: what we mean by big data and what we need to worry about. We'll also look at how Mongo breaks up data in sharding, for example, to understand how we're approaching big data, and at a couple of concepts that float around but that not everyone has seen, like MapReduce.

So here's the big thing we need to be thinking about: software is eating the world. That's a quote from Marc Andreessen, and he knows a little bit about software. He was on the team that created Mosaic, the first graphical web browser, he co-founded Netscape, and now he does venture capital. He wrote a great article in the Wall Street Journal, linked here, about a year ago, where he talked about this idea that software is eating the world, that everything is software now. A lot of different ideas come out of that, but we have to keep it in mind because it's what's driving big data.

The first example is Amazon. I don't know at what point Amazon launched in Europe, but I remember starting to use them in the US around 1995 or 1996, when all they sold was books. If you read a lot of science fiction and other things that aren't always stocked in a regular bookstore, it was amazing: you could find any book you were looking for. As things expanded, they found their model worked really, really well for everything. Now they sell bicycles, televisions, computers, anything you need, and in some American cities you can even get home grocery delivery from Amazon. The reality, though, is that physical goods aren't really the point anymore. Think about all the other things people know Amazon for. One of them is ebooks: they're pushing the Kindle, and Marc Andreessen's article talked about this, the idea that Amazon is changing the book itself with software. But the biggest one is probably EC2, all of Amazon's public cloud services that run so many internet sites now. If you doubt that, just wait until the next time EC2 goes down and see how many sites you can't use, because there are a lot of them. What's worse, it's usually the ones you waste time on while you're waiting for code to compile that go down. Keep in mind that Amazon created EC2 as a response to the traffic rushes around the holidays: they built all these systems to use the extra computers, then turned around and offered them to everyone else, and it's all software. So this is a company that started with physical goods, but the software they created became as valuable a commodity as what they were selling.
Now, Netflix may not be as familiar here in Spain; the people here who lived in the US for the last couple of years probably know them. Netflix started by delivering DVDs to your house. When Netflix launched, DVDs were still not that popular; a lot of the rental stores didn't carry them and still had videotapes. The idea was that you ordered from Netflix, they sent you DVDs in the mail, and you kept them as long as you wanted. That takes a lot, though: think about all the inventory you have to keep, especially of really popular movies, because you don't want people on a waiting list. But as they've grown, they've shifted to an online service, and they've launched that online service in European countries where they never shipped a single physical DVD. They're able to basically flip a switch and run this software for the world. They've rolled out in a bunch of countries: Ireland, the UK, Canada, and recently the Nordic countries, where there was a big snafu because apparently the subtitles they were using had come from pirates. So suddenly physical inventory and postal distribution are gone. But some of this started as the software needed to track millions of DVDs being sent across America. Netflix's big thing was essentially a hack on the US postal system: they found a way to very efficiently get DVDs to and from people. Now what they run is very efficient software, and their whole infrastructure is built on EC2, so they're essentially building on Amazon. Also think about all the data here. They're collecting lots and lots of viewing habits, because when I log in to Netflix, it suggests things I might like based on other things I've watched. That means they're collecting data not just from me, but from everyone, and it's a little creepy when you think about it.

Then think about Disney. This is another example I'm shamelessly stealing from Marc Andreessen's article because I think it was really good; he looked at it somewhat from an investment standpoint, but the points are just as valid from an engineering standpoint. Disney, the traditional animation company, started like this. I couldn't find a good picture of a Disney animation sweatshop, where they used to have entire warehouses of people doing this, but in this picture she's basically coloring in Pinocchio. That's how Disney used to make movies, and it took a long time: you draw each cel, you have somebody fill it in, and if you have to make changes it's a significant effort. What they found in the late 90s was that companies like Pixar were eating their lunch. Suddenly these companies were doing everything in software. It might take a long time to render a Pixar movie, but think about how easy it is to change a major character's appearance in an animated movie done on a computer versus one drawn by hand. Disney's response was to transform themselves into a company built on this. This is a picture purported to be of Pixar's server farm; there's also a great photo floating around of a neon sign they have that says Render Farm. That's what happened: they bought Pixar because they couldn't compete. And the most significant Disney movie of this year that I can think of was Brave, which was a Pixar movie, all computer animated. So what does this force us to do? Software is eating data.
Software doesn't start from nowhere. All software needs some kind of data to get it started: user input, email, text files, whatever else. But all things that eat eventually have to excrete, and this is part of the problem we have. Ingestion eventually leads to excretion. So if we start with grapes and we add yeast, and my slide build is running very slowly here, we add yeast, we get wine, or, depending on how bad a job we did, vinegar. Yeast eats sugars and excretes ethanol. I actually started brewing beer at home this year, and I was fascinated by that process: the amount of sugar in the beer before you add the yeast makes a big difference in what you get when the yeast is done. Now, a cow eating hay, well, I think we can guess what happens there.

And look, there are lots and lots of enterprise companies pushing solutions where they've taken the stuff they've had for 20 years, which was inadequate even for small data, rebranded it, and told us we can have it for big data. That means we have to be discerning consumers. We have to understand what it is we're buying; we have to look at everything, and that's up to us. I was at the Hadoop Summit a few months ago and saw traditional companies that I was shocked to see there, suddenly pushing Hadoop solutions, companies you wouldn't imagine doing it, because they were all trying to ride this wave. So obviously cows excrete as well. Software eats our data: as I said, we feed stuff in. The question is, what does it excrete? More data. And that means the data gets bigger and the solutions become narrower.

We just talked about this idea of what defines big data, and I think there was a really good answer: big data is the point at which you can't work with it anymore with your traditional systems. I think of it exactly the same way, as the point where the data gets so large that you can't store it or process it with the tools you usually have. So software gets fed data, excretes data, we feed it back into software, and it's an endless cycle. Going back to alcohol, there's a great example. I visited a whiskey distillery last year; the most important ingredients in whiskey are water and malted barley, and after they finish with the malted barley, the husks that are left over get carted off by farmers and fed to cows, who ultimately fertilize the fields that the next batch of barley grows in. It's that kind of cycle. A farm has a similar cycle where everything feeds everything else. But this data gets bigger and bigger and bigger, and eventually we may end up with cow excrement again.

So there is a big market, and there are a lot of solutions for doing this stuff. We've got data warehouse software, and this is a growing market of systems in which we can store lots of information. We've also got operational databases. A data warehouse is optimized for loading lots of information quickly and extracting it in bulk, but not necessarily for being the thing right behind our web interface; an operational database is what we're really working with when we're building applications that customers use. And so there are lots of old systems being upgraded. There are solutions out there to add big data and MapReduce and all these other things to old databases like MySQL and Postgres, Oracle has their own solutions, and so on.
So we're taking the old tools and, in some cases, adding things; in other cases, just rebranding them and retraining the salespeople. You've also got NoSQL, with things like Cassandra, which is probably the first one I really heard a lot about and saw people doing big things with; CouchDB was in there as well, and there's Mongo too. One thing to note is that Cassandra has lots of integration with Hadoop, and the DataStax guys have built all sorts of things to better integrate their database with Hadoop. It comes back to the same thing: play to your strengths. With Mongo, we've also integrated with Hadoop, and we've got lots of solutions for scaling readily and easily and for dropping in where people traditionally use MySQL or Oracle, sliding in where it makes sense. You've also got platforms, and the most prevalent of these is obviously Hadoop. Hadoop is a platform for doing big data: there's a storage component, there's a calculation component, and we can make it work with all sorts of other tools.

Now, it's important not to tilt at windmills, and here, at least, I'm sure you've all heard of Don Quixote. In fact, there's an English word, quixotic; I don't remember the exact definition off the top of my head, but it's the same kind of thing, being overly optimistic and jumping into things without really thinking about them. And that applies here, because, again, everything has been rebranded and we're trying to pick our solutions, so it's very easy to get distracted. Which brings us back to: how do we decide what to adopt? How do we decide when different departments have picked different software? The answer is that at the end of the day you need to be able to justify what you've done, and that means you've got to keep it simple.

It may not actually be simple; you might be storing six terabytes of data. But fundamentally it has to be simple under these rules. First, the tools have to be something that not only you can understand, but your team can understand. Because if you go on holiday, you want to go on holiday and not have people calling you because the system is down and you're the only one who understands it. Early in my career I made that mistake more times than I can count, because it's easy to confuse being indispensable with being the person who takes control of everything and doesn't let anybody else touch it. Second, the tools and techniques have to scale. You should be able to start small, to do that first small project; maybe it's something you do on the side and it turns into something big. Twitter started as a side project inside another company and turned into something much bigger. If the architecture you choose early on can't scale with you, you're going to be in a really tough situation when you become popular, because you'll have to stop and redo everything. Third, it's important not to reinvent the wheel. I think this came up during our roundtable as well. It's so fun to build something from scratch, to be convinced that we know better, but there are lots and lots of smart people with too much free time who've built open source solutions that solve these problems really well.
There was a question earlier about things like integrating Cassandra and Lucene, and we see the same thing with Mongo: a lot of people use Solr or Elasticsearch alongside Mongo, because Mongo lacks full-text search right now and Lucene, Solr, and Elasticsearch are really, really good at search. Instead of everybody building their own search solution, they find a way to make what they already have work with something that's really good at what it does. That's important: reinventing the wheel is only going to put you further behind your competitors. And finally, don't bite off more than you can chew, because this happens as well. This is partly about projects, but it's also about how we deal with our data, which I'm going to talk about in a minute: we need to break things into smaller pieces. You're probably not going to fit a whole pig into your mouth; you're going to slice him up into small pieces and eat him one bite at a time. That matters, and it applies to a lot of different things, but it's also fundamentally how a lot of these big systems deal with things: by spreading the work out over many places and keeping the pieces of data small.

So, looking at big data, this is a little bit about how Mongo's sharding system, which we use to provide a big data storage platform, solves these problems, and a little about how we use it to integrate with our Hadoop solution. On the big data slide I would probably cross gigabytes off, because if you've only got gigabytes, it's adorable, but you probably aren't quite at big data scale yet. When you get into terabytes and petabytes and exabytes and whatever else is out there, it becomes a very, very different story. And the fact that you can walk down to the store and buy a terabyte hard drive is pretty impressive these days.

As I said before, a good system should scale with me. When I grow my system, when I move up, when I move down, whatever I need to do, I should get a uniform view; I shouldn't have to rewrite from scratch every six months because I've doubled my user base. My concerns when I build this kind of system, and they apply to things like the Hadoop connector, are: can I read and write this data efficiently at different scales? If you can only read it or only write it, you're in trouble, because you can only do one thing well. Writing a bunch of data that you can't get back out isn't going to be very useful, and reading a bunch of data when you can't keep up with the writes you need to do is even less useful. It also matters that you can calculate on large portions of this data: it's often not enough to just query, you have to do aggregations, calculations, whatever else you need.

A lot of systems follow the Google File System, which is probably the core example because so much else is based on it. A lot of the big data systems are chasing Google's white papers: Google publishes papers about what they're doing, but no code, and everyone tries to reproduce it. It comes back to not reinventing the wheel. Fortunately, Google's always two steps ahead. Hadoop's file system, HDFS, is based on the Google File System, and Mongo sharding borrows a lot of concepts from the Google File System and Bigtable as well. The idea is to break these problems into chunks. These are definitely not the only systems that do it this way, but they're two prominent examples.
And since I'm talking about my work integrating the two, they work well together; they're actually very similar. You break the data into smaller chunks. The default in both Mongo and Hadoop is 64 megabytes, which means that when you go over 64 megabytes, you break the data into more pieces. You can spread these out over many, many data nodes, where each node can hold many chunks, and if a chunk gets too large or a node gets overloaded, you can rebalance your data as needed.

Now, of course, the question is what I'm actually talking about, because I've gone from pictures of Homer Simpson to being a little more serious. A chunk is a range of values. This example applies to Mongo, and I use it because I think chunks are confusing for everyone; they were confusing for me early on, and we make mistakes when we're confused. That comes back to keeping it simple, and it also puts an onus on us as project owners to write better documentation. If I had a username collection with no data in it in Mongo, and I sharded it and said, I want you to shard on the username, what I'd get to start with is one chunk: minus infinity to plus infinity, the minimum possible value to the maximum possible value. As I continue to add data and hit that 64 megs, it splits into more pieces. So as I add data, if I add Bill, I get more chunks: I might have minimum to B and C to maximum. Suddenly we've got a very different picture, with two chunks. These are metadata, and each chunk is assigned to a server. So we say anything between minimum and B lives on server one, and anything between C and maximum lives on server two. And we keep adding data: if I add Becky, I might split even smaller, so now I've got minimum to BA and BE to BR, but I've also still got a C chunk, C to maximum. The ranges get smaller and smaller: individual letters, partial letter ranges, and so on. Eventually we'll get to the smallest possible chunk in this system, which is a single unique value. At that point there's no range anymore, because Brad is so large a chunk that we can't split him into smaller pieces.

There's a problem here in systems like this, and Mongo is an example; it's why we have to be careful about how we choose to split things. At this point it doesn't matter whether there's 64 megs or 64 terabytes of Brad: we can't break him into a smaller chunk. So often what you'll do is add something else to the key, like a timestamp, something that can be split further. If I'm building Twitter and my key is username comma timestamp, then when I get down to Brad I can start breaking Brad into chunks by date or whatever else. So this represents our ranges of information, and we can use it to spread data across the cluster.

Say we have a large data set where the primary key is the username. One thing I can do is simplify this and break it into chunks by letter. I realize I just said a chunk is not necessarily a letter, so you can think of B as being BA to BZ or whatever else you want; the point is that each is a range, and we've got a handful of chunks to work with. In Mongo, your chunks might not end up arranged in alphabetical order across the servers; it might not be chunk A, chunk B, chunk C; placement can look random. Within a chunk we maintain order, and the metadata keeps track of the ranges, but ultimately we want to represent these chunks in a way that lets us scale.
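As a concrete illustration of choosing those shard keys, here's a minimal sketch using the MongoDB Java driver to enable sharding and shard two collections, one on username alone and one on a compound username-plus-timestamp key. The host, database, and collection names are illustrative, and this assumes the older DBObject-based driver API; treat it as a sketch rather than a definitive setup script.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.MongoClient;

public class ShardSetup {
    public static void main(String[] args) throws Exception {
        // Connect to a mongos router (host and port are illustrative).
        MongoClient client = new MongoClient("mongos.example.com", 27017);
        DB admin = client.getDB("admin");

        // Enable sharding for the database.
        admin.command(new BasicDBObject("enableSharding", "app"));

        // Shard on username alone: all documents for one username end up
        // in a single chunk that can never be split any further.
        admin.command(new BasicDBObject("shardCollection", "app.users")
                .append("key", new BasicDBObject("username", 1)));

        // A compound key such as {username, timestamp} keeps a user's data
        // together but still lets a very large user be split up by date.
        admin.command(new BasicDBObject("shardCollection", "app.tweets")
                .append("key", new BasicDBObject("username", 1).append("timestamp", 1)));

        client.close();
    }
}
```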
We want lots and lots of data nodes, which means the more chunks we have, the more nodes we can support. So if I have four data nodes to begin with, each of those holds 25% of my chunks, 25% of my data spread across four servers, and this lets us scale. I've got four nodes and I distribute evenly, so my chunks go out across each of those four nodes. It was a nice even division here: 16 chunks and four servers. What we're always looking for in these kinds of systems, and in Mongo particularly, is equilibrium. I want an even distribution; I don't want one of my four nodes to have 90% of my data, because think about all the problems I'd have then: I can't keep up with the writes, I've probably overloaded my resources. You want the system to balance this out for you, and in Mongo's case we do balance it behind the scenes with an automatic balancer. If I add nodes, or even remove nodes, it also rebalances: if I add node number five here, we can move a couple of chunks onto it so things stay even, and if node five somehow ended up with 15 chunks, we can move them out to the others so everyone shares the workload.

The answer to calculating over big data is very similar to the answer for storing it: we need to break our data into bite-size pieces, and we need to build composable functions that we can apply repeatedly to pieces of our data, in one way or another. In Hadoop you can write pure MapReduce jobs, but there are also many tools, like Pig, which hide MapReduce from you; under the covers they're still breaking everything into small pieces. We process this data across many nodes and aggregate the results into a final set. The pieces I'm talking about here are not actually the chunks, although it depends on the system; they're the individual data points inside the chunks. A chunk is a range, say Brad to Brendan, and an actual piece of data would be the documents representing each of those users. But the chunks are a really useful unit for data processing. If I'm looking at the Mongo and Hadoop integration, a chunk is a really convenient piece of work I can hand over and say, here's a portion of my data to work with. In Hadoop these are called input splits: a portion of data to work with, and regardless of how you feed your data into Hadoop, you use input splits. If I'm storing files on HDFS, it still breaks my file into pieces and feeds input splits in for calculation. The more input splits we have, the more tasks we can run in parallel.

The most common approach here, as I mentioned, is MapReduce; that's what we're often using now, although some of this is moving on, and in some cases I think people have come up with better solutions. I'll talk a little about some of the things happening in Hadoop and elsewhere. Mongo has recently introduced an aggregation framework which, instead of using MapReduce, uses a pipeline, actually quite similar to APL. I don't hear anyone shuddering in the room, so presumably you've not been exposed. MapReduce itself is based on a Google white paper as well, and it works with two functions, map and reduce. This isn't a new invention, this isn't reinventing the wheel: map and reduce are functions that have been in Lisp since the 60s. It's taking those concepts and applying them across many data nodes.
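Before we walk through a MapReduce example, here's a minimal sketch of what that aggregation framework mention looks like from the Java driver: a pipeline that groups email documents by recipient and counts them. The 'emails' collection and 'to' field are illustrative, and the varargs aggregate() call is the form the older Java driver used, so take it as a sketch rather than the canonical API.

```java
import com.mongodb.AggregationOutput;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class EmailCountsPipeline {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection emails = client.getDB("app").getCollection("emails");

        // $group: collapse the documents by recipient and count them,
        // roughly the GROUP BY of the pipeline world.
        DBObject group = new BasicDBObject("$group",
                new BasicDBObject("_id", "$to")
                        .append("count", new BasicDBObject("$sum", 1)));

        // $sort: highest counts first.
        DBObject sort = new BasicDBObject("$sort", new BasicDBObject("count", -1));

        AggregationOutput out = emails.aggregate(group, sort);
        for (DBObject doc : out.results()) {
            System.out.println(doc);
        }
        client.close();
    }
}
```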
So the idea is for us to effectively process data at different scales and have composable functions that we can use repeatedly, so we can scale our results: I can add more nodes and these functions can be spread out across them. Hadoop is obviously built around MapReduce for calculation right now, but Mongo can also be integrated here, as I mentioned: we don't use HDFS for storage, we pipe the data directly into Hadoop's MapReduce engine. MapReduce itself is made up of a few phases, the primary ones being map, shuffle, and reduce.

Let's look at a typical MapReduce job very quickly: email records, where we want to count the number of times each user has received an email. We start with a map function. I've got a bunch of fake emails with to, from, and subject fields. The idea of my map function is to break each document into a key, which is my grouping, and a value. The grouping is like a GROUP BY in SQL: it's the piece of information that's unique. So my map function takes a document or a piece of information in and extracts my key; in this case, the email to Tyler becomes the key Tyler with a value of count one, and the same applies for every item in the job. The next step, which in most frameworks is automatic, is the group or shuffle step, where we group like keys together and collect their values into an array. Different systems handle this differently; in Hadoop's case, it picks one machine that's responsible for a key and sends every value for that key from across the cluster to that one machine. So I collapse Tyler, and Mike, and Brendan, each down to an array of every value I've seen for them, ready for the final step, which is reduce. In the reduce step, I want to flatten that array or list down into one result: Tyler becomes a count of two, and the same for Mike and Brendan.

MapReduce gives us a pretty effective system for calculating this; we can get lots and lots of information out. The downside is that you're not guaranteed to get the information quickly, just that you get it eventually. It's supported in a lot of places, though; it's sort of the assembly language for systems like Hadoop right now. So we have answers, or at least rough answers, for both of our concerns: how do I read and write at scale, and how do I run calculations on big portions of this data? This isn't the only way to do it, but it's one solution that's become very popular in the last couple of years; MapReduce is everywhere with Hadoop and everything else.

But batch is not necessarily a sustainable answer, and that's really what MapReduce is: a batch process. You don't get a guarantee of a quick answer, just that no matter how much data you throw at it, you will eventually get an answer. It's a bit of a catch-22. We can get answers from petabytes of data, answers we never used to be able to get; the downside is we can't guarantee we get them quickly. And that's tough sometimes when we're designing systems, testing systems, even testing hypotheses. In a way it's a step backwards, because okay, we may not have been able to handle petabytes of data before, but we had this momentum in the industry where we could get answers fairly quickly.
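Going back to the email example for a moment, here's roughly what that count looks like as hand-written Hadoop map and reduce classes. This is a minimal sketch assuming the input arrives as plain "to,from,subject" text lines; with the Mongo connector the mapper would receive BSON documents instead, so the field handling here is purely illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each email record, emit (recipient, 1).
public class EmailCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text recipient = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume each input line looks like "to,from,subject" (illustrative).
        String[] fields = value.toString().split(",");
        if (fields.length > 0) {
            recipient.set(fields[0]);
            context.write(recipient, ONE);
        }
    }
}

// Reduce: after the shuffle has grouped every value for one recipient
// together, flatten that list of 1s into a single count.
class EmailCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```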
And now suddenly we can handle petabytes but not necessarily get quick answers. We may be okay with that, but the people who pay our bills often aren't. They want answers, and that's changing the way we do things. So we have to evolve, and that's happening a lot across the industry with this kind of processing. The big data world is moving away from slower batch solutions. Google has moved away from batch quite a bit into more real-time things, with Caffeine and Dremel and others, and there are typically Apache projects that emulate many of these once the white papers come out. Hadoop is moving away from MapReduce as its assembly language toward a resource management system called YARN, which stands for Yet Another Resource Negotiator. Now you can ask for storage nodes and job nodes and other things without having to write everything as MapReduce; instead, MapReduce has been implemented on top of YARN. That changes the picture, because Hadoop is no longer just a platform for running MapReduce, but a platform for running whatever we need across large clusters of computers; we can build almost anything we want. I think YARN is still young. There have been a bunch of great articles on the Hortonworks website about the different things you can build with YARN, with some examples, including something I believe is called Kitten, which only did some basic work but demonstrated the idea. You've also got systems like Spark and Storm and similar platforms that are really built around real-time and different sorts of ideas. The downside of some of these real-time systems is that they can only respond to data being fed in, but many of them are integrating with many different layers.

So, quickly, in closing: the world is definitely being eaten by software. Whether we want it or not, everything is becoming software, and software is evolving to take over businesses that might never have thought they'd be software businesses. It's leaving behind a lot of data, and we have to be careful not to step in that data, which means we have to understand what to do with it. More data means more software means more data means more software, and it never ends. We need practical solutions to process data and to store data; that's what's going to save us, or at least get us a little closer to salvation. And as data scientists, technologists, whatever we want to call ourselves, we need to evolve our strategies. We need to think, we need to update our tools; we can't just stick to one way of doing things, because the data is going to keep growing. A petabyte or an exabyte might sound like a lot of data today, but the first computer I worked on 20 years ago had a 30 megabyte hard drive that I never quite managed to fill up. Eventually you're going to get there, and that means your tools are going to have to get there with you.

If you're interested in more about the Hadoop connector we provide, it's on GitHub, along with the docs. I should also point out that we have an office in Barcelona now, with a couple of people working out of it; my colleague Bill here at the front of the room works out of the Barcelona office. So we are doing a lot of work with companies in Spain, and if you have questions about Mongo or other things, we have people available. I'm happy to answer any questions you have. Anyone?
I don't think throwing this at you is going to work very well, so I'll repeat the question. The question was about how the connector feeds data directly in. The input splits are essentially just the addresses of the servers and the range of data to run against, and the actual mapper jobs then use the input format to feed that data into a mapper. The same goes for output: it can feed results back out to Mongo. Other questions? I always feel like I'm about to say, okay, great, and there's one person hiding in my blind spot who's upset that I've ignored them. So if you're that person, yell out now. Otherwise, thank you everyone very much for your time.
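To make that answer a bit more concrete, here is a minimal sketch of how a Hadoop job might be wired up to read its splits from Mongo and write results back. The class names, configuration keys, and database and field names here are written from memory and should be treated as assumptions about the connector's older API rather than a definitive example; the connector's GitHub docs have the current form.

```java
import java.io.IOException;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

public class MongoEmailCountJob {

    // Each mapper gets one input split: the address of a server plus a range
    // of data (roughly a chunk) read directly from MongoDB, not from HDFS.
    public static class ToMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object id, BSONObject doc, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text((String) doc.get("to")), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Where the input splits come from and where results get written back.
        conf.set("mongo.input.uri", "mongodb://localhost:27017/app.emails");
        conf.set("mongo.output.uri", "mongodb://localhost:27017/app.email_counts");

        Job job = new Job(conf, "email count from mongo");
        job.setJarByClass(MongoEmailCountJob.class);
        job.setMapperClass(ToMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```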