Really, folks, first of all, thanks a lot for having me, particularly to InMobi. I'm really glad to see so much interest in Bangalore. It's very exciting. Alright, so I changed the topic a little bit, and I added a little bit of marketing, so please forgive that. I'm from Hortonworks — how many of you have heard of Hortonworks? That's more than I expected.

Alright, so, brief introduction: I'm a founder and architect at Hortonworks. Formerly, I was the architect and lead for MapReduce development at Yahoo. My primary responsibility was to run MapReduce as a service for all of Yahoo — support, configuration, ops, DevOps, whatever. In the ASF, we typically call it wearing multiple hats. My ASF hat is VP of Apache Hadoop, which means chair of the project management committee for Apache Hadoop. I've been doing full-time Hadoop for a little over six years now, pretty much since the project started. I'm a long-term committer and PMC member, and, usefully in this context, I'm also a release manager for Hadoop.

Alright, so before I jump into either Hortonworks or Hadoop, I just want to help set the context a little bit. Everybody has heard the term big data. Be careful, because "Big Data" is actually a trademark of some company, so you'll want to watch out. Alright, so why is big data so important? If you look at any McKinsey, Gartner, IDC, or Forrester report, what people are basically saying is that the amount of data getting generated is overwhelming — overwhelming all existing systems. And this is not a Web 2.0 kind of phenomenon. It's not a Yahoo, Google, Facebook, InMobi kind of world; it's everywhere. If you look at banks and securities and investments, if you look at retail, these guys are all drowning in data.

So what do they do with this data at this point? Any idea? They just drop it. They throw it away. They're like: it's such a big problem, I can't deal with it, I'm just going to forget about it. That is essentially what is happening across the spectrum — look at any enterprise and that's exactly what's happening. And this is from May 2011, which was twelve months ago: the amount of data is growing at 800 to 900 percent per year, and most of it — 90 percent — is unstructured. That's another problem, because all the existing systems were primarily designed to deal with structured data.

So clearly the web portal companies were the pioneers here, and even among them, the biggest use case was actually web search. Most of the Yahoo Hadoop team actually came from web search, and that's an interesting tidbit because it gives you an idea of where the problems were. And when you think about web search, it's as big a problem as any you can think of — I keep saying that web search is the quintessential big data problem. Think about it for one second: if you want to do web search, what do you have to do? You actually have to download the web. This is not BitTorrent; this is not peer-to-peer or KaZaA or whatever. You've got to have a copy of every single web page in the world. Think about that for a second: every single web page in the world, and you want a copy. But wait — one copy is not enough, because you want multiple copies. You want multiple copies because you want to see how the web is changing over a period of three or six months. So even when we started doing Hadoop six, seven years ago, our goal was to keep about 10 copies of the entire web. I kid you not.

Now let's think about processing. It's great to have a copy of the web, but how do you actually implement search on top of it? How many of you have heard of PageRank? Now, how does PageRank work? PageRank works by analyzing the links and trying to figure out the importance of each link. To do that, the first thing you need to do is actually build a graph. In that graph of the World Wide Web, every web page — every URL — becomes a node in the graph, and every hyperlink becomes an edge. So if you go to yahoo.com, at any point it probably has something like 300, 400, 500 outgoing links. You have to keep all of those links; you have to build a graph with them. And if you look at the incoming links to yahoo.com, there are probably billions — think of the number of web pages in the world which have a link to yahoo.com or google.com. That is the graph you're trying to build.
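For reference — the talk doesn't spell the formula out, so this is the standard textbook formulation rather than anything from the slides — PageRank assigns each page $u$ a score

$$PR(u) = \frac{1-d}{N} + d \sum_{v \in B(u)} \frac{PR(v)}{L(v)}$$

where $N$ is the total number of pages, $d \approx 0.85$ is the damping factor, $B(u)$ is the set of pages linking to $u$, and $L(v)$ is the number of outgoing links on page $v$. The scores are recomputed iteratively over the whole graph until they converge — and repeating that full-graph pass over the scale described next is exactly the kind of workload Hadoop was built for.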
Even in 2006, when we started doing Hadoop, what we were trying to do was replace the existing systems that did this. And even in 2006, we had 100 billion URLs in our database — and I don't mean MySQL; I mean a database in the loosest sense of the word — and over 10 trillion edges in that graph. So we had a graph with 100 billion nodes and 10 trillion edges. That is pretty much the definition of big data. So that's the context, and that's where we started off.

And we went out from there: next we did advertising optimization, we did Mail anti-spam, we did user interest prediction, data mining — anything you can think of grew from there. But all of them, even at this point, are still not at the same scale as web search. So the fact that companies like Yahoo and Google had to build a system like Hadoop — Google's version, of course, is proprietary — gives you an indication of where the pain points were. Essentially, they're web search companies. That's the context for big data.

Alright, so that was how we got there. But to make it even more real from a non-web-search perspective: what do we do at Yahoo with Hadoop? A simple use case is personalization of Yahoo.com. If you go to Yahoo.com, it's not a static page. Five years ago it was: everybody in the world in the same geography saw the same Yahoo.com web page. The only customization depended on geography — if you were in California you saw one thing, in London something else, in Bangalore something else. But today it's completely different. Yahoo has, I believe, something like 800 million registered users, probably a billion, and every single one of them gets a personalized page. You, as the user, are not doing any work — you're still just going to Yahoo.com, but it's getting personalized for you. And how do we do this? It's Hadoop. If you think about it, it's actually pretty cool what goes on under the hood.
Next, Yahoo Mail. Yahoo Mail delivers — and this number is slightly old — probably close to 10 billion emails a day, and 90% of it is spam. If anti-spam didn't work, email would be completely unusable. So again, Hadoop is the answer: Hadoop does all of the back-end processing to actually filter spam for you as the user. That's another good use case for Hadoop.

All right. So all of this was about Yahoo — where are we today? Hadoop is pretty much everywhere in the industry, from Microsoft to the University of Nebraska; it's used by pretty much everybody. I was reading this thing where eHarmony claims that, you know, some 20% of U.S. marriages are because of Hadoop. Pretty cool, right? Another really interesting use case: people use Hadoop to predict earthquakes. That's a really nice use case; it's really satisfying.

Okay. So as we go forward, this is where you see big data getting used across industries. Healthcare: I can't name them, but I've talked to at least two different startups in the last couple of months who are actually trying to do personalized medicine. What they're trying to do is take your data — all kinds of data: your DNA, your blood sugar, your heartbeat, whatever it is — and personalize medicine for you; they're trying to design molecules tailored to you. That's a big data problem. For a single user it's not a big deal — a single user is probably at most 10 gigabytes of data — but they want to scale; they want to do it for everybody in the world. And if you multiply 10 gigabytes by 6 billion people, that's a lot of data. That's a big data problem.

Retail: everybody's trying to figure out the best placement for a product. One of my favorite examples: apparently, on Friday afternoons, Walmart places diapers and beer in the same aisle. Think about it — diapers and beer — because they expect fathers to go buy diapers on Friday evenings. That is the kind of impact doing analytics has on your business, and that's what all these guys are trying to do.

Financial services: Visa is trying to figure out whether your credit card transaction is fraudulent or not. A simple example: if you do two credit card transactions separated by more than, I believe, a few hundred or a few thousand miles — if you buy something in Bangalore and the next second you buy something in Delhi — there's a very high chance it's a fraudulent transaction. That's a big data problem. Visa probably handles — I forget the exact number — something like 60 to 80% of all credit card transactions. Think about that: 70% of all credit card transactions going through Visa. That's a big data problem.
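To illustrate the kind of check that fraud-detection story implies, here is a minimal sketch of the "impossible travel" heuristic just described — the class name, threshold, and coordinates are all invented for illustration; this is not anything Visa or Hadoop ships:

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative "impossible travel" check: flag two card swipes whose
 *  implied travel speed is physically implausible. */
public class VelocityCheck {
    // Hypothetical threshold: ~900 km/h is roughly airliner speed.
    private static final double MAX_PLAUSIBLE_KMH = 900.0;

    /** Great-circle distance between two lat/lon points (haversine), in km. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                   * Math.pow(Math.sin(dLon / 2), 2);
        return 6371.0 * 2 * Math.asin(Math.sqrt(a)); // Earth radius ~6371 km
    }

    /** True if the second swipe implies impossibly fast travel from the first. */
    static boolean suspicious(double lat1, double lon1, Instant t1,
                              double lat2, double lon2, Instant t2) {
        double km = distanceKm(lat1, lon1, lat2, lon2);
        double hours = Duration.between(t1, t2).toMillis() / 3_600_000.0;
        return hours <= 0 || km / hours > MAX_PLAUSIBLE_KMH;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Bangalore (12.97N, 77.59E), then Delhi (28.61N, 77.21E) one second later:
        System.out.println(suspicious(12.97, 77.59, now,
                                      28.61, 77.21, now.plusSeconds(1))); // true
    }
}
```

The check itself is trivial; the big-data part is running it across the billions of transactions Visa sees, which is where Hadoop-scale infrastructure comes in.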
So the point is that big data is everywhere; we often just don't recognize it. Okay, I'll just run through this — you guys probably already know it, so I'm going to skip what Apache Hadoop is. One thing I do want to make sure of: in the rest of the presentation, when I say Hadoop, I don't mean just core Hadoop, which is HDFS and MapReduce; I mean the whole Hadoop ecosystem.

This is one of my favorite slides, where we talk about what is known as the Apache Way. A lot of Hadoop's success comes from the fact that it's part of the Apache Software Foundation. The big thing in Apache is what's called "community over code." What it means is: if somebody gets hit by a bus — whether it's me or anybody else — nothing happens to the project. That's why community is important, and Hadoop has a really big open source community. And of course, at Hortonworks and Yahoo, we're really proud to be by far the majority contributors.

So, a little bit about Hortonworks. You guys probably know this, but early to mid last year, Yahoo decided to spin off the Hadoop team. Yahoo embraced Hadoop six years ago at this point — it has epic amounts of data, and this was basically the exact problem Hadoop was built for. The numbers are slightly dated, but at this point Yahoo has close to 50,000 nodes, runs close to 10 million jobs per month on Hadoop, and pushes more than 10 petabytes per day through a cluster at any given point. So in early-to-mid 2011, Yahoo spun off most of its Hadoop team into Hortonworks. There's a good article from Wired about it.

All right, so what are we doing at Hortonworks? Our vision is that within the next four or five years, at least half the world's data will be touched by Hadoop, whether for storage or processing. It's a bold claim, but realistically speaking it's almost inevitable at this point, because there's simply no other option for this scale of data storage and processing — which is really cool. So how do we achieve that vision? It's a big vision. What we want to do is enable a big ecosystem around Hadoop; it's not just going to be Hortonworks or Yahoo or whatever. It's going to be you guys — as users, as vendors, as part of the community. That's how we get there.

Okay, so what are the challenges? At this point, challenges one, two, and three are lack of talent, because Hadoop is a new system — absolutely new. One of the really interesting things about Hadoop is that it's not replacing any existing technology. Think about MySQL: it replaced, or at least attempted to replace, an existing RDBMS. Think about Linux: it attempted to replace existing operating systems that were in use. But Hadoop is not replacing anything; before Hadoop, there was nothing else in this space. And that's why talent is a problem for us and for the ecosystem: it's brand new. What we really need is this notion of a "Hadoop DBA." Everybody knows what a DBA is: somebody who understands the database and is actually an expert at dealing with it. We need the same concept for Hadoop — I realize it's a misnomer, Hadoop is not a database, but we still need that concept: somebody who is a reasonable expert at running it. So what is the status at this point, given the challenges? In spite of all the challenges, almost every single Fortune 500 company at this point has a Hadoop POC.
Which means it's a really useful technology, and it's making a real and meaningful difference to these guys — which is why they're using it, in spite of all the challenges and, frankly, even the maturity of the system. So how are they using it? The void in talent is being filled by boutique consulting firms — we at Hortonworks actually deal with a lot of them, and we can introduce you to plenty if you want. They're niche and expensive, yet very, very profitable. The reason they're profitable is that demand far outstrips supply; it goes back to the talent problem. And the big system integrators — Wipro, Infosys, TCS, Accenture, Capgemini — are still missing, but at some point they'll be there. That's a pretty big opportunity.

So what do we do? What we want to do at Hortonworks is provide technology leadership via open source. Everything we do — every line of code I've written for six years now — has been in the open, and I really enjoy doing that. And frankly, at this point, enterprises are much more open to using open source technologies than they were even two or three years ago. In fact, governments — the British government and even the Indian government, for example — are very keen on replacing their systems with open source, primarily because it takes away lock-in. Because there's no lock-in, they can freely move between vendors; that increases competition, and you as the customer win. That's the economics of it.

And of course, we want to enable the ecosystem. At Hortonworks, we want market pull, not push. What I mean by that is: I get calls on a regular basis from, say, a large bank or an insurance firm saying, look, we really like Hadoop, we want you to help us use it. It's not like we're going to a big bank and telling them what Hadoop does and that they should use it. It's the other way around — they're calling us, which is a good problem to have.

And the way we want to go about it at Hortonworks is to have a fully featured, consumable, standard Hadoop stack, with a roadmap that's open for everybody to see. The reason we do this is not charity — it makes good business sense, because it opens up the market. With proprietary technology — what's known as the open core model — if you really want to use it, you have to pay. We don't want to do that, because what we have seen is that people are still afraid of lock-in because of the proprietary stuff around the core. Even though the core is open, they might not be able to use it at all, because the stuff that's really important sits around it, and it's not free. We have competitors who do that; we don't want to. And this way we also share our roadmap openly.

All right, I'll move quickly through this. So, as we see it, how does Hadoop fit in? You guys are all using Hadoop, but just to set context for everybody else: today, this is what an enterprise looks like.
An enterprise essentially has three disparate systems. It has serving applications — web applications. It has traditional BI and data warehouses: EDW, data marts, BI analytics. And it has unstructured data. These are three different systems, and they're managed differently. What's also important to understand is that it's not just one triangle: in a typical enterprise, there will be ten copies of this triangle, because every business unit within the enterprise has the same stack, and they don't talk to each other. The data mart in the mortgage department is separate from the data mart in the credit card department, and they can't talk to each other. That's a big problem.

So we see Hadoop becoming the connector to all of these systems — and it's connecting not just in two dimensions but in three, because it also connects across your business units. That's a really big deal, because you finally have one place where you can look at all your data, and that's Hadoop. And again, we talked about this: data is growing at something like 800% year on year, and 80 to 90% of it is unstructured. I mean, you all have smartphones. The amount of data being generated from a smartphone — the websites you're browsing, the email you're reading, whatever it is — is all data. We call it the digital exhaust. People are trying to figure out how to do better with that data, and all of it is unstructured; it's not rows and columns. That's a big deal. And in the broader context, this is what it looks like: there's data, there's apps, and there's operations, and Hadoop sits in that triangle.

So, to address these challenges, we have what we call the Hortonworks Data Platform, or HDP for short. Like I talked about, this is the consumable, standard Hadoop platform. Here's a brief overview — it's got all the stuff you're probably familiar with. The interesting one, from what we see, is HCatalog. How many of you have heard of HCatalog? Not many. HCatalog is actually one of the coolest technologies that people don't know about at this point. What HCatalog does is provide table and metadata management. HDFS — or HBase — is where you store your data, but HDFS is just files and directories. It has no metadata: it doesn't know what the rows are, how many rows there are, how many columns there are, any of that. HCatalog provides you that. HCatalog primarily provides that metadata for what's stored on HDFS, but it also provides metadata for what's stored in HBase, or in a traditional RDBMS — an Oracle or a MySQL, for example. So now, using HCatalog — it also gives you storage drivers and everything — you can write a simple MapReduce job, a Pig job, or a Hive job which talks to HCatalog and gets metadata about the data: where everything is. You no longer have to care what your various directories are, and so on. HCatalog provides that information to you.
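As a rough illustration of what "talking to HCatalog" looks like from a MapReduce job, here's a minimal sketch against the HCatalog MapReduce API of that era — package names and the exact setInput signature varied across releases, so treat the details as approximate, and the database and table names are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class HCatExampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "hcat-example");

        // Ask HCatalog for the table "clicks" in database "weblogs".
        // HCatalog resolves where the data actually lives (HDFS, HBase, ...)
        // and hands back the right InputFormat plus the per-record schema.
        HCatInputFormat.setInput(job,
                InputJobInfo.create("weblogs", "clicks", null /* partition filter */));
        job.setInputFormatClass(HCatInputFormat.class);

        // Mappers then receive typed HCatRecord values instead of raw lines;
        // fields can be read by position or by name, using the table schema.
        job.setJarByClass(HCatExampleDriver.class);
        // ... set mapper/reducer/output as usual ...
    }
}
```

The nice part, as described above, is that the same job keeps working if the table moves between stores, because the metadata lookup is HCatalog's job, not yours.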
You just talk to HCatalog and say: give me this table. HCatalog figures out whether it's in Oracle, in HBase, or in HDFS. It'll also give you the right InputFormat if you actually want to process that data. So it's a really big deal.

Then there's Ambari, which is a management system — again, part of Apache. It's management and monitoring: consoles, GUIs, dashboards and all that, for installing the full stack — not just HDFS or MapReduce — for monitoring it, for alerts and so on. It's got integration with Nagios and Ganglia and those kinds of technologies. So you can download this one thing and actually install everything. The important part is that all of this is open source, and all of it is Apache. If you want to modify it, you're welcome to; we hope you not only modify it but also contribute back, because that way all of us win together. It's a standard open source thing.

I'm going to run through these slides quickly. What we do with HDP is let the Apache releases be aggressive, and we take the most stable Apache releases — they come purely from Apache; there's nothing fancy or funky happening there — and we ship when they're stable. We use our relationship with Yahoo to help stabilize releases; we do a lot of QA work and so on. But once it's ready, we ship it. Why is this important? Because if you look at individual projects, they look like sharp arrows — you can cut yourself. So we pick the right versions, we integrate them, we test them, we make sure they all work together, and then we ship that as HDP. Like I said, there's no lock-in. The cool part is you can go to our website, figure out what versions are in there, then go to Apache and download them yourself. There's no difference — it's exactly identical, bit by bit. There's nothing funky; what you're getting is plain open source.

In terms of the distro model itself, we have three rings — think of a bullseye. The center is core Hadoop and the main HDP stack, which is fully Apache, fully open source, and fully supported: we have L1 and L2 support, with 2-hour, 12-hour, 24-hour SLAs and so on. The "universe" is the non-Apache but still open source ecosystem — projects that aren't at the ASF but are still open source. The "multiverse" includes applications you download from third parties; you can download them from our site if you want. An example: we work with partners like MarkLogic and Informatica. Informatica has this really cool parser for parsing, you know, 200,000 kinds of data — CDRs, health records, whatever. We bundle it, and it's optionally installed, but we don't support it; you get support from the third party. So the ring basically tells you what kind of support you're getting.

And where are we right now? We're on HDP1, which is based on Hadoop 1.x. HDP Next will be based on Hadoop 2.x — I'll talk about that in more detail. So this is HDP1; you guys are familiar with it. The highlights: it's the first Apache-based release that supports HBase, security and all that.
When we were part of Yahoo, you know, folks like me and Sharad and Shrikant, we all worked in security. We spent, you know, two years working on any of the strong authentication with the Govros. And that's part of 1.x. It's got edge catalog. And this is like, like I said, this version is a quick shift. HTTP2 is based on Hadoop 2.0. I'll talk more about it. Let's quickly go on. Okay. So in context, I just want to set the context of why HTTP2 is a big deal. As you guys know, Hadoop started initially as a part of a batching notch. And then Yahoo featured up in early 2006. When we started in 2006, Hadoop had two modes. You could not run Hadoop with more than two modes, right? At this point, pretty much everybody else is familiar with it. Initially, we did the, you know, monthly releases, 0.1 and 0.2 and so forth. That was like the wrong time ago. And after Hadoop 0.15, we needed more stability because Hadoop started becoming more and more important in Yahoo. So we started with port releases until Hadoop 20 in, actually Hadoop 20 is fortunately or unfortunately still the basis of all the Hadoop distributions you find today. Whether it's Apache Hadoop 1.0 or it's Sirius 3 or HTTP1, they're all based on the same code base, right? Essentially, Hadoop 1 is security for the brand first WebHT. So that's why Hadoop 2 is like the first major release now in since 2009, actually. That's over three years since you've done absolutely major release. And what do you get there? Sirius, I guess he didn't cover Federation, so I'll do it for him. So Federation, the big deal with Federation is it allows us to scale HTFS even more than what it can predict. It does it by actually separating out the namespace management from block storage. The namespace management is things like files and directories, right? Where the files are, where the directories are. That's why it's the name row. So we split that apart from block storage to just figure out where is that replica. We have extra replicas or too few replicas and so on. So this way it allows us to scale HTFS much better. And the reality of what it means is that from now on with Federation, in a single HTFS cluster, you'll actually have multiple name rows like that. Right? This will allow us to scale much more. It's important to remember that even though we have multiple name nodes, we're not splitting the data. We're not partitioning your data nodes. Every data node will actually split off the multiple name nodes. It'll actually respond to multiple masters. Right? This way there's no cycle of block storage. The next one is what I've been focusing on being Shahgad. Lots of people have actually been focusing on is MapReduce 2.0 or what you call is Yarn. Right? So Yarn is an attempt to take her due beyond just MapReduce. Right? What it means is so far, you have data in HTFS or HBase, let's say. The only option you have to process that data is actually just MapReduce. Right? The only algorithm you can run on that data is just MapReduce. You know MapReduce is great. I really love MapReduce. I've been doing it for 6 years now. But unfortunately MapReduce is not the right answer for all the data processing needs. Right? And they recognize that. I mean if you're doing iterative processing for example, you can do it 10 times faster in an alternate paradigm. Right? So what we've done is we've taken MapReduce and we have generalized it to a point where it's now basically like a distributed operating system. Right? 
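To make the Federation picture concrete before moving on, here's roughly what the client-side configuration for a two-namespace federated cluster looks like — a minimal sketch: the hostnames and namespace IDs are invented, and the standard `dfs.nameservices` / `dfs.namenode.rpc-address.*` keys are set programmatically here purely for illustration (they would normally live in hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;

public class FederatedHdfsConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();

        // Two independent namespaces (NameNodes) sharing one pool of DataNodes.
        conf.set("dfs.nameservices", "ns1,ns2");

        // Each namespace gets its own NameNode RPC endpoint (hosts invented).
        conf.set("dfs.namenode.rpc-address.ns1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "nn2.example.com:8020");

        // Every DataNode registers with BOTH NameNodes; nothing on the client
        // expresses that -- it's the point of keeping block storage
        // un-partitioned underneath the separate namespaces.
        return conf;
    }
}
```

Clients then typically either address each namespace directly or layer ViewFS on top to stitch the namespaces into one logical tree.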
The next big piece — what I've been focusing on, what Sharad and lots of other people have been focusing on — is MapReduce 2.0, or what we call YARN. YARN is an attempt to take Hadoop beyond just MapReduce. What I mean is: so far, if you have data in HDFS or HBase, say, the only option you have to process that data is MapReduce — the only paradigm you can run on that data is MapReduce. MapReduce is great; I really love MapReduce, I've been doing it for six years now. But unfortunately, MapReduce is not the right answer for all data processing needs, and we recognize that. If you're doing iterative processing, for example, you can do it ten times faster in an alternate paradigm. So what we've done is taken MapReduce and generalized it to the point where it's now basically a distributed operating system.

The central ResourceManager is like the brain — the scheduler — and you can now schedule work across nodes. Think of it as Unix: Unix schedules processes, but what runs inside a process, Unix doesn't care — it's just instructions. Same thing with YARN at this point: you can write your own application, which can be anything. There's a standard API — similar to POSIX, you get some standard APIs — and what you do using those APIs is completely up to you. One of the applications is MapReduce; we've written the MapReduce application. But you can write different kinds: an MPI application, or a graph processing application like Giraph or Spark, or whatever it is. All of them can run on the same cluster. So it's a really big deal and something we're very excited about — we've spent almost two years on it at this point, and it's already part of the 2.0 line.

Q: In MapReduce, you have the framework and you plug in your map and reduce logic. In a similar sense, are you saying other distributed processing paradigms will have pluggable code — I just give my logic and it runs in that paradigm?

Yes. Are you familiar with MPI or a similar paradigm? (Not in detail, but yeah.) So you can now run an MPI application within it, which is very, very different from a MapReduce application. Today, in order to solve a problem, you have to think of a MapReduce solution for it — your only option is to break your application down into the MapReduce paradigm; you have to force-fit it. We're taking that constraint out: we're allowing you to use the most natural paradigm to process that data. That's the really big part of YARN at this point — it really opens up the game when it comes to processing data on Hadoop. If you guys want, at the end I can do a five-minute deep dive into this to help you understand it, if there's enough interest.

Okay. After that, the next big important piece is NameNode HA, and of course I'm not going to spend your time on it. Performance: everybody cares about performance. With Hadoop 2, we're getting pretty much 2x performance across the board, whether it's HDFS or MapReduce. We have things like HDFS read/write performance improvements, and MapReduce got a bunch of improvements. One of the important ones: Owen and the team broke the sort record about three years ago, and since this is the first major release in three years, all of those performance improvements are finally coming in. So we get, you know, 30% improvement in the shuffle and so on.

Deployment: this is what you want to get to, so I'll go through these slides fast. What does it take to get there? Clearly testing: benchmarks and integration testing and lots of things. I'll skip over the details, but we benchmark pretty much every part of the HDFS and MapReduce pipeline.
For HDFS, it's throughput and NameNode operations; for MapReduce, it's things like scan, shuffle, and sort. We have something called GridMix — are people familiar with GridMix? Probably not. GridMix gives us the ability to take production traces and run them on a test cluster. That's really important, because now we can test real applications in a test environment; that's what GridMix gives us. And of course, we do integration testing across the stack: HBase, Pig, Hive, Oozie — you name it.

Deployment: we did the alpha testing last year; we were already on 500 nodes. Right now we're in alpha, and a majority of users at Yahoo are actually testing it at this point — I'm talking about HDFS 2.0, MapReduce 2.0, the whole stack. We go to beta, and then production, hopefully in the middle of this year.

So I'm going to spend the last two minutes on this: how do we make money? We make money primarily with support and training. We do full-lifecycle support — L1, L2, L3 — and it's delivered by people who have actually written something like 90% of the code. Training: at least three different courses, both in a classroom setting and on-site. So basically we're a company, and we do things like that. There's some more information up here for you guys. That's it. Any questions?

Q: You spoke about YARN, and you said you can now plug in MPI. I remember somewhere you mentioned Storm or S4 or something like that. Can you explain how that would work?

How much time do I have? Five minutes. So the way this works in the new system is: when you submit a job, the first task that starts becomes the master for that job. Every job has its own master, so for a MapReduce job, you get a MapReduce master. That master is responsible for talking to the ResourceManager and getting more containers — they're called containers; they're the equivalent of tasks. Once you get the containers, you run your work in them on whatever nodes you got them on. So what this means is: the way you would implement MPI or Storm or S4 is to implement an ApplicationMaster for Storm or S4 or MPI. It can then get containers from the ResourceManager and run Storm tasks or MPI tasks or S4 tasks, and it can scale and do pretty much anything it wants in the system. One thing you could even do is launch virtual machines — talk to the ResourceManager and launch virtual machines, like any cloud platform. That's the first part of the answer.

Q: These other paradigms — do they align with HDFS's data storage model, in terms of replicas, data partitioning, and running compute where the data is?

They don't have to, but they can choose to, because underneath it's still HDFS. You can do exactly what the MapReduce ApplicationMaster does: figure out where the data replicas are and try to schedule work on those racks and those nodes. And the ResourceManager will also help: if you ask for a container on node A, it will try to give you that container on node A, and if that's not available, it will try to give you one on a different node in the same rack. So the ResourceManager gives you a lot of support there, and you, as the person writing the ApplicationMaster, can choose to take advantage of that.
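To make the ApplicationMaster-to-ResourceManager conversation concrete, here's a minimal sketch using the AMRMClient helper that later Hadoop 2.x releases ship — the early YARN API was a rawer protocol, so take this as illustrative of the flow rather than the exact code of that era; the resource sizes and node names are invented:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MiniAppMaster {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();

        // Step 1: every job's master registers itself with the ResourceManager.
        rm.registerApplicationMaster("", 0, "");

        // Step 2: ask for a container -- 1 GB, 1 core -- preferring "nodeA"
        // but allowing the RM to relax to the same rack if nodeA is busy.
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        rm.addContainerRequest(new ContainerRequest(
                capability,
                new String[] {"nodeA.example.com"},   // preferred node (invented)
                new String[] {"/default-rack"},       // preferred rack
                priority,
                true /* relaxLocality: rack or anywhere is acceptable */));

        // Step 3: heartbeat until the RM hands back allocated containers;
        // what you launch inside them (MPI ranks, Storm workers, ...) is up to you.
        while (true) {
            AllocateResponse response = rm.allocate(0.0f);
            if (!response.getAllocatedContainers().isEmpty()) {
                Container c = response.getAllocatedContainers().get(0);
                System.out.println("Got container " + c.getId() + " on "
                        + c.getNodeId()); // launch the task here via NMClient
                break;
            }
            Thread.sleep(1000);
        }

        // Step 4: tell the RM we're done so it can reclaim resources.
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rm.stop();
    }
}
```

Whether you actually pin containers to particular nodes, as in step 2, depends on the paradigm.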
In things like MPI, you probably don't care. Storm, you probably don't care. But for other things — if you're writing something like Giraph, for example, doing graph processing — you probably do care. So it depends on the paradigm, and it's up to you. What we think is that over time there will probably be four, five, six standard ApplicationMasters which we will support, written by people who understand each paradigm and optimized for it. But if you're trying to do something of your own, you can of course go figure it out — YARN gives you the APIs; how you choose to use them is essentially up to you as the application. What we expect is that there will be a handful of standard ones, implemented by people who really understand the system and therefore well optimized. Like MapReduce: we've spent a lot of time on it, and the MapReduce ApplicationMaster does all of that work and more. But it depends on the use case. Essentially, this is an operating system; you're writing an application, and how you write your application is basically up to you.

Q: Since you're launching HDP, the full platform, should we expect distros — yum repositories and so on?

Yeah, absolutely. You have not only yum repos, you have RPMs and Debs, across seven different platforms, fully compatible. You can go download it.

Q: Are you making it part of a Linux distribution itself?

Unfortunately, I can't talk about that at this point.

Q: A traditional cluster has a job scheduler, like Torque or something similar. What's your perspective on that?

So this is actually similar to Torque or Moab or something like that, except it does a lot more than any of them. We've come from a different angle: none of the existing ones — SLURM, Torque, Moab — understand data locality. The big difference here is that the ResourceManager actually understands data locality, and that's really critical for MapReduce and for a lot of big data applications. Whether you're doing MapReduce or massive graph processing, data locality matters a lot. So it's similar to Torque or Moab, but more advanced in that it actually cares about the data. That's the big difference.

Q: Is that built into the ResourceManager? You don't have to do it in the ApplicationMaster — the ResourceManager can really decide the locality and how data movement needs to happen?

Correct. You can say, "I want a container on this node," and the ResourceManager will say, "Look, I can't give you a container on this node, but I have another node in the same rack, which is very close, so you'll probably get very good locality anyway." All of those smarts are built into the ResourceManager. That's something no other resource manager actually has.

Q: So now, isn't the ResourceManager a single point of failure?
Yeah, so the question is about availability — the single point of failure. We're actually working on that; we have the code, it's just not yet well tested. What we're doing is backing up all the ResourceManager state in ZooKeeper, so if something happens to the ResourceManager, we can quickly bring up a different ResourceManager, and it can come back online in a matter of seconds. It's not like HDFS, where there's a lot more state to store in the NameNode. Here the state is very small — it's literally something like 256 bytes per container, and there will probably be at most a few hundred thousand containers — so it's very easy to back up.

Q: In that case, how would a backup ResourceManager be brought into the chain? Using a VIP? ZooKeeper holds the state, but how would the clients actually find it?

It would be a virtual IP there; it depends on the installation. IP failover is the standard approach — most enterprises do IP failover.

Q: What about Ambari? It's been, I think, inactive for a long time. What's the roadmap for Ambari?

I wouldn't say it's been inactive, but you should definitely see a release in the next, I would say, 30 days. You should definitely be able to download it and play with it — it'll do all of that: install, dashboards, monitoring.

Q: You mentioned that unstructured data is the majority, so what's the strategy there? HDFS takes any kind of data, but understanding a given form of unstructured data can be very vertical-specific, and without that, not a lot of processing can be expressed.

So that's why HCatalog is important: with HCatalog, you can take unstructured data and still describe it. HCatalog doesn't have to work with only an RDBMS; it works with stuff on HDFS or HBase. That's why HCatalog is very important.

Q: HCatalog will capture the metadata, but what I meant is what you said about HParser. Suppose you want to sell into the financial domain, which deals with credit cards: there's a specific message format, and unless you have interpreters for those, you really can't process the data itself.

So what happens is people have often already written those themselves, and in a lot of cases, they choose to go with something commercial like HParser. Lots of them are available. Whether they're as good as HParser — you know, Informatica's folks worked on it for a long time — depends on the quality, but they're definitely optional.

Host: Sorry guys, can we stop the talk here? Arun will be available sometime after this, so people can catch him. Thank you.