the YARN work that he was instrumental in. So he's really led the platform to maturity and really helped to move it along in the direction it needs to go. So let's go back to those early days when you were developing Hadoop and it was just kind of a thought in your mind. What were your ambitions for Hadoop as a technology, as a platform? Were you looking beyond the application you were looking to apply it to when you were developing it? And did you think it would develop into what it is today and become an enterprise-grade tool?

So Google published these papers about the way they were doing things. And they were addressing the same problems that I was working on in Nutch, an open-source project. So there was an obvious opportunity to take those ideas and improve Nutch fundamentally, which we set about doing. But at the time I realized there were probably uses beyond Nutch. People greatly admired that work that Google had done, but they couldn't use it directly, because it was only usable at Google. Making it an open-source project would probably bring it to more people, and I hoped it would succeed that way. But I liken it to a football play. You've seen the diagram for a football play: there's all these arrows. And in every play that the offense charts out, there's a touchdown. Every defender is taken care of and you score. Now we know that that's not the case. So I obviously hoped and tried to plan for a successful open-source project. You know, I've started a number of open-source projects. Some of them have been successful, some moderately, some more so. But you don't score every time. So I hoped it would be big. I had no idea that it would be as big as it has become. That was certainly not something I was imagining. I was thinking primarily within the context of building search engines and maybe doing log analysis, but not a lot beyond that.

And similarly, Arun, when you were thinking about YARN: in that Wired article I was just reading from a couple of years back, you mentioned how you woke up at 3 a.m. concerned about an issue you had, and that was one of the things that inspired you to develop YARN. Were you thinking just about, how is this gonna help me solve my problem? Or were you even back then thinking about how this could be applied to other enterprises, other business problems?

You know, at that point, the world was very different, right? I mean, Hadoop was being used by a small but very influential number of companies, whether it's Yahoo, Facebook, LinkedIn and so on. For us, for me personally, it was just about let's do things better, right? We all really liked working on the project, we sort of felt like we put a lot of blood, sweat and tears into it, and we sort of all looked at it as our baby, right? We wanted to see the baby grow up, and we wanted to help our users, at least the internal users, right? We were running it as a shared service internally. We wanted to help our internal users do more things with data, and we were seeing that people were twisting MapReduce beyond recognition to do things. They were using MapReduce to do web serving, they were using MapReduce to do machine learning, and that wasn't working well, right? MapReduce is an amazing, amazingly simple paradigm, which is why it's been so successful, right? And obviously Doug helped us, at least initially, understand that we have to keep things very simple, right?
If you don't keep things simple, you're not going to be successful. But then at the right point, you have to introduce the right set of complexity, if you will, and go beyond doing, you know, log analytics and search. And that was primarily, I mean, around the 2009 timeframe. The investment in web search internally wasn't extremely high, but we were also looking at so many use cases beyond just web search at that point. So it was almost staring us in the face.

So let's talk about the people out here in the community and the role they've played and the role they're going to play in the future. You know, for the first few years of Hadoop, there weren't even any commercial companies trying to sell Hadoop-related products. Of course, over the last several years, we've seen just kind of an explosion of startups and even a lot of the larger mega vendors getting involved in this market as well. So how is the future development of the platform going to contrast with its earlier days, when it was maybe not as organized an ecosystem, and not as large for sure, certainly without as much money being pumped into it? How is that going to impact the way we see this moving forward? Is there tension between some of the vendor community and the open source community? How's that going to play out, do you think?

I think the vendors are motivated by the users. So there's a level of indirection, but it's still the same fundamental motivation. People have itches, and the open source project is a way to address those, to scratch those itches. And typically, you know, some companies will invest in it directly and contribute to the open source, and some companies will complain to their vendor, and the vendor will invest and improve the project. But I see it continuing to grow organically through a combination of vendors and folks who are getting involved directly. We have a lot of customers at Cloudera who contribute as well. They get support from us, but in some areas they also get so involved in a project that they'll be making additions and changes to it. And so the two aren't even exclusive. So I don't see a real fundamental change with the presence of vendors in the space. I think they give a way for people to, in some ways, fund more development and direct it without having to hire all those people, so you get some economies. But it's still motivated by need, I think, more than anything.

Arun, what's your take?

You know, on Doug's point about users, I think especially if you're in this room and you used Hadoop four or five years ago, I can't thank you guys enough, right? Without the users, the software wouldn't be useful at all. It wouldn't have had that level of feedback, which was important to actually fix it or improve it along different dimensions. I think as more people get involved, it's only gonna get better. In a lot of ways, I think Hadoop got really lucky because it's housed in the Apache Software Foundation, and that plays a really big role in terms of helping create an environment where we can compete with each other as companies, but at the end of the day, we have to act as individuals in the ASF, right? And you're seeing more and more investment. You're seeing investment from Cloudera. You're seeing it from Hortonworks. You're seeing Microsoft invest in Apache Hive. That wouldn't have happened otherwise. You'll see IBM show up. You'll see all these guys show up.
A lot of the reason is the Apache Software Foundation. So I think it's actually in good shape, and it's only gonna get better.

And what do you need to do to keep the community aspect vital? How do you keep that going? Because it's a critical component, obviously, of the development of Hadoop to this point, and as you've both said, it's gonna be a critical component going forward. How do you maintain the vitality of that community as it starts to grow, as new projects start to come about that are associated with Hadoop, maybe not core to it? How do you keep that community vital and active?

I mean, as Arun sort of indicated, Apache is really designed to foster ongoing, thriving, living communities, and it's so far proven very successful at that in the Hadoop space. We see, whatever it is, 15 or more projects today and more on the way; almost every month there's a new one started. So I see that continuing. I think it's very healthy that we have a large number of independent projects that integrate together loosely, though. They're not tightly integrated, which permits the whole ecosystem to evolve. There is no central point of control. To some degree the Hadoop project is the central point of integration today, but there's no reason long-term that it has to be. The community can really vote with its feet, depending on what things it installs, and the whole ecosystem can evolve, which I think is a very strong, resilient architecture. And I think Apache's given us the foundation for that.

And to add to what Doug said, it's also about the users. We have to continue to make Hadoop easy to use, easy to consume. To take a small example, if we look at the mailing lists that we have, we probably have 10,000 people or so on the mailing lists across all the projects. That's an indicator of the amount of interest in an open source project. There are other examples of projects, like Linux, which have been around for a long time and have been very, very successful. So I think the key part is to make sure that we adapt to the needs of the users and make sure that the software stays relevant on a daily basis to these guys.

So let's talk a little bit about the technology as it applies to the enterprise. How do you see it playing out today? Where do you see Hadoop fitting into the larger architecture of a more traditional enterprise, not necessarily a web company? And how do you see that going forward? I mean, you've got different views. You've got some that say Hadoop is going to be the center of the data management architecture and infrastructure. Others say Hadoop is more of a complementary approach to some of the things you're doing now. How do you fundamentally see Hadoop as it evolves over the next five years, and what role will it play inside the enterprise?

There isn't one answer for every institution. Everybody moves differently. Some people, more often small new companies but sometimes big companies, will really standardize and centralize all of their data in Hadoop and move more and more things to it. Other folks will deploy Hadoop in point applications where it's providing clear value, and there's everything in between. I think the platform is really designed to permit people to affordably store more data in one place with a variety of tools. And so that enables that first model, which really wasn't available before to people in enterprise technology.
And we think that's a very exciting direction, and that's going to be an ever bigger one. But it won't be universal, I don't think, ever. I mean, there's always going to be niche systems, particular things that are going to live outside of that hub of data. But we think that's the trend. It's a long trend. We're in the early stages still. Most folks are getting started. But I think most folks are also seeing that as a direction they want to move towards: more centralization, more variety of services on the same shared data set that this platform provides. So I think we need to plan around that and engineer to try to facilitate that. But we also need to facilitate interaction with existing tools and with external tools, because that's a reality for now and it will be a reality into the future as well.

So I came over from a classic Web 2.0 company where we had all the license in the world to reinvent the wheel and the rest of the bike. But as we go out to the broad enterprise, we are absolutely aware of the fact that people have investments, both CapEx and OpEx, but equally importantly, skills that have been around for a long time. People have been doing SAS and all these things for 30, 40 years. They're not gonna turn around tomorrow and start writing Java MapReduce code. So I think it's very important for the platform to be broadly applicable, that it really, really plays well with the rest of the ecosystem. At the end of the day, what we wanna do is provide value. What Hadoop has shown is that there's a way to store and process data at scale, and when you can do that, people start to look at data in a different light. People start to understand that there's inherent value in data, and there's probably inherent value in storing it and not dropping it. There's a joke going around that it's cheaper today to store data than to make the decision about what to throw away. That's what Hadoop is showing, and once you get there, we have to work well with the entire ecosystem, whether it's SAS or SAP or R or Teradata or Microsoft.

Well, right, I mean, so there's a lot of talk about the impact of Hadoop on some of the more traditional vendors. Is Hadoop a threat, do you think, to the data warehouse industry, for example? Should they fear the rise of Hadoop or should they embrace it?

We see a lot of people today moving workloads from data warehouses into Hadoop. Not necessarily all of their workloads, but they're attempting to limit their data warehouse spend. That's a common phenomenon we see today. We don't see a lot of folks replacing their data warehouse outright today. Whether that might happen in the future, it seems possible in some places that that'll happen. Whether it'll happen everywhere, whether there'll continue to be an edge that the data warehouse retains in certain cases, we'll see. But even if you just take the case today, that's a threat. If you've got a lot of people no longer increasing the size of the data warehouse, but rather capping the size or potentially even decreasing their investment, because they find they can do much of the processing as effectively and much more affordably in a Hadoop-based system, then I think that's a threat.

And what do you think?

I think what we're seeing is, historically, people have done everything in their data marts and their warehouses. You've had what I look at as class A to class Z workloads in the marts and the warehouses. In a lot of cases, the curve stops at M or N, and they just drop the data.
What they're seeing now is that if you can actually store and process all the data, there is inherently more value you see from the data. And in some cases, we are seeing people actually get more value from their data warehouses and data marts. So we definitely see that happening. Long-term, I think the key part for Hadoop is to stay relevant. Hadoop today is the same project, the same name, but it's not even nearly the same set of people. Of the people who started work on it six years ago, a lot have moved on. Hadoop has to stay relevant if it is to succeed for the next 10 years. And technology is interesting. Things could change. There are newer storage paradigms coming out. There are new compute paradigms coming out. Memory is getting larger and larger. So long-term, it's hard to predict sitting here in front of 3,000 people, because we all stand a chance of looking stupid five years from now. I don't want to be the next John von Neumann.

Well, fantastic. Thanks, guys. We're just about out of time. Just kind of a last question. And as you say, it is a little difficult to prognosticate and think about where we'll be in five years, because any number of things can happen. But nevertheless, I'm going to ask you to do that. Final thoughts. I mean, we're at this show five years from now. What is this going to look like from an adoption perspective and in its role in the larger enterprise? Arun, why don't you start?

Sure. I think what I'd like to see happen is for Hadoop to get to a point where it's sort of the key part of your data center. I'd like to see a point where you can install Hadoop, where you can just lay out Hadoop across the data center in an easier fashion than today. And it can manage different classes of storage. We're sort of on that journey right now, whether it's fast disk, slow disk, SSD, memory, different kinds of memory, NVRAM, and so on. And then, equally importantly, manage all the resources, whether it's CPU, GPUs, disk, flash, and so on, and sort of provide a data operating system, or data center operating system, if you will. And we want to make sure that you can continue to store data at scale at a low cost, and then plug in engines. And equally importantly, plug in all the skill sets that you have, all the tooling that you have, because the open source community is never going to be able to provide all the tooling that you want. So you want to be able to integrate with the rest of the ecosystem. The key part is we want to make it easy for people to access and compute on data and sort of get value from it.

So, in five years... I mean, Hadoop has basically been in enterprises for five years, a little more than that. And it's changed quite a bit. It's instructive to look at that. It's also instructive to look at what's stayed the same. So five years ago, people were finding valuable uses for it. They were also finding rough edges. And they were working to address those rough edges at the same time as they were getting value from it. Back then it was in very few places. Now it's in most institutions, in a small way. It's not deployed across all their systems. I think in five more years, we'll see it widely deployed in most data centers of big institutions, harnessing a good fraction, above 10% certainly, of their data needs, and probably considerably more than that. But there'll still be rough edges. That's a sign of a living product. It's like a road: you add a lane, and you just move the bottleneck in the traffic somewhere else.
And so we'll still be trying to make certain things easier. I mean, today we talk about the Lambda architecture, but it's not particularly easy. In five years, I hope that will be a natural thing, that people can easily deploy models that are updated in real time and that are also based on lots of historic data, the way it should be done. But there'll be other things people will be asking for. And that'll continue as long as it's, I think, the mainstream technology, which I believe it has the potential to be for decades, because of this loosely coupled, non-centralized architecture of the projects. So anyway, we'll still be here. Folks will still be complaining, still be adding features. But I'd say it'll have a lot more actual use in institutions. People graduating from universities today are familiar with it and starting their careers. In five years, those folks will be further along in their careers and will have accepted this as the norm. And it takes time for technology to turn over in institutions, and it takes time for people to turn over, and the two are connected. And so I think we're going to have at least a generation for whom that's the default technology.

Absolutely. Well, it's going to be fun to watch. So, guys, thanks so much for joining us on stage today, Arun Murthy, Doug Cutting. Let's give them a hand. Thanks, Jeff. Jeff, thank you. Thank you for moderating. Thank you very much. And Doug and Arun, thank you both for doing this, and also for everything you've done for the community and for Apache Hadoop. I want to give these guys a round of applause. They really deserve it. Thank you.

OK, so now we're going to have Oliver Ratzesberger from Teradata come up. Oliver runs software engineering for Teradata. And as you know, Teradata is one of the providers in the data space, but they were also very early in terms of how they thought about Hadoop, the unified data architecture, and how you start to incorporate Hadoop into the overall data architecture, the data warehouse, discovery platforms, et cetera. And Oliver will give an interesting view of how they're looking at the world and what's happening with that. Oliver, come on out. There we go. Hi, thank you. Here you go. Thank you.

Good morning, everybody. My name is Oliver Ratzesberger. I'm with Teradata. I'm responsible for software development for everything that we do with Teradata, Aster, or Hadoop, and the integration software components that we write around them. I spent seven years at eBay learning from scratch, basically, what it means to do big data at scale, responsible for things like the data warehouse and for integration with technologies like Hadoop. I spent two years at Sears taking some of these principles and concepts into a very old retailer. And for the last year, I've been responsible at Teradata for software development, really building technologies around the integration of these different technology stacks. You can follow me on Twitter.

So let's talk about what most companies are facing today. There's a huge amount of choice, whether it's cloud, whether it's big data, whether it's NoSQL, whether it's Hadoop. There are different distributions. There are different projects within the open source communities. How do I pick? What do I choose? How do I build a foundation for a company? And if you throw large-volume big data into the mix, it just gets bigger and more complicated. Sometimes companies think that by adopting big data, you become agile.
I think it's actually the other way around. You have to first be agile. You have to first master how to be fast with data, and then do it at massive scale in order to be successful. So let's have a look at what we at Teradata do to make that happen. The architecture that we propose is what we call the unified data architecture. It brings together different technologies, and I have some examples here for you today to show how that actually works and how our customers are using it in their environments.

As part of the unified data architecture, the data lake plays a really important role. And Hortonworks' HDP 2.1 and YARN really take the data lake to the next level. For us, the data lake is kind of layer zero in a layered data architecture, where you start acquiring data as quickly as possible, as agilely as possible, but also with the right metadata and the right lineage tracking in place, so that you can come back at a later point in time and remember what you actually stored there. With 2.1, we see Hadoop really moving to the next phase. As Doug and Arun just said, and as we heard earlier today, I think the introduction of YARN has really opened up Hadoop to become a much more general-purpose big data platform than it was before. MapReduce had its use cases and its ability to do something very simple at big scale, but it also has quite a few boundaries and limitations. And I think YARN moves Hadoop well beyond those limitations and allows us to take on new use cases, new abilities, and bring new platforms into the ecosystem.

So what do we at Teradata do with Hadoop, and especially with 2.1 and with YARN? We just recently released a new product that we call Teradata QueryGrid. It's the ability to bring together different platforms natively through a self-service architecture, eliminating the need for manual data transfers and really integrating different types of analytics. And I have some examples that I want to take you through on this. The big difference between traditional data movement and QueryGrid is that traditional data movement was tickets, IT people, opening firewalls, exporting files, doing it in batch. It's fairly slow, and you can do it maybe once a day, maybe a couple of times a day, but that's basically it. QueryGrid takes data integration to an entirely new level. We do it dynamically. We do it on demand. We do it in memory. We bring result sets from different processing platforms together where they need to come together, where the best capabilities of the platforms are. And we do it from within a very simple workflow. And I have an example here that I want to take you through and show you.

So let's take a look at how QueryGrid works. I have a little animation here for you that I want to show you... let's see if we can get this started here. Oh, no, that's too far. Let's try this again. Can somebody start that animation, please? OK, that is not working. Well, let me tell you what it should be showing you. What we have done with QueryGrid is we have taken Hadoop, and we have taken technologies like Teradata and others, and brought them together at a level where, if you put these two systems on a common interconnect, you can pass down a request for data from within, for example, Teradata into Hadoop.
The Hadoop system and the Teradata system will automatically negotiate the most scalable way to talk to each other and will exchange data on the fly at interconnect speeds. Now, that can be existing file splits that sit on Hadoop and need to be brought, for example, into a query on Teradata. If you have a small system, you get gigabytes per second. If you double that, you get tens of gigabytes a second. If you really scale that out and take it to the limit, you will find that with very, very large clusters, you can move up to 10 terabytes a second between different types of processing clusters. So this is not the old "export a file, wait a couple of hours for 500 gigabytes of data to be copied somewhere" routine. This is real-time integration.

And the way it works, it also supports pushdown processing. The example that I have for you here is that we support embedding, for example, Hive logic into the pushdown, where the Hadoop cluster will first process the Hive query, create the result set, and use that high-speed interconnect to bring the data back into memory for further processing. So this really combines very different technologies, whether it's Teradata and Hadoop, whether it's Teradata to Teradata, procedural off-the-grid processing, traditional SAS or R processing, all brought together in one query grid.

And I have an actual example that I want to show you here today. We heard Tom talk about cars earlier today and about the amount of data that they produce. This is an example here; we're missing half of the text here too. This is an example of hybrid cars and electric vehicles, how they are selling within the United States and worldwide. There's a total of more than 3 million battery-assisted or battery-powered vehicles out there. Each of these vehicles has hundreds of sensors. Within the next few years, we'll probably have billions of sensor readings for these vehicles. And the example that I have for you is the following. What you see on the left side is one type of hybrid car battery, the way they look before they get put into a car, when they work, when they operate. The right side is when something went wrong. It's actually not the most catastrophic case. This is just when they start overheating and when they start buckling. The problem with batteries is that the power density of these batteries has gotten really, really high, and we're approaching cars that will soon have more than 100 kilowatt-hour batteries within the vehicle. And the problem with these batteries is that they need to operate in a very narrow operating range. You can't discharge them too far. You can't overcharge them, not even once. You can't operate them outside of a certain temperature range. And so naturally, that produces a lot of data.

The example that I have here for you is that we want to analyze data that we have from service logs, from service visits, from customers, where we download freeze frame data from some of these vehicles and combine it with data that sits in Hadoop. Again, we have half of the slide missing here. Sorry for that. OK, let me see if I can get past this here. OK, none of the slides are coming up right now. OK, so while the technicians try to figure out the small data system in the back here, I'm going to just talk you through it. The example that I have brought for you is actually a single query that accesses and joins data that sits on Teradata and within Hadoop.
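[For reference, a rough sketch of what such a single federated query could look like from a client application over JDBC. This is an illustration under stated assumptions, not confirmed QueryGrid syntax: the foreign-server name hadoop_hive, the @server suffix on the foreign table, and all table and column names are hypothetical placeholders.]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch only: assumes the Teradata JDBC driver is on the classpath and that
// a foreign server for the Hadoop cluster has already been defined.
public class QueryGridSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:teradata://tdhost/database=analytics", "user", "pass");
             Statement stmt = conn.createStatement()) {

            // One SQL statement that joins sensor data in Hadoop with
            // warehouse data in Teradata; the Hive side would be pushed down
            // to the cluster, and only the result set streams back over the
            // interconnect.
            String sql =
                "SELECT s.vin, s.max_cell_temp, v.last_service_date " +
                "FROM sensor_freeze_frames@hadoop_hive s " + // hypothetical foreign server
                "JOIN service_visits v ON v.vin = s.vin " +  // local Teradata table
                "WHERE s.max_cell_temp > 60";

            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("%s %.1f %s%n",
                        rs.getString(1), rs.getDouble(2), rs.getDate(3));
                }
            }
        }
    }
}
```

[The point is simply that one SQL statement spans both systems: the Hive portion runs on the cluster, and only its result set travels back for the join.]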
With Teradata 15, we have introduced the concept of foreign data stores and foreign servers that you can set up for a system, leveraging the high-speed interconnect between those systems to exchange data. And once you set them up in the dictionary of the Teradata system, you can access any Hive query, any data set that sits on Hadoop, from within the SQL dialect of Teradata. You can even, within a derived table, put a Hive query straight into that query and operate it as a single query workflow. What the system will do is pass the pushdown processing down to the Hadoop cluster, create the data splits of the result set, and bring them in real time back into the temp and spool space of the system. And again, we can do that at up to 10 terabytes a second on very, very large-scale systems, and then do the joins with the structured data that already sits in Teradata. And so what used to be multiple people filing tickets, exporting data, and waiting for hours to move data from one system to another can now be done from within a single query, from within a single app. It leverages all of the security and workload capabilities that Teradata has, because you can wrap every single request that you send into Hadoop in our workload management framework.

OK, I'm just going to give up on the slides here, I think. The other thing that I quickly want to talk about is that this week we introduced our new Teradata portfolio for Hadoop. We just released Hadoop 2.1 as the first appliance of its kind in the industry. Why are we doing appliances for Hadoop, you might ask? For a lot of our customers, it's really a time-to-market question. I need to be fast. I need to stand up a Hadoop cluster. I need to have it in production in a week or two. I cannot take six months, nine months, to learn how all the individual components of this infrastructure work. So what we have done is basically prepackage Hadoop 2.1 with the workload management capabilities, and with what we offer with QueryGrid from Teradata, as a solution that you can roll in and get going with within days rather than months. We think that Hadoop 2 is really an important milestone for us in order to integrate the different capabilities of these platforms. And we want to allow our customers to get an enterprise-ready solution from a single vendor, one that integrates not only Teradata, not only Aster, but also Hadoop and other technologies for them through our capabilities.

With that, one more test. No, we won't get it back here. OK, so with that, this is a very quick overview of what we have done at Teradata. Unfortunately, the slides didn't do it any justice here. I welcome you to see us at the booth in the exhibition hall. You can actually see that live demo, and you can actually see the query. There's actually a system installed with Teradata and Hadoop on it. You can see the exact telematics freeze frame data integrated, so go check it out. And last but not least, I encourage you to check out our Teradata careers web page. We are hiring very heavily in the data science and Hadoop space. And so if there's any interest for you in building enterprise-scale technologies, we welcome you to take a look, and we hope to see you there. OK, thank you.

Thank you, Oliver. Appreciate it. OK, so what we're going to do now is switch over and go through the view of, as you look at data lakes and you see data getting distributed, what are other ways you can start to manage it, and how do you look at streams, ponds, et cetera?
So with that, I want to bring out Shaun Connolly. Shaun is VP of Strategy at Hortonworks, and he'll be taking you through this. Shaun will be coming out on this side. Thank you. Shaun, welcome. Always good. Shaun's always entertaining, so I know we're in for a treat here. There you go, Shaun.

Put the pressure on me, Herb. Welcome, everybody, to Hadoop Summit 2014. It's definitely grown very large. I thought it was great to see Doug and Arun on stage earlier, really talking about the genesis of where all this started. I joked with both of them back in the green room beforehand that what would have made it a little more exciting is if they were dressed up in those big, giant sumo costumes and bounced off each other. So we didn't get that part of the show or story, but hopefully people came away with something to think about.

As far as my journey into this enterprise Hadoop space, I joined Hortonworks towards the tail end of 2011. And people talk about traditional Hadoop versus this YARN-enabled wave that's upon us. But if I reflect back on even just 2011, 2012, where things started, some of the sentiment when people were truly trying to wrap their heads around Hadoop and this broader ecosystem of technology was: it wasn't a thing, it was a bunch of stuff. It was HBase, HDFS, Pig, Hive, a bunch of components. My favorite was a bunch of animals in the zoo, including the zookeeper with his shovel and things like that. There were technologies like Pig, and Pig speaks "igPay atinLay," as far as the Pig Latin dialect goes. And really the sentiment around that was people trying to envision: is this going to become a platform that will be relevant to the mainstream? And there was a sentiment of nothing at the center that kind of holds it all together, like an operating system.

So hopefully between yesterday and today, if we've focused on something, it's that YARN has fundamentally changed how we think about Hadoop. And I just couldn't help myself on this Batman meme. Somebody posted one of these, I think, like nine months ago, and I bookmarked it in my head. I was like, I have to use it for this audience today. But basically, YARN is about enabling more workloads, more data, and more value. And so we talk about batch, interactive, and real-time workloads. I think fundamentally people are very familiar with traditional Hadoop around the batch workloads, and we talk a lot about interactive and real-time. And so: Hadoop does interactive and real-time. And while I'm another person from the East Coast and I'm not from Missouri, I do have a healthy dose of show-me when we talk about things like interactive and real-time. So I tend to like to head to the demo kitchen when we talk about these new things.

And so at this point, I'd like to bring up George Vetticaden, who's from the Hortonworks Solution Engineering team. So, everybody, welcome George to the stage. We're going to allow George to get set up for his demo. So while I allow him to sort of get his demo functioning and those types of things, let me set the context for what the demo is going to cover. And again, I really want to change the thinking on how we think about Hadoop. It does interactive and real-time. And so, setting the stage: we have got a trucking company with a large fleet of trucks in the Midwest. A truck generates millions of events. Some of those events are normal events: start, stop. Others are violation events: speeding, driving too close, weaving in and out, those types of things.
And the company uses an application, a real-time application, actually a map view as well as a table view, as we'll see shortly, where they monitor the trucks on their routes as they're driving and track the real-time violations. The dots that represent the trucks, you'll see them getting bigger if there are more violation events versus normal events. As far as the interactive part of it, as far as interactive querying, all that history is stored for months and months, so we can do some analysis around: are there certain routes that cause issues over time, or trucks with maintenance issues, or are there drivers that are an issue as well? So there's an interactive querying piece of the demo.

If we look at the ingredients of the demo... and we'll go back through. Nice, crash. If we go back to the ingredients of the demo while we wait for the screen to come back: we basically have Hadoop and YARN at the base, right? And we have our really cool trucks that are emitting their events. Those events flow into Kafka, which is sort of an inbound messaging system. It will route the events down into Apache Storm for the real-time stream processing. Storm will store those events directly in HDFS for our historical view, but it will also funnel the events over to HBase for populating a table view of a couple of days' worth of data, or what have you. So you have the most recent history that you can report on, and then alerts will go up through, in this case, ActiveMQ, to deliver to a real-time mapping application the dots, the locations of the trucks, as well as the violation events on those trucks. And then on the flip side, we'll look at interactive query with Hive 0.13 on Tez to interact with, I think it's over a terabyte of data. And again, this is a live demo that's hitting a real back end, so fingers crossed, as always, whenever you do a live demo. And we'll be using Excel and PowerView in order to ask questions of the data from an interactive query perspective.

So with that, I wanna hand things over to Chef George here. He makes a good meal, as I understand it, not only in the demo; he's from the Chicago area, and they actually have good food in that area. So George, I'll hand it over to you and you'll walk us through things.

Sure, thanks, Shaun. All right, let's get the map up. And those trucks are moving really fast. All right, so what you're seeing here is a web application, a real-time dashboard, right? And it's powered completely by Hadoop and by YARN. This web application is actually used by a trucking company's support team to monitor this fleet of trucks in real time. So each of these circles that you guys are seeing represents an individual truck on a different route, somewhere in the Midwest region. And I know the trucks are moving fast. That's the point of the demo, guys, okay? So whenever a violation event occurs, and a violation event could be speeding or unusual tail distance and those kinds of things, an infraction event gets immediately generated and rendered through this application, okay? And if you notice, in particular there's one circle, the brown circle, that's considerably larger than the rest. What that indicates is that this particular driver has a significantly increased number of violations over a shorter period of time and needs to be closely monitored. So this real-time view into the driver's activities and into the truck's activities, right, is completely managed and powered by Storm.
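[As a rough illustration of the filter-and-threshold logic described next, here is a minimal Storm bolt sketch in Java, using the 2014-era backtype.storm API. The field names (driverId, eventType), the stream names, and the threshold of 5 are hypothetical stand-ins, not the actual demo code; note also that the in-memory counts here are not fault-tolerant.]

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.HashMap;
import java.util.Map;

// Filters truck events down to violations and raises an alert once a
// driver's violation count crosses a (hypothetical) business-rule threshold.
public class ViolationFilterBolt extends BaseBasicBolt {
    private static final int ALERT_THRESHOLD = 5;           // assumed rule
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String driverId = tuple.getStringByField("driverId");
        String eventType = tuple.getStringByField("eventType");
        if (!"NORMAL".equals(eventType)) {                  // drop start/stop events
            int n = counts.merge(driverId, 1, Integer::sum);
            collector.emit("violations", new Values(driverId, eventType));
            if (n > ALERT_THRESHOLD) {                      // push an alert downstream
                collector.emit("alerts", new Values(driverId, n));
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("violations", new Fields("driverId", "eventType"));
        declarer.declareStream("alerts", new Fields("driverId", "count"));
    }
}
```

[In the pipeline as described, a Kafka spout would feed a bolt like this, with downstream bolts writing to HDFS and HBase and publishing alerts to ActiveMQ; the sketch shows only the filter-and-threshold step.]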
Every event that's generated by the truck flows through the Storm topology. And what Storm does is listen and filter for certain of these violation events and push them to the web application. It also has business rules configured such that if the number of violations for a given truck exceeds a certain threshold, it pushes out alerts. This web application subscribes to those alerts and renders them by increasing the radius of the circle. So if you look carefully at that particular brown circle, there's a message to the end user, to the support team, that we need to contact that driver immediately. So let's go and do that, right? Let's say the contact team radios to the truck driver and says, what's going on? I see an increase in the number of violations. And the driver says, these violations are not because of me, but because of the route and the truck you've given me. The truck is old, it needs to be serviced, and that's the reason for these violations. So the analyst team at this trucking company has some root-cause analysis it needs to do. What is the cause? Is it the truck? Is it the route? Or is it the driver himself? So let's answer these questions by using a BI tool that almost everyone in this room is familiar with: Microsoft Excel.

So I'm gonna switch to that view. One of the first things I'm gonna do within Excel is connect to the Hadoop cluster and execute an interactive query using Hive on Tez, where I wanna have all of the violations that have occurred within the last six months. So the first step is selecting the data source. The second step is selecting the table where all of the events from the last six months exist. And the third step is applying my predicate filters: I wanna capture every event that is a violation event. And I'm gonna go ahead and execute this query. It's important to know that this query is actually executing on the Hadoop cluster itself. And the result set that I get back is what you guys see in this Excel sheet, right? So keep in mind, the results came back so quickly because we're taking advantage of the new interactive query engine, Tez, that you've heard a number of folks talk about.

So now we've got the data set. What's the first question we need to ask? Are there specific routes that are more susceptible to violations? To answer this question, we're gonna use a capability in Excel called PowerView Maps. It's a data visualization and exploration tool. And what I've done here in this view is, for every violation in the last six months, I've mapped it onto this map based on the lat and long coordinates of where that violation occurred. So you see things like St. Louis to Memphis, Springfield to Columbia, another Springfield to Columbia route, and so forth. And what you can infer from this is that there is not one specific route that seems to incur more violations than others. The answer to our first question is no. It seems like all of these routes are equal-opportunity offending routes.

So let's go to the second question. Is there a subset of trucks that is more troublesome than others? In the view that you see here, the x-axis is the trucks, and the y-axis represents all of the violations that came from that specific truck over the last six months. And as you can tell, there's definitely a subset of trucks that seems to have an increased number of violations. That's good.
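[For readers who want to reproduce the shape of that interactive query outside of Excel, here is a hedged sketch over the HiveServer2 JDBC interface. The table and column names (truck_events, driverId, eventType, eventTime) and the literal date cutoff are hypothetical placeholders for the demo's actual schema; Excel itself goes through the Hive ODBC driver, but the query shape is the same.]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical schema: truck_events(driverId, truckId, routeName, eventType, eventTime).
public class ViolationHistoryQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; Tez execution is typically enabled server-side,
        // or per session with: stmt.execute("set hive.execution.engine=tez")
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "demo", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT driverId, truckId, routeName, COUNT(*) AS violations " +
                 "FROM truck_events " +
                 "WHERE eventType <> 'NORMAL' " +       // predicate filter: violations only
                 "  AND eventTime >= '2014-01-01' " +   // illustrative six-month cutoff
                 "GROUP BY driverId, truckId, routeName")) {
            while (rs.next()) {
                System.out.printf("%s %s %s %d%n", rs.getString(1),
                        rs.getString(2), rs.getString(3), rs.getLong(4));
            }
        }
    }
}
```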
So our next question becomes: are there specific individuals driving these trucks who are causing a majority of these violations? To answer this question, what we can do is superimpose the driver data on top of this view. So let's go ahead and do that. If you look at this, there are a couple of things that are very telling. First, for the subset of trucks that caused a lot of violations, there were a number of drivers who drove each truck. The second is, there's one driver in particular who caused a majority of those violations across all those trucks. And this driver is what's represented here in green. So it wasn't the route, and it wasn't the truck; it was one driver, across a number of trucks, who caused the most violations. I think we found our rogue driver.

Absolutely, thanks a lot, George. I think you illustrated a little bit of the real-time and interactive nature of things. And if we have a key takeaway here, and my inner child always comes into play, the key takeaway is: watch out for that rogue driver. Just a little inside baseball for you guys. I think I just threw George under the truck, so to speak. That's George's boss, Jamie Engesser, the rogue driver. So watch out for him, I think he's here. All right.

But no, the real takeaway is, YARN enables the broadest set of use cases. And we just saw a set of use cases that, if you think about Hadoop from a traditional perspective, you wouldn't have thought of before. And so interactive SQL, NoSQL, streaming, search, in-memory machine learning and interactive analytics with Spark, all these types of workloads come in. Not only open source; we also heard from some of the other commercial vendors about the excitement of plugging in their unique data access engines to get the benefit of data locality and running natively in the cluster, against all this data that they may bring in but also interact with for broader analytic use cases, right? So common data, many apps, and an architectural center provided by YARN to bring it all together, so you can run it all in one spot if you choose to do so.

So when I think of enterprise Hadoop versus traditional Hadoop, I think of more of a blueprint for that, where YARN is definitely at the center and you have the access and data management capabilities that surround it. But this is why these notions of governance and security and operations are pretty important, and I'll get into some of that in a little bit, as well as the applications that you build on top, and how enterprise Hadoop integrates into the data center with the existing investments, the other data systems that you may wanna plug it into, either on the inbound or outbound side, and how you light up existing skills, whether you're a SAS user, a SQL user, or, in the case of this demo, an Excel-type user, right? You wanna democratize the access to as many people in as familiar a way as possible. So surrounding the core with enterprise capabilities is important.