Ladies and gentlemen, please join me in welcoming Hortonworks president, Herb Cunitz. Welcome. Welcome to Hadoop Summit 2016. But before we start, Bella Strings, come on back out. Let's give it up for Bella Strings for a great way to start this off. Thank you. Let me kick this off and get started, get everyone going, a little inspiration and some great music in terms of what's happening. So welcome to Hadoop Summit 2016. For those of you who don't know, this is the ninth summit that's been held in the Bay Area. There have been four in Europe, there are some upcoming ones in Asia, but this is the ninth one in the Bay Area. And this one will be bigger and better than all the ones we've had before. Over 4,000 attendees are represented across the group, 170 sessions, and you'll get to participate and see them. And then we've also got an award today, because we've got 36 different countries represented. For all of you in this room who complain about having to travel from, like, San Francisco all the way down to San Jose for a conference like this, there are two people in this room who flew over 10,500 miles from South Africa to come here. So if you're here, a big round of applause for joining us. We're very excited to host this and to have everyone here this year. But before we get started, because I know with a group like this, unless you're able to log into the Wi-Fi, no one's going to pay attention. So to log into the Wi-Fi, what you need to do is, it's obviously up here, right? The SSID as well as the password. But once you've put in the password, you need to go to any website, it'll come up with the MSR Cosmos logo, and then you get access. Unless you do that last piece, you will not get access, and I know there'll be a lot of frustrated people in the room. And the most tweets we'll get is "I couldn't get on the Wi-Fi." So I want to make sure that you're all able to get access as we get started. All right? So first, I'd like to thank our co-hosts, ourselves as Hortonworks and Yahoo, for co-hosting this event. We've got something special this year, because this event is really about the community, for the community, and to give back to the community: we're going to do something around 10 o'clock. Any of the folks who are here who are Apache committers, be ready around 10 o'clock. We're going to do something special for you. We've got a whole number of committers here who are all part of the community, right? Everyone from the first Hadoop committer, our own Hortonworker, Owen O'Malley, to all the other people that you're going to see here today. And we want to do something special on this anniversary, and so we'll walk through that a little bit later today. So I want to give some thanks to a number of our sponsors, because an event like this doesn't happen without the sponsors. First, a special thank you to Microsoft as our innovation sponsor. Microsoft is also a premier cloud partner of ours, and we're very excited to work with them in the market around HDInsight. So I want to thank them for being the innovation sponsor. We also have a number of gold sponsors, and what's interesting is I think you're starting to see not only, I'll say, the traditional on-premises sponsors and providers and people we've worked with in the past as part of the ecosystem, but also more of the cloud platform providers, like Google and others, who want to work in this industry and support it.
So we're very excited to have this number of sponsors. And you're going to see a selection, and I encourage you all, whether it's the platinum sponsors, the gold, silver, or even the exhibitors, to spend time in the session hall engaging with them and learning about what's happening. Because what's interesting about this ecosystem of sponsors this year is that roughly 20 to 25% of the sponsors are brand new this year. And I think that's a great proxy on the ecosystem: how well it's growing and the number of new companies that are coming into the space in all different areas as the platform around Hadoop is getting established, all the waves of innovation that are getting built on top of the platform, and all the new companies you see being founded who are participating in this ecosystem and able to provide value back to you. So before we go further, I'd like to do a little poll of the audience. For how many of you in this audience is this your first Hadoop Summit in the Bay Area? Wow, that is a lot more than I thought. That is a great testament to see the number of people who have come here for the first time. So rather than go through everyone, let's go: for how many is this at least the fourth one? Good, a good group. Good. Fifth, all right, we keep going. Sixth, all right, we've got a handful, maybe less than 15. Seventh, keep raising your hand, we're gonna see who drops out first. Eighth, all right, and up to nine. Anyone? All right, all right, we got some. Congratulations, you've been at all of them. All right, so a couple of things around this summit that we'll talk through. We've got some key themes, and there are five things that you'll see different from some of the past summits. And again, this is a good validation of where we're seeing the industry go, and candidly what everyone has asked for as part of this event. So the first is that data is transforming business and customers are transforming their business. And you'll see that as a key trend in a lot of the conversations about what companies are doing to transform their business. And a big trend you'll see very different this year is a large number of customers, right, people out of this audience, who are gonna be presenting either on the main stage in terms of their use case, what they do, how they're leveraging Hadoop and the connected data platform, as well as in the business track in terms of what they're seeing, right. A big thing that we are seeing, and I think is important in the reason we have a business track and we've got a lot of these other customers talking, is that we've seen the conversation start to expand from primarily a technology conversation to a business value conversation. It doesn't mean it's not about the technology; it just means we're seeing a lot of people start to come and say, I wanna understand what the technology does, how does it work, what are the capabilities, and what's the difference between this access engine and that one, etc. Those conversations still happen. What we're also seeing now is companies, groups, people saying: what does the technology do for me? What value can I get out of it? What use cases can it be applied to? How does it work in my vertical? So the conversation is shifting, and that's a big reason for the business track: to give the people who wanna have that conversation around business value a place to have it.
Third is the connected data platform, and this was introduced last year. A big piece of this is that it's not just about how you manage and store data in Hadoop, how you do economical storage, processing and analytics at scale. It's also about how you ingest the data and bring it into the platform securely and safely, and how you look at that as a holistic ecosystem across your extended enterprise. And now you can take that connected platform and not only run it on-premises, but in the cloud, and leverage the dynamics of the cloud for what you wanna run there. But also to ask: how do you now run in a hybrid environment? How do you truly run in a hybrid environment across the cloud, and start to determine where your data lies and where you're gonna be doing the processing, both on-premises and in the cloud? And while doing that, make sure that you don't give up any of the enterprise-class capabilities around security and governance. You've got a common way for metadata management, for looking at the data, for categorizing it and for securing it no matter where it lies. These are some of the core themes, the trends that we're seeing that people and companies are asking for. And as many of you voted for the sessions that are part of Hadoop Summit, those are the ones that rose to the top in terms of the highest votes, and you'll see them in a lot of the sessions through the course of the week. And the last thing I would say is, again, we're all part of a growing ecosystem. A large number of new companies are coming into the space as the platform is getting established, saying: how do we go get more value out of the data? How do we provide more insight in terms of what's happening across your extended enterprise? And how do we provide that back to both technologists and business users? So these are the key themes, and I would encourage you to spend time through the conference on these types of things that we're seeing. A big piece of this is the business track, which is now available. For those of you who are gonna attend, over 20 companies are gonna be presenting in that business track, and roughly eight of these companies will also participate on stage. Companies like Capital One or Progressive or ASU or some others will be on stage talking about their use case, what they're using the technology for as a connected platform, and how it's providing value to them. So we're very excited about this session. We've got a great show going on for the next couple of days. So why don't we kick this off and get it started? What I'd like to do is welcome Rob Bearden, CEO of Hortonworks, and also one of the founding members of the Hortonworks team when it was founded in 2011. Rob, welcome to Hadoop Summit 2016. Thanks a lot. Looking forward to your presentation. Great, thanks a lot. See you in a little bit. Good morning. How are y'all doing? I hope y'all have had a good start to Summit so far. I'm very excited about the sessions that we have coming up. And I'm really excited about the message. I think we've got a great blend between the technology as well as the business value that the technology is driving, so we want to spend an equal amount of time on both. But I'm really excited about two very big milestones that we're going to be celebrating here during the Summit. The first is the 10-year anniversary of Hadoop. And it's amazing to see what's happened to the Hadoop platform over the last 10 years, and the ecosystem and the community that have evolved from it.
The second thing that I'm so excited to celebrate is that on Thursday it'll be Hortonworks' five-year anniversary. And I'm just very excited to be part of this community, part of this ecosystem, and of what we've done with Hadoop in the last five years to get it to become part of the mainstream data architecture in the enterprise. We have so much more opportunity, so much more work to do. But I've said from day one that the opportunity for Hadoop to create value, to enable the next-generation data architecture across the enterprise, was too big for any one company to accomplish. It had to be a collective effort between the various technology providers, the community, and the ecosystem. And we have to do this together to create the Hadoop platform, help it reach its full potential and drive the next-generation data architecture into the enterprise. And I think we're deep into that. We've accomplished a lot and we're on a great path with a lot of momentum. So very quickly, congratulations to all of us, and I appreciate all your work and help and what you've done for us. Thank you. Because of our collective work and the success that Hadoop has had in the enterprise, what we've really done is create and enable a data revolution. And this data revolution is gonna fundamentally change the way every enterprise and every industry works, and how every line of business interacts with their customers, their supply chain and their trading partners for the next 20 years. And you know, this data revolution that we've created is gonna change how our fundamental lives and our families' lives operate for the next 20 years as well. It's gonna change how we shop. It's gonna change how healthcare is provided to us and how we receive that healthcare from the different providers. It's gonna change how every financial transaction is processed and managed, and we're already seeing that happen now. We've already seen the first phase of transformations in the transportation industry, and how transportation functions in the future is gonna be monumentally different because of the data opportunity and how it can be used. And this data revolution has clearly changed how we interact with the media and how the media interacts with us. You're seeing the very early stages of this right now, but it's gonna become the norm going forward. And for any product that we use, whether it be the car we drive, the airplane we fly on, or even down to just the t-shirt that we wear in our workout gear, there will be data coming off of it that's gonna be monitored so we can understand how well that product's functioning, or any issues that may be associated with that product at any point in time, and be able to understand: is that product functioning properly, and is the customer happy with the outcome of how that product's functioning? And that lets us create very closed-loop interactions with the customers. Because of this, we're gonna create very large new economies that are gonna generate billions of dollars of new value. And because of this data revolution, we're gonna be able to create transformations in industries over the next five to seven years that are gonna be monumentally bigger than the transformations that have happened over the last 50. And Herb Cunitz, our president, has a great saying. He says, the world's changing, and it's because of the data and what we can do with it.
And I think that's so very true, and we're seeing this happen across every industry. We're gonna talk a lot about that during the summit. And what this data revolution is really driving now is that it's beginning to transform the enterprises. This is happening, as I've said, across every single industry, and many of the enterprises are already generating massive value from their modern data strategies. From my perspective, the reason we're seeing this data revolution accelerate so fast and begin to happen right now starts with a massive, massive acceleration of billions of connected devices that are coming online every year, and with the fact that we as individuals and consumers expect to be able to access and get data and any content that we want instantly. You know, it was interesting. TechCrunch just published an article earlier this week, and they pointed out that in the first quarter of this year there were more new connections to the AT&T cellular network from cars than from new phones. And I think that's because we realize now the value of the data: of having that connected device, having that real-time connection and a seamless view of how the products, down to the component level, are performing, and understanding where our products are at any point in time and how to engage with them. At the core of the data revolution, we've learned there's value in capturing this data and engaging with this data, and that we can turn it into incredibly valuable information so we can do things that we've never been able to do before. And the reality is, every enterprise is gonna have to adapt their business model and processes to leverage this strategic data, or they're gonna become irrelevant and disintermediated across their industry. So the great opportunity is that the platforms are there to do it, the capability is there to do it, and we're seeing early evidence in many, many industries, across many enterprises, many of which are here in this room, that they're already leveraging these new data strategies to transform their business and create new products and new services. What this has done is really create a new world of business for us. And as I've continued to harp on this data revolution that we're right in the middle of, what we're seeing is that we're able to gain insights from the data, from what our customers are doing, before they ever transact with us. We understand how our products are performing in real time, and what that does is give us the ability to have a customer engagement and understand what their usage patterns are during the entire relationship life cycle. So not just when they transact and how the product's performing at any point in time, but the other things that they're doing in the life cycle with that product, so we can expand and create new product opportunities. And it gives us a platform to create better products with more applicable value, much faster. What this does is create a more loyal customer base. It creates a more productive relationship between the business, the customers and the partner ecosystems. And ultimately this is gonna help the enterprise drive faster time to value between them, their customer and their supply chain, and it's gonna create a much faster level of velocity.
Over the next several days, in the various sessions, you're gonna see how many of our partners and many of our customers have created value in their enterprises: not only the kind of value, but the way they approached their data journey to get to where they are, what they learned in that journey, and what they see as the next opportunities beyond that. The thing that we also must realize, though, is that many of the enterprises are just beginning their journey right now. Some of you are in the room right now, and we're incredibly happy that you're here. We think Summit is a great platform for you to get exposure on how to go through this journey. But as you're on the journey to start the evolution to your modern data architecture, our job and goal is to help you manage the constraints of your traditional data management platforms. The reality is, we're here to help you transition from the traditional models of managing data to a modern data architecture through a very specific set of measured steps. The first step is really realizing that traditional data tends to be largely structured, it's probably fairly predictable, it tends to be batch oriented, and the vast percentage of it is on-premises. But what this does is create a real limitation on how you can take advantage of that data with any level of velocity. It tends to be very expensive to move and manage this traditional data; there's a lot of overhead that comes with it, and it usually comes too late to make a meaningful impact or to leverage value in real time, pre-transaction. What we typically find is that traditionally the role of IT has been to focus on the implementation of very monolithic applications. As a result, you end up locked into proprietary platforms with very structural and procedural limitations associated with both your data and your applications. But as we take you down the path to evolve into a modern data architecture, we see that shift happening right now. We believe the shift and evolution is well into its momentum curve right now, transitioning from the traditional methods of managing traditional data into an overall, collective modern data architecture. And what this has done is create a new world order of business, and of how business is done, at a very different and new velocity. For enterprises to achieve a true competitive differentiation, you've gotta go remove those constraints that have been built up in these traditional systems and processes. And ultimately this is gonna require you to access all of the data. You've gotta be able to bring together the traditional data sets that sit across the various parts of your enterprise. You've gotta bring the new-paradigm data sets in: the mobile, the social, the clickstream, the sensor data. And you've gotta be able to combine that with the streaming data sets and provide a connected data platform by bringing the data at rest and the data in motion into a continuous life cycle and engagement process that never stops, right? And I think that's really the opportunity for us: to help you bring all of your data together, both the data at rest and the data in motion, into a continuous life cycle and engagement. And with that, the future of the enterprise is going to be about accessing and analyzing all the data, all the time, through its entire life cycle, right?
And with that, what we believe the pattern and the go-forward nature will be is that the successful enterprises are gonna have the ability to enable an environment that leverages all your data irrespective of where it is: whether it sits at the furthest edge, whether it sits at a collection point, a router maybe, whether it sits across multiple clouds, whether it's completely leveraging Azure, or whether it just fundamentally sits static in your existing on-prem data centers, right? And bringing all the data together from all the sources to ultimately create a seamless modern architecture is gonna give you the ability to have insights about your customers, your products, your supply chain and trading partners at a velocity and in a way that you've never been able to accomplish before. And the reality is, in the new world, data is going to be everywhere, and it's going to be in more places than ever before. That's at the edge, sitting in the device itself, on the sensor, on the piece of equipment, on the end product; it's gonna reside across multiple clouds and even in the data center. And we believe the modern data architecture is about having the connected data platforms that bring all the data together seamlessly. I think that's the key, and that's the fundamental point: in a modern data architecture, we at Hortonworks are going to give you the ability to create a connected data platform architecture that brings all your data together, irrespective of where it is, in a seamless manner. And when we do that, instead of having to aggregate and move data and normalize it, and then, post-transaction, figure out what happened three days to three weeks later, you're now gonna be able to take the analytics to the data as an event's happening. And as that event's happening, you're gonna be able to enable a dynamic process to start and execute as that event's happening, or as a condition changes, in real time. We're gonna give you the ability to very easily onboard new data sets and new devices, and bring value into the enterprise instantly as they onboard. And then, through machine learning, we're gonna give you the ability to constantly and continually optimize that business process so that you end up getting the best and highest yield from that workflow and that process. We're gonna continue to learn what those patterns are, what those opportunities are, and then, as those events and conditions happen and change, we execute those new processes and create those new opportunities. And this is gonna fundamentally change how the enterprise functions, across every line of business, in every industry. In this new world of connected data platforms, we're gonna generate trillions of dollars of value as you transition from the traditional methods of managing data and the architectures that came with them, and we help you leverage the connected data platform strategies as a core operation of your business model, okay? And so, if we just very quickly think about it, I think this applies across every industry, but just think about retail for a moment. In the retail world's traditional architecture, there's been a lot of evolution and a lot of disparate systems. From the back-end standpoint, you have multiple customer records, or systems of record, for your customer. You have multiple inventory systems. You have multiple methods by which you're planning and driving your supply chain, probably not integrated.
You're probably not coordinating your demand planning with your inventory management. You don't have a single view of all of your inventory. Most of those systems probably sit very fragmented in a back-office processing environment somewhere. The typical retailer, though, is probably gonna have a web store in a cloud, probably Azure. And then they're also gonna have loyalty systems, so they understand what their relationship with their customer is at any point in time within a single channel. And they're gonna also have, you know, customer response systems, but they don't have a single view of their customer. The goal of every retail environment is to create a real-time, predictive, and personalized interaction with each customer, regardless of what their relationship is, meaning they may have multiple relationships with that customer, and regardless of channel. And that's a very, very hard thing to do with the traditional environments and traditional systems. But with a connected data platform, we can eliminate these constraints. We're gonna do that by helping them create a single view of all the relevant data: a single view of the point-of-sale data, a single view of the customer information, a 360-degree view of all the relationships they have with that customer across all channels. You have a single view of the supply chain, a single view of the inventory, a single view of the financial, pricing and location information. And with that, the retailers are gonna have the ability to know every interaction they've ever had with that customer, regardless of product, regardless of channel, or whether that interaction's been physical or digital. So irrespective of whether it happened in the store, or whether it happened as, maybe, a contractor relationship versus a retail relationship. Now they can understand: this individual has been ordering bulk product, maybe hundreds of thousands of dollars' worth, but has walked into the retail store to pick up maybe an insignificant item. We wanna make sure that we know, because we've seen that individual walk in, we've picked up through our geo-fence that they're there. We know where they are in our store at any point in time, and we know how valuable that customer is, because of not only the orders they've placed with us, but the kinds of click-throughs they've done on our dot-com environment. And we know all the products and all the complementary configurations around those products, and we can take them through a very structured methodology of engagement. I can apply that now to having a great engagement to transact, and we end up generating more revenue at better margins. That's one of the early advantages of the connected data platform. But then we can apply that from a velocity standpoint, because as we know in real time what our customer's doing, what they're buying and what their preferences are, we can now design very powerful, high-velocity, pull-through supply chain models and begin to pull inventory through our system, reducing the amount of capital that we have to put to work for inventory and creating best supply chain practices. And we're seeing many of our customers do that and drive literally hundreds of millions of dollars directly to the bottom line with these best-in-class environments. But there's value like that in every industry, and we're gonna talk about these things in different tracks. I'm about to run out of time, and Ingrid's gonna give me the hook in a minute, so I'm gonna start wrapping up.
But the connected data platforms apply across every industry, and we work with not only the up-and-comers in these industries, but clearly the leaders. Whether that's telco, retail, financial services, or the energy vertical: while there's a lot of price-point pressure in the energy vertical, they're using data to get more efficient and get more leverage. And what we're helping them do now is create the modern data strategies that are based on the connected data platform architecture, and ensure that they're able to maximize the value of all their data, right? One of the industries where we have a number of initiatives going on right now is healthcare. And one of the announcements that we'll have coming out later today, which I'm so incredibly proud of and obviously excited about, is the formation of a healthcare consortium. The mission behind that healthcare consortium, very simply, is to cure cancer in our generation. We're gonna do that by building and delivering a next-generation open source genomics platform. Within this platform, we're gonna have the ability to dramatically accelerate the analysis of genomic patterns and sequencing. And by accelerating the analysis of genomic sequencing, we're gonna understand how to go about creating the medicines and the treatments to not only prevent but cure cancer. The founding members behind this consortium are Arizona State University, Baylor College of Medicine, Booz Allen Hamilton, the Mayo Clinic, OneOme, Yale New Haven Hospital, and of course Hortonworks. And the reason that I feel this is important, and that I am so personally passionate about it, is because we have the ability to leverage technology to save lives and cure cancer. And we have the ability to do that in our lifetime, in the generation of everyone in this room, right? And that's what it's really all about. I couldn't agree more with DJ Patil, the U.S. chief data scientist in the White House Office of Science and Technology Policy. He clearly states that initiatives like this, where we're breaking down data silos and applying technology to accelerate the analysis and find cures for devastating diseases like cancer, are ultimately gonna save lives. So imagine this future. It's amazing to think about what we've achieved in the last five years. We've gone from a number of loosely coupled engines and products, taken that and created Hadoop, and built a modern data architecture that's leading us to the connected data platform strategies, right? And we've taken that to driving a platform that's powering a consortium whose mission is to cure cancer and save lives. So that's pretty cool. That's pretty amazing, right? Yeah, I agree with that. That's what technology's all about. That's what we're about as a community, right? But as amazing as this run's been, this is only the beginning. And you know what? Going forward, the innovation that we've been able to achieve together as a collective team over the last 10 years is only gonna accelerate. Our future really is just getting started. The thing I want you to take away from this, and I'm gonna wrap up my story, I'm over my time, is that data is every one of your products. Data is the product now. And data creates your opportunity, as an individual and an enterprise, to win. It's your most leverageable asset, and data really is gonna define your destiny. So with that, I'll wrap up. I appreciate your time. Thank you for coming to Summit.
Let's celebrate the 10 years of success we've all had, you've had. And if you need anything while you're here, please let any of us who have on the Hortonworks jerseys know, okay? Thank y'all. All right, real quick, we've got one more important thing in this session. It'll be quick, but it's very, very important. And I'm extraordinarily excited and appreciative to welcome Joseph Sirosh, who's the corporate VP of the data group at Microsoft. As any of you know who know us or follow us, Azure is our premier cloud solution and one that we've been engaged with for a number of years. We could not be more appreciative of our partnership, and of the engineering work and strategy and the ultimate solution that we have the privilege to participate in with Azure and Joseph and his team. So Joseph, please, please join us. Thank you for coming. Thank you very much. Absolutely. Thank you. So Microsoft and Hortonworks have been great partners over the last four years, building a great cloud service together, building great customer value through contributions to Hadoop and YARN, and working together to enable solutions that transform the world with data. Thank you, Rob. Thank you. So today, I'm here to talk to you about the new unreasonable. I call it Unreasonable 3.0. It is the unreasonable effectiveness of Algorithms, Cloud, IoT and Data — ACID for short. So I call it the unreasonable effectiveness of ACID. Before I explain, let me start with a story. Now, there are 440 million children in India, more than the population of the US, Mexico and Canada combined. Only about 50% of them attend school regularly, and a very large number of them drop out. It is a world of lost possibilities. Imagine, for example, a child — let's call him Rohit. He is born, he goes to school, he drops out; he lives in the suburbs of New Delhi. He could have been a doctor, saved lives. Let's take a girl, Ishita, born in a slum in Bombay. She goes to school for a little bit of time, but she has to contribute to her parents' well-being. She has to work. She drops out of school. She could have grown up to be a professor, taught many other children, and saved the lives of a lot of other people. And there are a million Rohits and Ishitas in India every day. These are millions of children who never live to their full potential as human beings. The state of Andhra Pradesh in India is looking to change that today with machine learning. Using data from schools — on school infrastructure, on teacher education, on the socio-economic status of kids from the Aadhaar database that they have — they are now able to predict when a student will drop out. They have risk scores for every student and every school. There are over 10,000 schools and 600,000 predictions that have been done to date. And this academic year, there will be five million children who are scored through that machine-learning-based predictive system to understand the risk of dropout, so that they can drive targeted interventions to save them and help them realize their full potential. That's the unreasonable power that data and machine learning bring to us. Now, what's all this Unreasonable 3.0 about? Let me step back and tell you. In 1960, a famous Hungarian physicist called Eugene Wigner wrote a paper called The Unreasonable Effectiveness of Mathematics in the Natural Sciences. He was talking about the unreasonable power that mathematics has to describe the physical world, from subatomic particles to the very origins of the universe at the Big Bang.
And he said: we don't deserve this effectiveness. We don't deserve this explanatory power of mathematics in the physical sciences. In 2009, Alon Halevy, Peter Norvig, and Fernando Pereira wrote another very influential paper called The Unreasonable Effectiveness of Data. They said, look, mathematics is very, very effective in physics, but in a lot of things we care about — language, economics, the behavior of people, plants, creatures — mathematics is only effective to a certain extent. But where mathematics is not unreasonably effective, data is unreasonably effective. When you have large amounts of data, even very difficult tasks, such as translating from Chinese to English or vice versa, become easy, and the accuracy you can get today with statistical natural language translation is unprecedented. They called this the unreasonable effectiveness of data. Now, we are in a new age. We don't have just equations or algorithms and data; we have the unbelievable power of the cloud, and we have millions, even billions, of connected sensors instrumenting every type of behavior on the planet, so all of that data becomes analyzable. That is Unreasonable 3.0: the unreasonable effectiveness of Algorithms, Cloud, IoT and Data, all coming together. Again, let's see a few examples of applications. Imagine, if you will, our world with an ACID layer on top of it: a layer of Algorithms, Cloud, IoT and Data. What's possible? Here are a few examples. You can alleviate school dropout. You could increase food supply. You could improve transportation safety. You can treat genetic diseases, and even, perhaps, cure cancer one day. You can optimize work environments, protect air quality, even prevent radiation exposure. I'll take a few of these examples and show you how customers are realizing the value in this world today. So let's dive into that. Let's start with the unreasonable effectiveness of algorithms. Every business today is an algorithmic business, but it appears even in places that you least expect. For example, in India again, farming is being revolutionized with the power of data. And that's important: in 2015, by the way, about 5,000 farmers committed suicide because of crop failures and environmental disasters. And now, again, the state of Andhra Pradesh in India is using data on rainfall patterns, data on crop failures, data on soil conditions, and combining all of that rich data with machine learning, so they can predict, in every village, for every farm, in a very localized way, what the optimum time to farm is. They even send alerts to farmers — SMS alerts with the dates on which they should actually sow their crops. Another unreasonable effectiveness of algorithms. So now let's look at another example. This is about the unreasonable effectiveness of the cloud. The cloud, by the way, is incredibly effective because of its elastic scale. You can command enormous amounts of computing power in a very, very short amount of time. And at the same time, it makes the development of applications 100x faster, because you can glue together very powerful platforms-as-a-service in the cloud to build reliable software applications. Let me now explain that power of the cloud with a customer example. People used to refer to the two pillars of discovery, theory and experimentation, but now you can look at computing also as a way of discovery. In the life and medical sciences, the days of the aha moment, when the test tube turns blue and you've made your discovery, are ending.
The new life science discovery paradigm is massive amounts of data, new ideas for analyzing that data, and the capability of doing it. If you go back to 2001, it cost $100 million to sequence a genome. Because all that technology has been commoditized, today it's less than $6,000. The sequencers that are out there are generating data so fast right now that 15 petabytes of next-generation sequencing information need to be analyzed on an annual basis. The data is doubling every eight months, and our compute capability is only doubling every 24 months. So it's going to be hard for institutions to create the computing resources to be able to do this alone. This is where the cloud comes in. It makes sense to have cloud providers like Microsoft Azure provide those resources. What's exciting is that we have access to Hadoop-based tools. HDInsight provides us an interface that enables life scientists to grok this humongous amount of data that's coming out of next-generation sequencers in a very natural way. When we first started doing this a year ago, it took two weeks to analyze one genome on our computer. Now we do 100 genomes a day. So what excites me about what I'm doing in the cloud is the ability to accelerate discovery to the point that we may be able to find treatments for cancer. Algorithms, data and the cloud will change lives. Now let's talk about the unreasonable effectiveness of IoT and sensors. That effectiveness comes from the ability to instrument every type of behavior on the planet, whether it be the behavior of human beings or creatures or plants or machines, or even servers in the cloud. Now, there were about 500 million connected sensors in 2013. In 2020, there will be 50 billion. That's seven connected sensors collecting data per human being on the planet. Imagine the power to analyze all of that and make a difference. Now, a company — again, I'll explain with the example of Schneider Electric — is using all of this data to effect change for customers, both consumers and enterprises. Let me play the video. Schneider Electric is a global provider of energy management and automation solutions. We're in multiple industries, from smart cities, connected oil fields, mining, minerals and metals, and food and beverage processing, everything from the mass market all the way to connected and efficient homes. We've been creating smart devices for years. When we add on to that the connected smart device and all the data that we can collect from those devices, and we start to look at new revenue models, new business models that we can deliver, it was a natural fit for us to embrace IoT as the next step for Schneider Electric. And it helps a lot that Microsoft is targeting enterprises like ourselves, so that they can bring their expertise in consolidation and engineering efficiency to what we are doing as a group. Azure IoT provides us with a very open, flexible, highly scalable platform. We can now talk to projects that might have 1,000 devices or a million devices, and we're comfortable that the cloud technology can scale elastically to support those situations. We're also using Service Fabric from Azure in order to better process data, provide device logic and provide our own device-connectivity framework. We're providing more value to the customer by being able to do more analytics on the data that we're collecting in the cloud. No one asks us to do IoT for them. What they ask us to do is deliver better value on top of the data that they are collecting.
So now let's talk about all of that data: the unreasonable effectiveness of data. There will be 40 zettabytes of data in 2020, in the cloud and elsewhere. Now, if one byte of data were one grain of rice, an exabyte would blanket the West Coast, and a zettabyte would fill the Pacific Ocean. And a yottabyte — 1,000 zettabytes — would be an earth-sized rice ball. Imagine the power to analyze all of that data. Data allows us to see the unseen. After the Fukushima disaster, the nuclear plant disaster in Japan, a group of concerned citizens created an organization called Safecast to monitor their environment for radiation. They had Bluetooth-enabled Geiger counter devices to measure radiation, which people all over could wear and send data to the cloud, so that it could be shared and they could see what was happening in their environment. And here is an example of a map created with that Safecast data, analyzed with HDInsight, the Hadoop tool in Microsoft Azure. Look at the spread of radiation, for example, along the highways as people transported material. Now, citizens in Japan have the ability to interactively look at a map and see the spread of radiation. And it's not just in Japan, it's a worldwide phenomenon — Safecast, go to the website and you will see that. That's the power: data allows you to see the unseen and recognize the environment we live in. And it's not just data, it's not just the cloud. The cloud gives you the possibility to engage a vast number of creative human beings in crowdsourced efforts. We have, for example, now put up a machine learning competition on the web with open data from the Bill and Melinda Gates Foundation. The objective of this competition is to predict health risks for women in underdeveloped countries, such as HIV infections. In fact, you can also participate in this. Everyone here can. Go to that URL, aka.ms slash women's health, and using free cloud platforms and open data, you can model and predict when some woman in a poor part of the world is potentially exposed to a health risk. You can make an unreasonable difference. So, let's take a step back. At Microsoft, we are building great platforms to enable such applications to be built: powerful applications that combine algorithms, the cloud, IoT, and data. It has components for information management, for big data, for machine learning, Hadoop tools. We have great visualizations, even a bot framework to build great chatbots. And our customers are combining them in a large number of really exciting scenarios. Let me share a small number of them. For example, customers take data from IoT devices. They bring the great power of Hadoop tools, using HDInsight, to analyze all of that data. And using a dashboard like Power BI, they visualize and create interactive maps, like some of what you saw before. They land big data in an HDFS file system or Azure Blob storage, or in NoSQL databases like HBase, or in Azure Data Lake, which is an exabyte-scale Hadoop file system as a service in the cloud. Then they connect HDInsight to all of that and analyze the data at scale. They even use HDInsight to cook data in ETL processes and load SQL Data Warehouse with it, so they can do interactive querying. And SQL Data Warehouse itself is an elastic, traditional SQL data warehouse that's capable of scaling up and down. And the more sophisticated customers use all of them. They bring all of these things together, and HDInsight on top becomes the place where they integrate data and cook it.
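As a rough illustration of that landing-and-cooking pattern, here is a minimal sketch of what it might look like in PySpark on an HDInsight cluster; the storage account, container, and column names are assumptions for the sketch, not details from the talk:

```python
# Hypothetical sketch of the landing-and-cooking pattern described above:
# raw events land in Azure Blob storage (the default HDInsight store),
# HDInsight Spark "cooks" them, and the result is written back out where
# a warehouse loader could pick it up. Names here are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-cook").getOrCreate()

# Read raw JSON landed in Azure Blob via the wasb:// scheme.
raw = spark.read.json("wasb://landing@myaccount.blob.core.windows.net/events/")

# "Cook" the data: drop bad records, then aggregate per device per day.
cooked = (raw.filter(F.col("device_id").isNotNull())
             .groupBy("device_id", F.to_date(F.col("ts")).alias("day"))
             .agg(F.avg("reading").alias("avg_reading")))

# Write curated output for downstream interactive querying, e.g. a load
# into SQL Data Warehouse.
cooked.write.mode("overwrite").parquet(
    "wasb://curated@myaccount.blob.core.windows.net/device_daily/")
```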
And they layer predictive models using R Server. Microsoft R Server is a scalable implementation of R that can run against big data. It's parallel on Hadoop. It's there, by the way, on HDInsight in the cloud, on Hortonworks' distribution on-premises, and on other Hadoop distributions. So you can do analytics at scale, and you can drive intelligent action from it. And even more sophisticated data scientists will use the Jupyter notebook. It is a great framework for combining code and descriptions, creating experiments, and sharing and collaborating with others. So here's what's coming together: you're getting today the incredible power of algorithms, cloud, IoT and data coming together in open platforms, with the power to engage a very large number of developers in creative efforts, driving a great amount of unreasonable effectiveness. In many ways, that's the future of Hadoop and analytics. And it's up to us now to be unreasonably effective in changing our world with them. Thank you, thanks. Joseph, thank you — please stay up here, Joseph. So I wanna thank Joseph and Microsoft for their partnership and for the work that we've done together in the industry. I also wanna thank Microsoft for what they've done for the community. What I'd like to do is use this as an opportunity to talk about the community, because in the end, it takes a community to raise this elephant. It takes time, and it takes people and a lot of grassroots efforts. So first, anyone who's a committer in the Hadoop ecosystem out there, please come up on stage right now. Please stand up if you're a Hadoop ecosystem committer. Stand up and come on stage. Come on up, faster. So as you see everybody walking up, right? Come on up, just stand on stage here and leave the one white line right there in the center. As everyone's coming up: it takes a group of people in a community, and a grassroots effort, to really build this out, to work with this technology and to build out this whole ecosystem. It didn't just happen on its own, right? It's people like this who, from the bottom up, have been writing the code, working on the architecture, designing what's happening, and really building everything that happens in the community, and working with partners like Microsoft and others. And look, there's a lot of them. This is great. But it takes a community and a group of people, come on up, right, to work through this. So we thought this was a great opportunity, one, to thank the community for everything that they've done, for all the tireless efforts they've put in to build out the technology and work with it, and also to celebrate 10 years of Hadoop and thank everyone for what's happening. So to the community, let's please give them a round of applause. We wouldn't be here without them. And what's great is we continue to see this as a thriving ecosystem: more and more people continue to join the community who not only want to participate but who want to contribute, give back, and help continue to advance the technology and build it out for everybody involved. So again, without everyone here, we wouldn't all be here. Congratulations on the next 10 years of Hadoop. It's been a great opportunity so far, and I know all of you have great things ahead of you in terms of where the technology goes next. So thank you, everybody. You can sit down. All right, everyone's gonna return to their seats — and you don't get a piece of cake.
Thanks, everyone, you can go this way as well. Thank you. So while they're all stepping off, what I'd like to do is introduce our next speaker. Arun Murthy is coming up, and many of you have seen Arun talk in the past. He's talked about what's happening in Hadoop. He's talked about YARN. He's talked about the technology and where it's going. But this time, where we thought we'd focus a little bit is: now that we've had 10 years of Hadoop, and we've got a decade under our belt of all these people in the community working, contributing, committing and driving the advancement of the technology, we thought this would be a great opportunity for Arun to give a little bit of a perspective on what the next decade looks like, what's possible, and what's coming next, and to walk through that. So Arun, as a founder of Hortonworks and a key contributor, thank you very much. Thanks, Arun. Well, standing room only — it's kind of amazing. I've got to start off by saying thank you to all the folks on the stage, because we've come a long way in 10 years. We started off in 2006 with a few hundred lines of code, but as someone said, it takes a village to make this, and you saw parts of the village. But then there's more. And it's not just the folks who contributed code: if you used the product — or the project; 10 years ago it wasn't really a product — if you contributed documentation, if you answered questions on the mailing list, you've all played a big part. So again, thank you. As you all know, the journey began 10 years ago. Back then, Hadoop was really simple. We were all part of an effort to just get some massive batch analytics for web search, which I sort of still think of as the quintessential big data use case. We went from there, and we had a whole bunch of new projects show up, whether it's Hive from Facebook, Oozie, on and on, HBase from Powerset back then, which got acquired by Microsoft. So it was sort of the beginning of a journey. In 2011, we stood up on stage and said we'd made sort of a quantum leap with Hadoop. We'd gone from having just a batch system to allowing you to not only put all the data in one place, but also, as the data develops what I call gravity, to pull in different applications, whether it's batch, whether it's real time, whether it's interactive. Batch is still sort of my favorite, right? Today, again, the ecosystem has gone even further — whether it's HAWQ, whether it's Phoenix, whether it's Ignite, whether it's Spark, a lot of new projects have come in, and that's really what you wanna see, right? What you wanna see is an ecosystem building up all around Hadoop. Now, what's also been exciting in the last five years is that not only do we have the open source communities and all the folks who were up here contributing to the growing Hadoop ecosystem, but we also have the wider enterprise vendors — whether it's IBM, whether it's Pivotal, GE — come aboard, and that's been a big deal with the ODPi initiative, right? So we're there. Next up, we continue to push. We've had Hadoop be effective not just on-prem, but also in the cloud. You heard Joseph from Microsoft. HDInsight — again, his team has been an amazing sort of contributor and collaborator in the open source community, right? It's sort of the new Microsoft, and having them work along with us to push HDInsight, as, you know, our premier sort of cloud partner, has been amazing, right?
So what I wanna do, instead of me doing all the talking, is have Asad up on stage, right? Asad is sort of the group program manager for HDInsight, and he's gonna talk a little bit about how HDInsight and Hadoop are making an impact on industry. Thank you, Arun. Hi, I'm Asad Khan. I lead the big data service in Microsoft called HDInsight, and I'm really excited to be here with you in San Jose. It is amazing to see the transformation and the growth that open source technologies like Hadoop and Spark have gone through in the last few years. Most of the customers that I work with — the HDInsight customers — are Fortune 500 companies. These companies already have a data pipeline which is built on technologies like CRM and OLTP and data warehouses. And these companies are now looking to transform and adopt the data-driven culture, as Herb and Rob talked about. In order to do so, the key assets they look for are technologies that are cloud-ready, that are enterprise-ready, and that are productive day one. And these are some of the key areas where my team works in collaboration with the open source contributors. So in the next five minutes, I will show you some bits and pieces of that through the demo. What I have over here is the Azure portal. This is a single pane of glass for all the assets I have in the cloud, and through this portal I can spin up any size of Spark or Hadoop cluster within minutes. When the cluster comes up, it is already optimized for your workload. You can scale that cluster up and down. You can have any application deployed on top of it, whether it is a Microsoft application or an application built by any other company. In this case, I have R Server running on it, and I will talk more about that in a minute. It provides you a set of tools which gives you productivity right away. All of these tools are open source and built in collaboration with the community; any innovation that we do goes back to the open source ecosystem. For this demo I will use Jupyter. Again, Jupyter is one of the most widely used tools by data scientists. It is a collaborative environment which runs on the web. For this scenario, what I have done is taken a data set from the U.S. Bureau of Transportation: every flight that took off in the U.S. over the last 20 years. Now I can go and do a quick count on the data set. It will take the JSON file, schematize it, and then create a temp table for me, which is flights, and then I can go and write any query. Very quickly I was able to run the query, and again, this covers the 160 million flights which happened in the last 20 years. And I can go and build more on it and ask interesting questions. I'm interested in the pattern of delayed flights and how it relates to a certain airport. As I get the result, it is very hard to parse it in tabular form, so I can use the built-in visualizations — again, all built into the Jupyter notebook, which is now in the open source community. All these visualizations are interactive, which means I can zoom in to any area I want and quickly find the airport which has the most delays. In this case, it happens to be MKC, which is the downtown airport in Kansas City. So, building more on that — again, this is a very, very simple example.
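For readers who want to follow along, here is a rough PySpark sketch of the notebook steps just described (the on-stage demo used Scala and SQL in Jupyter); the storage path and column names such as Origin and DepDelay are assumptions, not the actual demo assets:

```python
# Hypothetical PySpark version of the demo steps above. Assumes a Spark
# cluster (e.g., HDInsight) and the on-time flight data landed as JSON.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flight-demo").getOrCreate()

# Spark infers ("schematizes") the JSON structure on read.
flights = spark.read.json("wasb://data@myaccount.blob.core.windows.net/flights/")
flights.createOrReplaceTempView("flights")

# Quick count over the full data set (~160 million rows in the demo).
spark.sql("SELECT COUNT(*) AS n FROM flights").show()

# Delay pattern by origin airport, like the interactive chart on stage.
spark.sql("""
    SELECT Origin, AVG(DepDelay) AS avg_delay
    FROM flights
    GROUP BY Origin
    ORDER BY avg_delay DESC
""").show(10)
```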
Most of the time, what enterprises are doing is taking this data set, or the data sets they are generating from their own enterprise, joining them with many other data sets, and doing modeling on top of that. What I will do quickly is switch to the other notebook. As you will notice, it's the same environment: I just went from defining the schema in Scala to writing SQL, and now I will use R to go and build the model. What modeling gives me is both a view on the historical data and prediction. In this case, what I'm using is R Server. R Server is a set of highly optimized, scalable algorithms available to data scientists. Today, most of the algorithms data scientists work with on their computers are written for a single machine; they will not scale to the terabytes or petabytes of data that you have in your enterprises. That's where R Server comes in. And it is not only available on Spark or Hadoop; it is available across various platforms. Sometimes we call it WODA: write once, deploy anywhere. It is available on SQL, it is available on-prem, it is available on Teradata — anywhere you want to go and run those algorithms. So in this case, what I did was use linear regression to give me continuous values, and I used the days of the week. That now allows me to ask questions like: what is the number of flights that happen on each day? Again, I'm now going over all the 160 million records from the last 20 years. I can go and run that, and it gives me a nice chart. And again, since it is modeled, it is even more interactive. As you would expect, there are more flights on weekdays; towards the weekend it drops, and it starts to pick up again on Sundays. Similarly, I can now look at the average arrival delay by day of the week. If I go and run that, it gives me the graph for it. Isn't it amazing to see that early in the week most of the flights are on time, but when it's time to go home on Friday, most of the flights are delayed? So now I will go and use one more modeling technique, which is classification, and I use the 15-minute delay threshold as the classifier. And then I will do an interesting thing: now I will go and do prediction, because up till now what I've been looking at is the historical data. For the prediction, consider this: today is Tuesday, the conference finishes on Thursday, and most of us will take a flight back home. The airports that we will choose will be either San Jose, San Francisco or Oakland. Why don't we go and predict what the delay will be from each of those airports? So if I go here, again I have the days — I pick Thursday. I pick all three airports. I pick the month, which is June, and then I will go and model that. If I run it, you will see the chances of delay. And if you pick San Jose, you are lucky: as per the prediction, it will have the least delay. So in closing, I will say that it is amazing how companies big and small, and individuals like you, are coming together and contributing to a single code base and building platforms. And these platforms are used across the world to solve all types of business problems. And that is really exciting to see. Thank you very much. Thanks. That was really cool. Yeah. So now you all know why we have the Hadoop Summit in San Jose and why it ends on Thursday.
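The on-stage classifier was built with Microsoft R Server; as a loose stand-in for readers without it, here is a hedged sketch of a comparable delay classifier in Spark MLlib, continuing from the earlier PySpark sketch. The 15-minute label, feature columns, and airport codes mirror the narration but are assumptions, not the demo's actual code:

```python
# Hypothetical stand-in for the on-stage R Server model: a logistic
# regression classifying flights as delayed (>15 minutes) or not, from
# day of week, origin airport and month. Column names are assumed.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import functions as F

# `flights` and `spark` come from the previous sketch.
labeled = (flights.filter(F.col("DepDelay").isNotNull())
                  .withColumn("delayed",
                              (F.col("DepDelay") > 15).cast("double")))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="Origin", outputCol="origin_idx"),
    VectorAssembler(inputCols=["DayOfWeek", "origin_idx", "Month"],
                    outputCol="features"),
    LogisticRegression(labelCol="delayed", featuresCol="features"),
])
model = pipeline.fit(labeled)

# Predict Thursday (DayOfWeek = 4) departures in June for three airports.
candidates = spark.createDataFrame(
    [(4, "SJC", 6), (4, "SFO", 6), (4, "OAK", 6)],
    ["DayOfWeek", "Origin", "Month"])
model.transform(candidates).select("Origin", "probability").show()
```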
Thanks. That was really cool. So now you all know why we have the Hadoop Summit in San Jose and why it ends on Thursday, right? So that's great. Again, like I said, big props to Microsoft. The way they've come into the ecosystem and contributed to Spark and Hive and Hadoop, especially YARN, has been amazing to see, and we hope to see more of it.

So everything we've talked about for the past five or ten minutes has been about how we got here. Now let's look a little bit ahead. What's next? We've spent 10 years on this; the first decade is over. We have a ten-year-old at this point. Last week I saw Merv from Gartner tweet that hopefully you all can help Hadoop get through the difficult teenage years. We want to make sure we don't end up with a particularly troublesome teenager.

So you've got to step back and understand what people are doing with Hadoop today. When we started HDP in 2011 (gosh, it's been five years, it's flown), the very first iteration of the Hortonworks Data Platform had about eight or nine components. Single digits. Today we have 25, 26, 27; I can't keep track of them all. But really, what's happening out there is that as we push all these technologies out, customers are actually trying to solve a business use case. That business use case is usually an application, and that application is trying to deal with massive amounts of data, come up with an insight, and either do something simple, like BI or a reporting tool, or hopefully something more interesting, like a predictive app.

So let's take an example. I call these the modern apps, and from where I stand, every modern app being built is a data app: it fundamentally uses data in a really interesting way to come up with an insight and drive business outcomes. Let's look at something we've been doing in financial services in the Northeast. Here's a modern credit-fraud app. Basically, you're getting a lot of data from a bunch of sources: customer, social, web, and so on. All of this data captures user behavior, and it's being used to drive a predictive model, as you can see. Alongside those models, you've got customer-service analysts doing some manual retraining of the model based on customer cases and support calls. And finally, a lot of these insights are used to go back to the customer and close the loop: send a text saying, OK, I saw that your credit card got used by somebody, maybe in London; we know you're generally around the San Jose area; is this a fraudulent transaction? That feedback comes back from the user, and you use it to build the model again.

So this is an example of how people are building applications using the technology we sell. If you peek under the hood, you use technologies like Kafka and NiFi for ingest, Storm for real-time event processing, Spark to build the models, and HBase to store and serve the models and push them back. This is an example of what I call a modern app.
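To make the wiring concrete, here is a deliberately simplified, single-process sketch of the serving path Arun describes: consume transactions from Kafka, score them, and keep the latest score in HBase. The real app runs Storm topologies between those pieces, and the topic name, table name, and scoring function below are hypothetical placeholders.

```scala
// Simplified serving path: Kafka in, score, HBase out (illustrative only;
// topic, table, and the scoring function are hypothetical placeholders).
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.clients.consumer.KafkaConsumer

object FraudScorer {
  // Stand-in for a model trained offline in Spark.
  def score(txnJson: String): Double =
    if (txnJson.contains("\"country\":\"GB\"")) 0.9 else 0.1

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("group.id", "fraud-scorer")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("card-transactions"))

    val hbase = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = hbase.getTable(TableName.valueOf("fraud_scores"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.forEach { rec =>
        // Row key = card/transaction id (assumes keyed messages).
        val put = new Put(Bytes.toBytes(rec.key()))
        put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("score"),
                      Bytes.toBytes(score(rec.value())))
        table.put(put)
      }
    }
  }
}
```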
Now, it turns out that it's not particularly easy to stand this up today. We spent the last 10 years building all these pieces of technology, but it's not easy to stand all of this up in your data center, in your Hadoop cluster. You've got to secure it. You've got to make sure it's highly available. You've got to make sure you have DR, on and on. As you build these requirements in, it's not easy, and that's where you see the troublesome-teenager aspect.

Now look forward and ask: wouldn't it be nice if you could just download the credit-fraud app and run it on your HDP cluster? You wouldn't have to download Kafka, download Storm, and put them together yourself. If somebody has already done it, I'd rather pick it up. Hopefully somebody has done 80 or 85 percent of the work, and I'm happy to customize the last 10 or 15 percent. That's where we want to take Hadoop: you select your engines and services (engines like Spark or Hive; services like HBase or Storm) and wire them together. Equally importantly, you get a user-friendly UI and UX on top, because when you go into the cluster running this application, you don't really want to see Storm and Spark. They're important technologies, but you don't want to see them individually; you want to see the credit-fraud app. And last but not least, you want to secure and operate the whole thing as a unit: secure the entire credit-fraud app, make sure the data is secure, make sure the applications are secure, and operate it as an application, not as a set of individual technologies.

That's really where we want to take Hadoop, and thankfully a lot of technology has emerged in the last couple of years to help. A great example is Docker. Imagine a world where you can take these applications, package them as Docker containers, and just download and run them on your Hadoop and YARN clusters. Keep that in the back of your mind.

I want to use this opportunity to bring Vinod up, and Vinod is going to walk you through a real example of how you can actually get this. Vinod, come on up. Vinod has been around for a long time in the Hadoop ecosystem, and it's been my privilege to count on him as both a colleague and a friend. He's been one of the absolutely key people for the Hadoop ecosystem, and he's going to be really key to taking Hadoop forward. What he's going to do is take the example we showed, the credit-fraud app, and show you how easy and simple it's going to be, hopefully, in the next decade of Hadoop.

Thanks, Arun. So it only hit me today, after all these celebrations, that I've been working on Hadoop and only Hadoop for the last nine years. That's been amazing. Thanks to the Apache Software Foundation, Yahoo!, and Hortonworks for this amazing opportunity. With that out of the way, I want to give you a quick glimpse into what's happening next.

What you're seeing here is a Hadoop cluster with the familiar view of Ambari managing it. You can add machines, add services, monitor, look at metrics, et cetera. Moving on to the futuristic stuff. What we have is the notion of an assembly. Like Arun mentioned, we want you to focus on the end business use case. Instead of taking the transistors and the diodes and the capacitors and building a computer yourself, we want you to assemble things that already exist and build a business use case. So what we have here are already-built applications. The credit-fraud assembly is one that Arun has already gone through. Log Search is an assembly built out of Solr and ZooKeeper, et cetera, mainly for analyzing and visualizing all the logging events that happen in a Hadoop cluster.
At the other extreme is Apache Metron, a cybersecurity solution built out of five or six different components, satisfying one end-to-end business use case: highlighting the security anomalies in your cluster. So let's zoom into one of these applications. One of our goals, like Arun was mentioning, is to make the deployment of an end-to-end application very easy. The credit-fraud assembly here is made up of your familiar data services: Kafka, HBase, Storm, ZooKeeper, and NiFi. It would take as little as a click to deploy this application. Again, what we want you to focus on is not where HBase is running or how many connections you have to open for ZooKeeper. Instead, the user defines the number of transactions he needs and the number of analysts who will be interacting with the application, and with a click it deploys and runs on a Hadoop cluster.

Obviously we won't wait for that, so I'll take you to a list of assemblies that are already running. The beautiful thing is that all of these applications, and their individual components, run on top of YARN and Docker, so bringing up a five-container credit-fraud assembly is no more difficult than running a 1,000-container application. What we have here is the credit-fraud assembly already running the data components like HBase, Kafka, and so on. You can click on each of them. Just like you manage the entire Hadoop cluster, you can add HBase region servers or add Storm topology workers. Similarly, there are non-data components, like CometD and the UI server itself, which is what the business analyst interacts with. And the real beauty of this is that you can bring it all up together and bring it down together. A CxO could come in and say, hey, suddenly there's a massive inflow of events; I want you to scale this entire assembly as a unit. You no longer have to interact with the individual components; you operate at the business layer.

Now, going under the hood, if you want to see the definition of this assembly (again, the notion of reuse, of transistors and capacitors), this application is made up of a whole bunch of different components, and as a user you don't need to worry about how ZooKeeper actually runs inside the cluster, how CometD does, and so on. There are also more complex notions, like: I want to start my UI server only after ZooKeeper is up and running and HBase and Kafka are all set up. It's a very simple REST-based definition that you can use.
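For a sense of shape only, such a REST-based assembly definition might look something like the JSON below. This is a hypothetical illustration, loosely in the spirit of the Slider-style application definitions of the time; the field names and values are invented, not the actual format shown in the demo.

```json
{
  "name": "credit-fraud",
  "components": {
    "zookeeper":      { "instances": 3, "memoryMB": 1024 },
    "kafka-broker":   { "instances": 3, "memoryMB": 2048 },
    "hbase-master":   { "instances": 1, "memoryMB": 2048 },
    "hbase-region":   { "instances": 3, "memoryMB": 4096 },
    "storm-topology": { "instances": 2, "memoryMB": 2048 },
    "ui-server": {
      "instances": 1,
      "memoryMB": 1024,
      "dependsOn": ["zookeeper", "hbase-master", "kafka-broker"]
    }
  }
}
```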
Going back: in addition to building these applications easily, we want you to manage and monitor them. I already talked about starting the entire thing together; you can also look at the metrics of each individual component. Here's a quick glance at the HBase and Kafka components running as part of this credit-fraud assembly. To close the loop, how does the user actually interact with the application? There's a link here; this is the actual queue of transactions and events that the analysts use. An analyst will go look at a transaction that has been marked as fraudulent, do a quick preview, see the reason the transaction was flagged, and zoom in for the larger context: who is this customer, where has he been geographically, what transactions has he done in the past. And then he can make a business decision.

So all of this, like I said, is running on top of YARN and Docker. If you want to build your own tools, the REST API is available. In addition, we also want people to interact with these applications via a shell. We have a shell that lists all the packages available, and you can do things like list the applications, submit a new one, clone one, all the DevOps workflows. Bottom line: they're all running on top of YARN and Docker, and you get the same resource-management functionality. Another focus of ours is making these robust building blocks, so that each of you doesn't need to reinvent how to run HBase or Kafka or Spark on the data side; even if you want to run Tomcat and Jenkins, you can essentially click, download already-existing applications, and off you go, you build an assembly. That's all I have. Thank you.

Like you saw in Vinod's demo, we as a community really have to take Hadoop into the next decade, where it becomes quicker and easier to get value out of the technologies we sell. This is how we continue to make a larger impact in the world and let people get value out of their Hadoop deployments. Now, just to be clear, what are the design principles we start with? First, it has to be easy to use and operate. Like Vinod showed, you want to look at an application as the credit-fraud app and scale it up and down based on the number of transactions; you don't want to be scaling the individual ZooKeeper or Spark or HBase instances inside it. And by the way, if you want two different instances of the credit-fraud app, you can run both: a beta one and a production one in the same cluster, each with its own versions of Kafka and Storm and HBase.

It also has to be repeatable and portable, in the sense that you can deploy the same application regardless of which cluster you're running on, or which version of HDP or Hadoop. That's why something like Docker is so important: it isolates the application from its environment and gives you a consistent environment to deploy into, with no surprises. Last but not least, and this is a sign of us growing up, security and governance have to be thought of from day one. When we worked on Hadoop initially, in 2006, we spent a couple of years getting it working, and then Owen and Devaraj and everybody else lost a lot of hair putting security back in. That's a pattern we don't want to repeat.

So, talking of applications, talking of governance and security: we've covered the application and isolation standpoint, but equally importantly, governance and security for data matter. If you go to an enterprise today, you have stream data, you have data pipelines, you have feeds coming in, and what the enterprise wants is to set a bunch of policies on that data, whether it's prohibition, classification, lineage, or provenance. That's been a missing piece in the Hadoop ecosystem, and that's why one of the things I'm really excited about is all the work we've been doing in the Atlas and Ranger communities.
If you're not familiar with Ranger, it came out of the acquisition Hortonworks made of XA Secure, and it gave you a one-stop shop for security. But security and governance are very, very related: you can't do governance without security, or vice versa. So when we started to build Atlas with a bunch of key partners, whether it's Aetna, Merck, Target, and so on, one of the first things we had to do was marry the two together. Because now we have the capability to track metadata and lineage across all the components in the ecosystem: whether you're doing Hive, Spark, Storm, or Kafka, you want that lineage and metadata. Equally importantly, what we've done in Ranger is give you the ability to control access to data through the metadata. What I mean by that is you can now tag all your data sets and keep a business catalog of them, and then put access policies not on individual tables and columns (those are ephemeral; you'll have thousands of them) but at the metadata level. You tag a data set as PII, and regardless of who copies it or how many copies are made, the tag is automatically inherited. Now you can put policies on the tag itself: you don't have to govern individual data sets, you govern the tags. For example, a single policy restricting anything tagged PII follows the data wherever it's copied. That's very, very important as we make Hadoop the system of record. Big props to the community for putting a lot of effort there, and not just folks here in the valley: it's Aetna, it's Merck, it's Target, all across the enterprise.

So with that, I just want to give a quick highlight of the talks I'd point you to. You saw Vinod's assembly demo; there's a bigger, more detailed talk if you want to learn about it. On Atlas and governance, there's a deep-dive talk by Andrew. Last but not least, Carter and Bosco are giving a talk on fine-grained security for Spark. So far we've had security at the data-set level; what we now have, using technologies like Hive 2 with LLAP and Ranger, is column-level security (you can say only Arun can look at the PII data) and row-level filtering and masking, which means Arun can look only at rows whose value is greater than X. Those are the kinds of capabilities we want to bring into the platform generally: not just for Hive, not just for Spark, but consistently across your entire ecosystem.

With that, I'm going to wrap up. Again, thank you so much for joining all of us. This is the ninth summit, we've finished 10 years on the project, and next year we'll be at the tenth. At the rate we're growing, we probably won't fit in San Jose, but you never know. Thanks.

Thank you, Arun. I want to thank Arun, before he goes off, for his contributions to the community over the last decade, and for everything he's going to do over the next decade. So thank you. Arun, thanks. Thank you, Arun.
All right. So what I'd like to do now, having talked a little about technology and the future (assemblies, security, integrated governance, and all those components), is have two different companies come up: two end users, two customers, talking about what they're doing, how they're leveraging the technology, and what it does for them. First I'd like to introduce Adam Wenchel. Adam is Vice President of Data Intelligence and Security at Capital One, and he'll talk about the connected platform as applied to cybersecurity: how do you leverage the whole platform to find instances of the bad guys? So with that: Adam?

Great, thanks, Herb. Thank you. So those guys already covered credit card processing, so I'd better come up with something different to talk about. Hold on one second. All right: today I'm going to talk about cybersecurity, and specifically about Metron and how Capital One is using it to provide advanced security defenses powered by machine learning.

So the world is a scary place. Every week we hear about new catastrophic breaches, all over the place: OPM, Sony, lots of places. Just last week the Democratic National Committee got hacked, and the refrain you hear every time one of these happens is: how could they possibly miss that? There were signs all over the place. The media is happy to pillory people who get breached, and you hear pundits getting their 15 seconds in the spotlight. And it's true that in all those cases there were signals that people just didn't quite put together to detect the hack. But the reality is that infiltrating large, complex corporate networks and moving around unnoticed is actually not as hard as we'd like it to be, and effectively securing and monitoring those large enterprise networks is a lot easier to say than to actually pull off.

So let's look a little at why it's so challenging. For starters, scale. On our network, we're currently ingesting a couple hundred million events per hour during peak times, and by the end of this year we'll be peaking at over a billion events per hour just on our internal corporate network, and that's not even including all the external-facing systems we have. There's a huge variety of data coming in that you need in order to get the complete picture, and getting that complete picture, turning on all those data flows, is wildly complex. If we want to add, say, syslog ingest or data from our routers, we might have 5,000 routers that all need to be shipping data to the same place. It's not simple. Same thing with the network sensors we deploy: we need dozens of them to get coverage across our network.

When we talk about that complexity, this is a good example. This is just part of one of our syslog flows; as you can see, we're using Hortonworks HDF and Apache NiFi just to manage that complexity as we pull in data from across our enterprise. And each of these data flows requires coordination with multiple teams, whether it's IT, compliance, or HR, depending on the sensitivity of the data. It's a massive, massive undertaking to make sure everyone is aligned on getting that data in. And we have to watch out for compliance hurdles.
We're obviously in a very regulated industry, and we want to make sure we're being responsible stewards of all this data. And I'll tell you, none of that speeds things up; I've yet to find a data flow where the attorneys came in and sped things up as we were trying to get it online. Maybe it'll happen one day.

The other thing is visibility. With most security appliances, when you set them up you have to configure and tune rules about what you want to alert on and what you don't. And typically you're forced into a hard decision: do I want to see everything, get a lot of false positives, and be blinded by alerts, the vast majority of which are meaningless? Or do I want to tune it so that when it goes off I actually know something bad is happening, while probably missing a lot of stuff as well? That's kind of scary. There's no good answer in that situation; traditional rule-based systems just don't have the ability to generate a high signal-to-noise ratio.

And then, once you get all this data in place, making sense of it is a real challenge, because these commercial solutions tend to be very point-specific: they do their one job and do it well, but tying together all those threads and getting the full picture is incredibly difficult. Consider an example scenario. A person gets a malicious email, gets tricked into clicking a link in it, accesses a bad website, and gets tricked into downloading a piece of malware onto their computer. That malware may start surveying the local network for reconnaissance, looking for sensitive data that might be valuable to cybercriminals. To detect that, you need to be able to pull those threads together and do it very quickly, generally in milliseconds, because the damage can happen so quickly. Just for that one scenario, we're correlating events from the email servers, the proxies, NetFlow from routers, firewalls, syslog, a number of different sources, and we have to do it all in near real time. And on top of that, if that's not enough, we actually want to be able to train machine learning models too (we'll dive into that a bit more). So we're really asking a lot of our data lake.

That scenario, being able to accomplish those things, is what gave rise to Apache Metron. Apache Metron, to give a little history, was originally released about 18 months ago, in December of 2014, by Cisco. These days Hortonworks is actually the biggest contributor of code and expertise to the project, and we at Capital One are doing our part as well: we're big believers in it, and we contribute a great deal of code and know-how. What Metron is, really, is a production-ready reference architecture for big data cybersecurity. It uses a lot of the battle-tested technologies under the hood that we're all used to, to power big security-data ingestion and analytics, and it gives analysts, SOC operators, and enterprises the ability to do the things they need to secure the enterprise: real-time data exploration, and enriching streaming data with threat-intelligence feeds that we source from a number of places.
And it lets you really dive down into the detail and do things like PCAP analysis, where you go packet by packet through network traffic so you can see exactly what went on in your network and do the appropriate forensics when you think a malicious actor may have been on it. Under the hood it features a lot of technologies that people use for many different use cases: Storm topologies for streaming and enrichment of the security data, fed by Apache NiFi and Kafka, which flow the data into Storm; the PCAP analysis I mentioned is powered by HBase. The project also features a lot of great Metron-specific code, like parsers for many popular security data sources. We've contributed a lot of those, we're going to keep contributing, and we're looking for others to get involved as well.

I mentioned earlier the problems with rule-based systems. At Capital One, we believe machine learning is fundamental to cybersecurity, to ferreting the signal out of the noise. With the amount of data we have flowing in, it's the only way we're going to connect the dots at the scale we need to work at. So we're augmenting Metron to do ML at scale and at speed, and the three big goals are these: train our models quickly, deploy them quickly, and execute them quickly. Let's talk about each of those for a second.

Training: we generally train our models using Apache Spark underpinning various ML tools. A few examples: we use H2O quite a lot (they're a valuable partner for us, and it's very performant and trains quickly), and sometimes we'll use other tools like TensorFlow or scikit-learn. We let the data scientists use the right tool for the job, but we need this to be performant. Why is it so important to train these models rapidly? Because cybersecurity is a highly adversarial domain. You'll have malicious actors in there in real time, trying different things. Imagine an attacker crafting malicious emails: they try one, it doesn't work, so they keep crafting emails until one works, and each email only takes a few minutes. What we want is for our models to learn from each of those attempts, and to do that we need to refresh our models multiple times an hour, so that as an attacker evolves their tactics, our models evolve with them and detect the newer tactics much more quickly. If we only trained once a week, it would be trivial for the attacker to get through; they would just keep trying until they found a weak link and then exploit it.
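As a hypothetical sketch of that "retrain every few minutes on the freshest window" idea (not Capital One's actual pipeline; the paths, schema, and cadence are all assumptions), a rolling retraining loop in Spark might look like this:

```scala
// Rolling retraining loop (illustrative only; paths, schema, and cadence
// are assumptions). Assumes labeled events with "features"/"label" columns.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RollingRetrain").getOrCreate()

while (true) {
  // Train only on the most recent window of labeled events, so the model
  // tracks an adversary who is adapting in near real time.
  val cutoff = System.currentTimeMillis() - 2 * 60 * 60 * 1000L // last 2 hours
  val recent = spark.read.parquet("/security/labeled_events")
    .where(s"event_time > $cutoff")

  val model = new LogisticRegression()
    .setFeaturesCol("features")
    .setLabelCol("label")
    .fit(recent)

  // Version models by timestamp so deployment (and rollback) can pick one.
  model.write.overwrite().save(s"/security/models/${System.currentTimeMillis()}")

  Thread.sleep(15 * 60 * 1000L) // retrain every 15 minutes (illustrative)
}
```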
Deploying quickly: this is again something we've invested a lot of time in. We're setting up a framework for our model pipeline that uses microservices called by our Storm enrichment topology, so models can score events and alert on bad ones in real time as they flow through. The other thing about deploying quickly is that when you're deploying this fast, things can go off the rails, and you don't have time to sit there running models through quality assurance and having people hammer on them for a while to make sure they do what you need. So you have to build in a lot of model governance and assurance to keep them on the rails. We're also doing some neat things around CI/CD for models: before we deploy a model, even though we're deploying very quickly, we run it against historical data sets and compare it to the performance of previous models, the ones we built five minutes or five days ago, just to make sure nothing's going haywire. That way we can deploy quickly but still have the confidence that the model is going to do its job.
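That gating step might look something like the following sketch: score a historical holdout with both the candidate and the current model, and promote the candidate only if its AUC hasn't regressed. Again, this is illustrative; the paths, the model type, and the tolerance are assumptions, not Capital One's pipeline.

```scala
// Model "CI/CD" gate (illustrative; paths, model type, and tolerance
// are assumptions).
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val holdout   = spark.read.parquet("/security/holdout") // historical labeled data
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")

val candidate = LogisticRegressionModel.load("/security/models/candidate")
val current   = LogisticRegressionModel.load("/security/models/current")

val candAuc = evaluator.evaluate(candidate.transform(holdout))
val currAuc = evaluator.evaluate(current.transform(holdout))

// Small tolerance so evaluation noise doesn't block every deployment.
if (candAuc >= currAuc - 0.005) {
  println(f"promoting candidate (AUC $candAuc%.4f vs $currAuc%.4f)")
  // e.g. repoint /security/models/current and notify the scoring services
} else {
  println(f"rejecting candidate (AUC $candAuc%.4f vs $currAuc%.4f)")
}
```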
And the last thing is executing quickly. This one is fairly obvious, and we're fortunate to have a lot of experience with transaction fraud, as those guys mentioned earlier. That's a domain where performance is paramount: when someone swipes their credit card, you don't want them waiting for an answer about whether the transaction was accepted. With our transaction-fraud work, we were able to get the execution of complex models down to under two milliseconds, and we're leveraging that expertise in our cybersecurity work as well, so that an analyst in our security intelligence center gets notified as quickly as possible when there's a potential breach and can get ahead of the attacker.

So what does it take to get there? We've been climbing this mountain for about eight months now, with a team of about 20 people at this point. We partnered with Hortonworks, H2O, ManTech, and a few others to help us get started and accelerate our efforts, because we knew we had a very ambitious agenda and wanted to get there quickly. And already we're starting to see value from the investment. Just a month ago, about six months in, we found our very first threat on the network. It was a variant of H-Worm, a Visual Basic script remote-access trojan originally authored by a hacker in Algeria. That was something our traditional security tools had missed, because they weren't able to tie all the threads together. When we gave an analyst the platform and they could really do data-science-style exploration, they found it on their very first day using it. That was a really impressive win. And in the month since, we've gotten better and better, and we're finding more and more. So even though it took an initial investment to get to this point, we're already recouping the benefits, and it's been awesome to see people internally get excited about this stuff.

So that's Apache Metron. As I said, there are a lot of people in this audience working on it, at Hortonworks, at Capital One, and elsewhere, and we're very excited about it; more people get involved every week, and I get emails from people wanting to talk about it. So if any of you are thinking about building a similar cyber data lake and want to talk about better ways to secure your enterprise, by all means come talk to me: hunt me down after this, or talk to one of the many people involved with the Metron project here later today, because the more people join, the better it is for all of us. Thanks.

Thank you, Adam. Thank you, Herb. Thank you; I mean, thank you for your support as a customer, but also for sharing what you're doing with the platform and how you're helping on cybersecurity. Absolutely. Thank you very much for the contributions to Apache Metron.

So what I'd like to do now is introduce Progressive Insurance. I have two folks here, Pavan and Brian, who are going to talk about how they leverage the platform for customer segmentation, and also how they're looking at disruptive changes to their business model in areas like usage-based insurance. So Pavan and Brian, come on up, guys. Welcome.

Thank you. Thank you, Herb. Good morning, everyone, and welcome to the keynote. So I'm Pavan, and this is my good friend Brian. I'm the data and analytics business leader at Progressive, and Brian's our innovation strategist. That's a pretty cool title, Brian. Now, how does that translate into real-world titles? Mine is somewhat of a chief data officer and a chief analytics officer, sort of a combination of the two, and Brian's our chief troublemaker, as you'll find out. Put in simpler terms: I'm the business guy, he's the technology guy, and along with many of our colleagues at Progressive, we enable data science. That's basically our job.

Progressive, as many of you may know, is based near Cleveland, home of the NBA champion Cleveland Cavaliers. We're in Mayfield Village, right outside of Cleveland, and this is our campus. It used to be an old golf course, and now we have a whole nice campus there. At Progressive, we have reverence for data; it's basically what we do. We're an insurance company, and everything revolves around data for us. Data and big data are basically synonyms for us, and it's all baked into what we do.

One of our key tenets at Progressive is how we use the data, and we call it segmentation. Let's dive into this a little. What is segmentation? Say I have a whole group of customers and I'd like to segment them. I can look at them by driver's age; that's one nice way to categorize customers. I can also look at driver violations: how many traffic violations, et cetera, you may have had. And I can go both ways at once, driver's age crossed with driver violations. We can keep going and going; there's really an art to segmentation, and we need to keep at it. Now, each of those variables has a certain predictive power (violations are better than age, et cetera), and we use a whole slew of variables, probably upwards of 30 or 40, each with a certain segmentation power. So we've got to continue to refine our segmentation. The reason we do so much of it is that we need to find the ideal price for each customer: we'd like to get as fine-grained as possible to reach the ideal price for each customer. That's basically what we're trying to do.
So in our quest for segmentation, we launched a product called Snapshot. Many of you may have heard about it. It's an IoT device: you plug it into your car and it sends us data. Is it predictive? Let's find out. It turns out that while we have a whole slew of traditional segmentation variables, Snapshot and usage-based data are actually far more predictive than all of them. That's pretty neat. So what kind of data do we collect from this Snapshot device? Two things: time of day, what time you're driving, basically, and your speed. And we collect this data every second, so it just keeps rolling in. We have lots of cars driving around collecting this data, and we recently hit a milestone: we now have 15 billion miles of driving data. That's a pretty large corpus for us to use.

Now, what's happened over time is that our data has grown exponentially, which is great for us, but our processing time has also grown exponentially. And that's a problem, because we cannot keep up with the data we ingest: we can't process it, score our customers, and so on. So what did we do about it? We got folks like Brian, and they helped us put a system together. Brian can explain what he did. Brian?

Thank you, Pavan. So not only am I the chief troublemaker, I'm also the chief problem solver. When people see charts like this, a lot of them start to think: oh no, we have this huge problem, what are we going to do? But I get excited by this, because I know how to solve these problems, and I bet a lot of you out there know how to solve them as well. So let's take a step back and look at the situation. This is what we had: insufficient processing power, insufficient storage, and no good way to meet our business needs. Our analysts and data scientists had a lot of work they wanted to get done but couldn't. And it's not the computers saying this (you don't see these errors); it's the analysts and data scientists telling you, I have this problem.

So what did Progressive do? We invested. We invested in technology. We built out our data centers with Hadoop clusters, we leveraged Hortonworks for a distributed framework to process this data, we networked it together, and in the end, leveraging both the cloud and our existing data centers, we had a great solution. A lot of this was due in part to the support of our CEO, Glenn Renwick, and I just want to thank Glenn for supporting this project. He's retiring in two days, after 15 years as CEO. So thank you, Glenn.

So what did the result look like? How did we do? This is where we were, and this was the result. This, to data geeks, is success. This is what you want to see. Our processing time was ramping up; we implemented the technology fix; and now our processing time is way, way down, and we're able to meet all of our business needs, which is outstanding.
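To make the scale of the Snapshot feed concrete: at one reading per second per car, turning raw (driver, timestamp, speed) records into per-driver features is a classic Spark job. The sketch below is illustrative only; the column names, the storage path, and the feature definitions (for example, treating a drop of 7+ mph in one second as a hard brake) are assumptions, not Progressive's actual pipeline.

```scala
// Per-driver features from per-second Snapshot-style telemetry
// (illustrative; schema, path, and feature definitions are assumptions).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("SnapshotFeatures").getOrCreate()

val telemetry = spark.read.parquet("/snapshot/telemetry") // driverId, ts, mph

val bySecond = Window.partitionBy("driverId").orderBy("ts")

val features = telemetry
  .withColumn("prevMph", lag("mph", 1).over(bySecond))
  .withColumn("hardBrake", (col("prevMph") - col("mph") >= 7).cast("int"))
  .withColumn("lateNight", (hour(col("ts")) < 4).cast("int"))
  .groupBy("driverId")
  .agg(
    count(lit(1)).alias("secondsObserved"),
    avg("mph").alias("avgSpeed"),
    sum("hardBrake").alias("hardBrakes"),      // 7+ mph drop in one second
    avg("lateNight").alias("lateNightShare"))  // share of driving after midnight

features.show(5)
```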
So given that we've solved our scalability issues, where do we go from here? What's the future? Well, we were talking before about segmentation, looking at violations and driver age. But I know that you are more than just a 31-to-35-year-old with one violation. You're an individual, and your driving behavior, how you drive and how you act as an individual, is much more predictive than just being lumped into a bucket.

So what does the individual look like? Let's take a look. This is a simulation; I'm going to show you what our Snapshot program looks like. This happens to be one trip from one of our customers, actually one of mine, so I'm not sharing sensitive customer data. I hope nobody gets carsick; just bear with me. I'll speed it up here, so we're getting more data collected as I'm driving down the street, and I can show you the type of data elements we collect. Pavan said we collect speed and time, and that's true; that's what we score on. But we also collect additional data elements for research purposes: longitude, latitude, altitude, all of those, plus accelerometer data, into our big data systems.

What we do with that is start analyzing it. We can plot it out and look for trends: how does a specific driver operate their vehicle, and is it risky or not? At the top we have a chart showing the speed of the vehicle; the middle one is acceleration, how fast you're speeding up or slowing down; and the bottom one is direction, which way you're heading. We can also bring in external data about where you are. Here's a list of roads: I went from Main Street to Church Street to High Street to West Street, and I have all these attributes about the roads. I know the type of road it is: urban or rural, a divided highway. These are very useful pieces of information. If I'm doing 65 miles per hour on the highway, that's great, that's safe driving behavior; it's not safe in a school zone. Now we know the difference.

So let's put this in context. I'm this yellow-school-bus-looking thing here, driving along. When we start analyzing and visualizing the data this way and understanding what our customers are doing, we come up with a lot more insights; it really empowers our data science community internally at Progressive to come up with new ideas. One thing we came up with: let's look at where accidents occur. These red cylinders represent locations with a higher frequency of accidents than other places. So as I'm driving this yellow-school-bus-looking thing through a high-accident area, if I'm not going at the right speed, if I'm not being safe, maybe my insurance rate should be adjusted. If I'm not a safe driver, I should pay more. Stepping back from that, we also look at other external data sets (weather data, traffic data, all sorts of external data) to enrich our predictive models and really figure out how you drive as a customer.
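The enrichment Brian describes, joining per-second GPS points against road attributes and high-accident areas, could be sketched as below. The grid-cell join is a deliberate simplification of real map-matching, and every path and column name is an assumption; it reuses the hypothetical telemetry table from the earlier sketch, now assumed to carry lat/lon as well.

```scala
// Contextual enrichment: bucket GPS points into a coarse grid and join
// against road attributes and accident-hotspot cells (illustrative only;
// real map-matching is considerably more involved).
import org.apache.spark.sql.functions._

// Rounding to 3 decimal places is roughly a 100m cell at mid-latitudes.
def cell(c: org.apache.spark.sql.Column) = round(c, 3)

val points = telemetry
  .withColumn("latCell", cell(col("lat")))
  .withColumn("lonCell", cell(col("lon")))

val roads    = spark.read.parquet("/geo/road_segments")     // latCell, lonCell, roadType, speedLimit
val hotspots = spark.read.parquet("/geo/accident_hotspots") // latCell, lonCell

val riskySeconds = points
  .join(roads, Seq("latCell", "lonCell"))
  .join(hotspots, Seq("latCell", "lonCell"), "left_semi")   // keep hotspot cells only
  .where(col("mph") > col("speedLimit") + 10)               // meaningfully over the limit
  .groupBy("driverId")
  .agg(count(lit(1)).alias("riskyHotspotSeconds"))
```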
So that's the micro view: one individual driver. But we can also look at the macro view. Here's a map of the United States, and we're going to light it up with where all of our Snapshot customers are, animated over time. Here we see where we have active drivers: the East Coast wakes up first, and it follows the sun over to the West Coast. It almost looks like a wave; it's very organic. It also gives you some idea of how it feels to be awake at 4 a.m. here on the West Coast when you're usually on the East Coast. Very interesting to see these big macro patterns when we analyze data this way.

So what does this mean? What does the next-generation stuff look like? As Pavan said, Snapshot is our most predictive indicator of loss at Progressive; it's very valuable. The next-generation work, which leverages location data and these external data sets, is even more predictive, and extremely valuable for us. Why? Because it lets us put you as an individual at exactly the right place on the pricing curve. We know how risky a driver you are, and we know what the perfect price is for you as an individual. It's as fair as we can possibly be. It's all about perfect pricing and being fair to the customer.

So what's the overall value of this? Why do we do it as a company? It's valuable to Progressive, yes, but it's also valuable to our customers. Since we rolled out Snapshot, we've given over half a billion dollars in discounts to our customers: 563 million dollars. That's real. Thank you. It's not some science experiment that's just going to go away. This is the future, it's important, and it provides more value to all of our customers. And that is why we are Progressive. If you want to learn more about any of this, I'd encourage you to attend our session at 12:20 today, where we go into a bit more depth and show you more good stuff. Thank you.

Pavan, yeah, thank you. Thank you very much, Brian. It's always great to see this use case, because it's a perfect example of using the platform to ingest high volumes of data and correlate lots of different data types you couldn't correlate before: taking analytics from looking at a class of people and what's probable for them, down to a particular individual and what matters to that individual. And that's only possible because you can correlate many different data sets in real time and ingest them from sources like the car for usage-based insurance. It's a great example of what you can do to transform a business and open up new revenue streams.

So as we close down the keynotes, let me talk a little about the five announcements that came out today and the important ones to give some visibility to. One is the announcement around the connected data platforms: HDP 2.5 and some new innovations coming out around it. Things like enterprise Spark at scale, including Spark 2.0. This is really the ability to separate the core components of the platform (HDFS, MapReduce, YARN, et cetera), which everybody wants rock solid and stable as they run their production systems, from the innovation happening on top with Spark and other engines, where you get rapid iterations. You're starting to see that in HDP 2.5. There's also more access for users through Zeppelin and notebook capabilities, and readiness for the enterprise in terms of Ambari and backup and recovery. So a lot is happening there. And there's the announcement around HDInsight as the premier cloud solution, in deep partnership with Microsoft.
There's a lot of work we're doing with Microsoft; you had Joseph and Assad and team up here talking about some of the innovative things they're doing. There's also the expansion of Partnerworks to over 1,800 partners, and the opening of a new category of partners around managed service providers. We've seen in the industry that companies are saying: not only do I want to leverage the platform, I potentially want someone to run it for me. So companies like Accenture and Rackspace and others have now partnered to provide that type of service as managed service providers.

Two other areas. One: we've seen companies wanting to offload some data from their data warehouse for cost, economics, or performance reasons, and we're now expanding that through a partnership with AtScale, so you can take BI and OLAP and other workloads and run them effectively on Hadoop. The work with AtScale is in addition to what we do with companies like Syncsort and others, where you can ingest data from mainframes and other sources, bring it into Hadoop as a platform, and process and analyze it there. And lastly, probably the most exciting one, is the work Rob mentioned around the genomics initiative: what we're doing with DJ Patil and the government and a whole series of companies, putting together a genomics initiative around what we can do to help cure cancer. So a lot of exciting announcements and great initiatives.

But we've also got a party. This evening at 6 p.m. there will be an exhibitor reception with drinks in the reception hall, and I'd encourage you to join that. Tomorrow evening is the big party, the celebration of 10 years of Hadoop, and that will be here in this complex as well.

And lastly, tomorrow we'll start at nine o'clock, with two core areas. In the keynotes, a number of partners like HP and Yahoo will be speaking, but also, again, more customers: GE and ASU will be talking about their initiatives and how they're leveraging the platform. And then one of the areas that's always very popular, so I'd encourage you to be here for it, is the customer panel. Tomorrow we've got four customers coming up for a very interactive discussion: what types of problems are they solving, what are the use cases, how do they leverage the platform, what's important to them, what else do they need, and what are they asking the community to help build? It's always a great discussion, so I'd encourage you to join us for that.

So with that, thank you everyone for joining day one of Hadoop Summit 2016. It's time for a break; I'd encourage you to visit the exhibit hall before you head to the afternoon sessions. Thank you, everybody, and welcome to San Jose. Thank you.