Ladies and gentlemen, please welcome co-founder of Databricks, Andy Konwinski.

Welcome to the Spark Summit. I'm super excited for this summit here in New York. Last time we were here, we were half this size. New York is an awesome place to have a Spark Summit. New York is to other cities what Matei Zaharia, the creator of Spark, is to other engineers: the definition of a great city, and a perfect spot for a Spark Summit. Before I introduce Matei, who will give our first keynote, I want to ask you a question. Why are you here? What do you hope to take home with you from this summit? Are you hoping to find use cases in the industrial track this afternoon? Data science, machine learning, or just to meet the creator and the committers? Whatever reason you're here, the next two days are going to satisfy it. We've got talks lined up in the afternoon in parallel tracks, and we've got keynotes in the morning. Our first keynote this morning will be by Matei Zaharia, the creator of Spark, and he'll be talking to us about another number that's doubled in size: we're going from Spark 1.0 to 2.0. Matei will tell us about all the great things that are on the horizon with Spark 2.0. Matei is also, in addition to creating Spark, a professor at MIT, and one of the co-founders and the CTO of Databricks. Welcome, Matei.

Thanks, Andy, and welcome everyone to Spark Summit East. It's really exciting to be out here and to see so many people here in New York. So welcome again. I'm excited to talk to you today about Spark 2.0, which will be the next major release of Spark. Before that, I'm going to talk a little bit about what happened in 2015. 2015, as I think a lot of people are aware, was a really, really great year for Spark. A lot of the numbers in the community, a lot of the use cases, have gone up very significantly. So just some statistics from 2015 compared to the year before. Attendance at Spark Summit almost quadrupled. In 2014, we had our second Spark Summit in San Francisco, with around 1,000 people. In 2015, we went to three locations, including New York and Amsterdam, and we went up to 4,000 total attendees. The number of meetup members in smaller meetup groups throughout the world has also increased dramatically, from 12,000 at the end of 2014 to around 60,000 at the end of 2015. And finally, something I'm really excited about: the number of contributors to Apache Spark has also grown. At the end of 2014, there had been 500 total contributors to the project. At the end of 2015, in December, we hit 1,000. That means there were 500 new people who contributed to Spark in 2015. So I'm super excited. Let's give a round of applause to everyone who's contributed. It's a great milestone for the project, and I'm sure we will continue this growth. To show just one snapshot of this, this is a map of all the meetup groups that existed on meetup.com at the beginning of last year, and this is a map at the end. We also saw meetups open on a bunch of new continents and in new places, and we still see a lot of new meetups starting every week. And in the code itself, we got a whole lot of new components built and released last year, including things like DataFrames, Project Tungsten, SparkR, our machine learning pipelines, and lots of new features that people started using right away.
One of the things that's really exciting for me as one of the developers of Spark is seeing how quickly people start to use these new things and give feedback on them.

Okay, so I want to talk, though, about Spark 2.0. This is going to be the next major release that we'll make. It's slated to come in April or May, depending on how things line up—somewhere around the end of April. It builds on all the stuff we learned in the past two years. So even though it is 2.0, it's not a revamp of the whole project, but it is a chance to tie together some loose ends and also to add a bunch of really nice and significant features. Because we had so many features coming in this release, we decided to make the version number 2.0.

So, just so you understand how versioning works in Spark, this is how it's set up. You have your version number with three pieces. There's the major version; the main thing about major versions is that they may change APIs, although we don't like to do that, usually. There's the minor version; this can add APIs and new features, but it can't break the existing ones. In fact, we keep not just API compatibility, but binary compatibility with the past release. And then there's the patch version; this can only add bug fixes, no new features.

So in reality, even though 2.0 is a new major version, we really hate breaking APIs. You can ask anyone who's worked with me on Spark how much I push back on this. And it's because, as a user, the worst thing is when you have something that works, and then you upgrade one of your packages and it stops working. You're just stuck; you can't do anything about it. So in 2.0, we're not going to break lots of APIs. We will change a few of them, but only in some pretty rare cases where they cause dependency conflicts, and I'll talk about that in a little more detail. One example, if you're familiar with the Java API, is the use of Guava, which is the Google collections library. We use it for one thing, which is Optional. And the Optional class in Guava, even though it's a very simple class—if you're a programmer, it's a collection with either zero or one element—is not backwards compatible across versions of Guava. So we're getting rid of that. These are the kinds of changes that will happen. So hopefully, for most users, the update to 2.0 is not going to break their code.

So what are the actual features? There are lots of features coming in the release; I'm going to highlight three of them that I think are the most important. The first is the continuation of Project Tungsten to speed up Spark, especially the structured data part of Spark. We have some really cool optimizations landing in this release that will give speedups of 5 to 10x for some really important uses. Then we have structured streaming. This is a higher-level streaming API, similar to DataFrames, similar to Spark SQL; it's built on the structured data engine in Spark. The other exciting thing about it is that it's really meant to push Spark beyond just streaming, to a new class of applications that do other things in real time—they don't just analyze a stream and output another stream. I'll talk more about that later. And finally, we have unifying Datasets and DataFrames. This is the most technical, but it creates a really nice foundation for the future growth of the project.
So let me start with Tungsten, sort of phase two. Just as background on Project Tungsten if you're not familiar: since Spark was released five years ago, hardware has changed a bunch. The main thing that's changed is that CPUs aren't really getting much faster, whereas I/O has gotten much faster. So in a lot of big data applications, the bottleneck used to be the network or the storage, but now it's increasingly the CPU. Now, how do we improve that? We want to make as large a part of Spark as possible execute really close to bare metal, the same way as native code. Basically, there are two pieces: there's native memory management, which bypasses the Java VM, and there's runtime code generation for a lot of the high-level libraries—we generate expressions that run fast on this native memory without creating lots of Java objects.

Tungsten came out last summer in Spark 1.4, and since then we've been adding basically this binary storage layer and basic code generation, which did provide a whole bunch of speedups as well as improved robustness for large-memory workloads. We also added two APIs, DataFrame and Dataset, that let you use Tungsten in user programs. The key idea in these APIs is that normally, in the Spark API, you just had a bunch of Java objects floating around, and there wasn't that much we could do to optimize or change the representation. In these APIs, we get data with a known schema, and even though you can have a shim on it that looks like Java objects—which is the Dataset API—we control the storage underneath and can do much more efficient execution for a lot of operations. It's also used in Spark SQL and parts of MLlib.

So what's coming next? In 2.0, we have two really big optimizations. There's actually a talk about these later today from Nong Li, so you can get a lot more detail. The first is whole-stage code generation. What this means is that it removes the iterator calls between different operators in Spark: if you do a bunch of operators together, like a map and a filter and a group-by, it fuses them all into one snippet of code that's optimized to do just those three operations, with no virtual calls. That leads to some pretty significant speedups. Here's a simple example from a benchmark: this is just processing data once it's in memory in Spark with a simple SQL query, and it improves by almost a factor of nine through just better code generation. The second thing we're optimizing is input and output. We found that I/O, both from Parquet and from the columnar cache built into Spark, isn't always the most optimized, and with a little bit of work there—actually, I should say a bunch of work there—we're able to make that quite a bit faster as well. Parquet is the on-disk format we're focusing on first, but the same things will apply to ORC and other formats in the future, and it's also about a factor of nine. The really cool thing about these is that you don't have to change anything in your programs. They automatically apply to SQL, DataFrames, Datasets, MLlib in many cases—anything that's built on Tungsten. So that's one of the things; check out Nong's talk on optimizing Spark this afternoon for more details.
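As a rough illustration of the kind of query whole-stage code generation helps with, here is a minimal notebook-style sketch against the Spark 2.x API; the data size and column names are illustrative, not taken from the talk:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TungstenSketch").getOrCreate()

// A billion rows generated on the fly, followed by a simple aggregation:
// exactly the kind of CPU-bound work Tungsten targets.
val result = spark.range(1000L * 1000 * 1000)
  .selectExpr("id % 100 AS key")
  .groupBy("key")
  .count()

// In Spark 2.x, explain() marks operators fused by whole-stage
// code generation with an asterisk in the physical plan.
result.explain()
result.collect()
```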
The second thing is structured streaming. This is one I'm really excited about. It's more on the new API side, but it does some really cool things. Basically, we see real-time processing becoming increasingly important for a lot of Spark users and big data users in general. But what we discovered, talking with users of Spark Streaming, is that most applications don't just need to do streaming. They're not just, "here's a stream, apply a map function, and get another stream." Really, the most interesting and most important applications combine streaming with other types of data analysis, including batch and interactive queries. And this is something that current streaming engines don't really handle. They're built for streaming: give me a stream in, I'll give you a stream out, that's it.

Just as examples of other types of applications: a super common one we see in pretty much all users of Spark Streaming is, "I want to build up or track state using a stream." For example, I want to track sessions of users on my website, and then I want to run interactive queries on this state—because it's real-time data, I have real-time questions to ask about it. But this is a combination of streaming and interactive that isn't really handled by current streaming engines; you have to put the data somewhere else and query it, and it becomes very operationally complex. Another example is, "I train a machine learning model offline, and then I want to apply it to a stream, or even update it using the stream." And again, you need an ML library that works across these things, and maybe a system that can go back to the offline portion. Spark is obviously very well suited to do this, because it supports all of these types of computation, and we're also looking to make them super easy to combine.

So what is structured streaming? It's a higher-level streaming API that's built on the Spark SQL engine, as well as on a lot of the ideas in Spark Streaming. It's a declarative API that extends DataFrames and Datasets. That means it can run over Tungsten to get all those optimizations, and it can do a lot of optimizations on its own, like the logical optimizations you'd get there. There are a bunch of higher-level features that you currently have in Spark Streaming, such as event time—out-of-order data can be handled using all the operators—windowing, different types of windowing, very easy session creation, and a really rich API for data sources and sinks: many of the same things that make Spark SQL easy to use. But on top of doing streaming, it also supports interactive and batch queries in a way that no other streaming engine does. For example, you can aggregate the data in a stream and then serve it using the Spark SQL JDBC server, and just have ad hoc SQL queries that act on the latest state. It's super easy to do that with Spark. You can change the queries at runtime—you can add queries, remove them, and so on in this engine—and you can also build and apply machine learning models, and we're making most of the libraries in Spark interact with this in both the batch and streaming settings. So really, the idea here—and we'll talk about this more—is that we don't just want to do streaming; we want to do what we're calling continuous applications, which are end-to-end applications, including, say, something that reads a stream and then serves queries off of it. There isn't really any single platform today that does that, so that's why I'm excited about it.
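To make the direction concrete, here is a minimal sketch of what such a job looks like, written against the Spark 2.x structured streaming API as it eventually shipped; the schema, input path, and query are illustrative rather than taken from the talk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("StructuredStreamingSketch").getOrCreate()

// Illustrative schema for incoming page-view events.
val schema = new StructType()
  .add("user", StringType)
  .add("page", StringType)
  .add("ts", TimestampType)

// Treat a directory of JSON files as an unbounded input stream.
val events = spark.readStream
  .schema(schema)
  .json("/data/events")   // hypothetical input path

// Windowed count of page views per user; event-time windows mean
// late or out-of-order records still land in the right bucket.
val counts = events
  .groupBy(window(col("ts"), "10 minutes"), col("user"))
  .count()

// Keep the latest aggregates in an in-memory table so ad hoc SQL
// can query the live state while the stream keeps running.
counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("page_counts")
  .start()

// Once a trigger has fired, the aggregates are queryable like any table.
spark.sql("SELECT user, `count` FROM page_counts ORDER BY `count` DESC").show()
```

The same DataFrame operations would run unchanged on a static Dataset, which is the point about testing a program locally on batch data and then running it on a stream.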
Just to give a really small picture, here are some things you can do with structured streaming. The actual streaming part is: you take in data from something like Kafka, and you do maybe ETL—extract, transform, and load—basically a map function or a group-by, and you stick it into some other system like a database. Sometimes streaming is about, okay, you do ETL, then you build a report, and then you serve it to some applications. With structured streaming you can also do the following: you can just do ad hoc queries on the state built from the same stream, with no need to copy it into another system first. This stuff in orange is other processing types that aren't traditional streaming. Likewise, you can train, say, a machine learning model and maintain it—that might involve running a batch job once in a while—and then serve it, applying it back to the stream. Again, it's very hard to do that with a purely streaming engine. So basically, the goal is to have these end-to-end continuous applications. Spark 2.0 will have the first cut at this, which will focus mostly on ETL, but it will lay a lot of the groundwork for the other things, so hopefully you'll get some of these other features too. Later versions will add more operators and libraries. Reynold's keynote tomorrow morning will be all about structured streaming, and Michael Armbrust also has a talk later tomorrow with even more details, so I hope you stick around tomorrow to hear about this.

So now let's talk about Datasets and DataFrames. These two APIs, if you're not familiar with them, are pretty new, but they're pretty exciting. We added these APIs as ways to work with structured data in Spark, mostly to enable the Tungsten engine underneath and to give us more control over memory layout and execution, so that we can get really fast execution on all the hardware coming out. DataFrames are dynamically typed and available in Python and R, so they're really nice for scripting, but maybe not that great for building large, complex programs. Datasets, which came out in 1.6, add static typing, so you can have a Dataset of, say, People, and you can view it as Java objects—but things only get converted to Java objects when you act on them; underneath, they're still represented using the binary Tungsten format, so Datasets also run on Tungsten. So in Spark 2.0, the main thing we're doing is merging these APIs. Both APIs were marked as experimental—especially Dataset, which was still pretty new—but we think we've got enough experience to actually combine them and finalize them. And by merging them, basically, a DataFrame will just be a special Dataset of objects of type Row. It's really interesting.
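Here is a minimal sketch of the pattern described next, written against the Spark 2.x Scala API; the case classes and file path are illustrative:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain classes mirroring the example in the talk.
case class User(name: String, id: Long)
case class Message(user: User, text: String)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// read.json gives back a DataFrame, which in 2.0 is just Dataset[Row]...
val df = spark.read.json("/data/messages.json")   // illustrative path

// ...and as[Message] matches the JSON schema against the case class
// fields, giving a statically typed view over the same Tungsten data.
val messages: Dataset[Message] = df.as[Message]

// map now stays in the Dataset world instead of dropping down to RDDs,
// and the typed result can be handed straight to libraries such as MLlib.
val users: Dataset[User] = messages.map(_.user)
users.filter(_.name.startsWith("A")).show()
```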
So, as an example of how you might use this: say we have a bunch of classes in Scala—the same thing works in Java and Python, with very similar APIs—like a User with a name and an ID, and a Message with a user in it. A lot of Spark's input methods give you a DataFrame, so you can load a DataFrame from, say, JSON, and this is actually a Dataset of Row objects. And if you call as[Message] on that DataFrame, Spark will match the fields of the Message class with the fields in your JSON schema, and suddenly you have static typing: you can pass this around, and your code knows what type it is, so you get all the software-engineering benefits of static typing. Then you can do operations on it that look very similar to the RDD API, and they keep the types throughout. This is the one part that's different from the current DataFrame API, and it's a small API break: currently, map on a DataFrame gives you back an RDD, and we're changing it to give you back a Dataset. So this is the one small thing we have to fix here, but the benefit is really nice, because now this is all unified. And finally, all the libraries—you can pass your ML pipeline just a Dataset of users or messages or whatever you want, and it's all set up in terms of these, so it's a really fast way to move data between different libraries. So that's what we're doing.

The benefits: it's a lot simpler to understand. In fact, the only reason we didn't do this change when we released Dataset in 1.6 is because we wanted to keep binary compatibility, so we couldn't break parts of the API the way we can in 2.x. Libraries will take data in both forms, so when you write a library like MLlib, you don't need to worry about what classes the user is using, as long as they describe their data in a schema you understand. So that's pretty cool. And with streaming, we think—I was told to say we're 98.2% sure—that we'll use the same Dataset class to represent an infinite data set, or stream. So these will actually work on streams, they'll work on streaming, and it's really cool because you can test your program locally on static data and then just run it on a stream. That's, again, something you don't see in other streaming systems. So this will come out in 2.0. Long term, RDD will remain the low-level API in Spark: if you want full control over how your data is represented as objects, what code you run, and all that, RDD is the way to implement it. But new libraries in Spark will increasingly use Datasets and DataFrames as the interchange format. As examples, structured streaming and MLlib are already using them, and another one you'll see—this is in the research track today—is a research project between MIT and Berkeley called GraphFrames, which is a graph API based on DataFrames and SQL that also uses them.

So those are some of the top features in 2.0. There are a lot more things coming, but I hope this gives you an idea of what to be excited about, and all the things I mentioned have detailed talks about them later today and tomorrow, so I hope you check them out. Thanks again for coming out to Spark Summit East, and we hope you enjoy the rest of the program.

And maintaining a focus on stable APIs—it's really impressive. Spark 2.0 is very exciting. Next up, we have a co-founder and the CEO of Databricks. Ali Ghodsi will be talking to us about democratizing big data.

Thank you, Andy.
So let me see if I can get the slides up here. Okay, they are up there. Thank you, and welcome. I'm going to talk about democratizing access to Spark. You've heard us speak here before, and we've told you that we created Databricks to simplify big data. By that, I mean that we wanted to enable more people in organizations to be able to ask questions of their data—we wanted to democratize access to Spark, not just have it be the few people who know how to access it and write these complicated programs. And toward this, I'm going to make an announcement in this talk that I'm really excited about, but it's toward the end of the talk. Before that, I want to talk about the journey we've had over the last two years with Databricks and Spark.

When we started Databricks, we decided to host the platform in the cloud, and the reason we did that was that this model allows us to configure everything for you and make sure it works end to end. Second, we could have rapid releases—that means we could get our software out into the hands of customers every week or every other week. The main benefit of that, of course, is that you get feedback on the features you're releasing, so you can iterate and learn from your mistakes much faster than you otherwise could. And finally, it enables dynamic use cases: you can spin up environments and deploy your use cases whenever you need to—if you need 100 machines for 3 hours, you can do that. You can also compose with other services, because there are already other companies providing you services in the cloud, so it's much more dynamic in that sense.

So let me talk about the platform that we built and presented last summer at Spark Summit. Around this, we also built a lot of integrations. These were features like security, governance, auditing, multi-tenancy, and production jobs—features we had to build for the first time in the cloud around Spark. So we did that, and these are table stakes for many enterprises. On top of this, toward this goal of democratizing access, we also built integrated notebooks and visualizations that talk directly to Spark. And all of this, of course, you can run on top of any storage system that you have. Spark is great at federating queries down to Hadoop or data warehouses, or, more frequently these days, storage that you have in the cloud.

So let me talk a little bit about how it's used so far. The first use case—and over 80% of our customers use it—is the just-in-time data warehouse, and what this really means is that you separate compute from storage. Hadoop in the early days would combine the two, but what we do here is separate them: you store your data, often in the cloud, in a very elastic way—you pay for whatever you're storing there, and it's very reliable—and simultaneously you use the Databricks platform to spin up compute for your project. I've listed here that three out of the top 10 mass media companies use this, and what they were able to do is take ideas they had and get them all the way to an app in a much shorter time—they could shorten this from months or weeks down to days.

The second use case builds on top of the just-in-time data warehouse: it's the advanced analytics use case, and this is about building richer models and doing advanced analytics. Radius Intelligence has a talk this afternoon that's really exciting; they'll talk about how they built a very complex model, using machine learning, of 20 million companies out there in the world.
And the final use case is one we see more and more, especially this year. There's a lot of excitement around real time, and you have a lot of people working on real-time streams, and they can combine batch, real time, and all the other libraries that are in Spark. Here we have a top-five credit card company that's doing loan approval in real time: people apply for loans, and in real time it will run machine learning and figure out whether to approve or decline that loan.

So what's the main lesson? We've worked with hundreds of companies, and while we've solved the technology bottleneck, there's still a human bottleneck: companies still struggle with big data projects, and the main reason we see is that there's a really steep learning curve for developers. It's still hard to learn this stuff. The reason for this is that, historically, you would do development on your own machines, and that's costly and time-consuming. A lot of people struggle with this: first they have to acquire machines, paying hundreds or thousands of dollars just to get those; then they have to set up and configure those machines, which can take a lot of time; and then, finally, to build the actual applications, they have to stitch together many apps, there might be poor documentation, and they struggle with this. So how can we get around this? How can we empower more developers to get access to, and insights from, big data?

In 2014, we set out to train people and help them overcome these hurdles. Our goal was ambitious: we wanted to train 2,000 people, and I think we hit that toward the end of the year, and we were really happy. In 2015, we tried something a little bit different. We launched two massive open online courses, MOOCs, and we were overwhelmed by the numbers: 125,000 people took our courses, over 20,000 finished the course end to end, and together they accumulated over 500,000 hours spent just learning Spark. We were very humbled by this, and we wanted to double down on it. So we were thinking: how can we multiply this and democratize access to Spark?

And this brings me to the announcement I mentioned earlier. I'm proud to announce that today we'll be releasing Databricks Community Edition. What is Databricks Community Edition? It's a free edition of the Databricks Spark platform. You'll get access to mini Spark clusters—clusters that you can freely use. You'll get notebooks, dashboards, the collaborative features that I mentioned earlier, and the APIs that I mentioned. But more importantly, you also get continuous delivery of content that we will be uploading there. So you will already have access to the courses and the MOOCs that I mentioned—those are already uploaded there—and we'll also be uploading how-tos and documentation. And this actually uses a version of what we trained on in the last couple of days. Okay. And that's the Databricks logo with the community building up around it.

So I'm really happy to say that today, every attendee that's here gets access to this. And on top of this, we're going to make it really seamless for organizations to transition: if they want bigger clusters than those mini clusters I mentioned, or if they want to build production pipelines, or get access to any of those enterprise features like security and governance, we're going to make it easy for you to set up your own accounts, put in credit card information, and upgrade to the professional and enterprise tiers. Okay.
So without further ado, I want to welcome on stage Michael Armbrust to do a demo of Databricks Community Edition. Michael is one of the main committers on the Spark project; he's the lead on Spark SQL. So welcome.

Thank you very much, Ali. I am super excited to be here today to show all of you some of the really cool things that you are all going to be able to do using Databricks Community Edition. As you can see, we've got my email inbox up here, and I've received my invite to join the beta program. Everybody in the audience should be getting this throughout the day. And for those of you who are watching this on the live stream, I encourage you to head over to databricks.com, where you can sign up for the wait list. We're going to be trying to expand this beta program as quickly as possible. So all you have to do once you get this email is click to activate your account. It's going to take you through a sign-up flow, and after that it's going to drop you into your own personal copy of Databricks. This is your one-stop shop for creating Spark clusters, creating interactive notebooks, and learning about Spark in general.

As Ali said, this is pre-populated with a whole bunch of educational content. So if you just head over to the workspace, you can start with the basics in the Databricks Guide. This gives you all of the details of using Databricks itself: how to create a Spark cluster, how to create a notebook, and even advanced topics like how to take a DataFrame and create an interactive visualization with it. For those of you who are just getting started with Spark, we've also got you covered. If you go back to the workspace, you'll see that we've actually got an entire college course about learning Apache Spark. This is an award-winning massive open online course taught by Anthony Joseph out of UC Berkeley, called Introduction to Big Data with Apache Spark. We've integrated it into the workspace, so all you have to do is click on it, and you can see all of the lectures in these YouTube videos that you can work through at your own pace. And really, it's even more than just watching a bunch of videos—this is a fully interactive experience. So if you click on one of the labs, you'll see that this is also an interactive notebook. And if I want to actually follow along and test my Spark knowledge, all I have to do is click on Import Notebook, and what it's going to do is take a copy of this and move it into my home folder inside of Databricks. Once I've done that, I can go to any of the cells that contain code and hit Shift-Enter to run it. And you'll notice that as soon as I did that, it actually attached me to one of the Spark clusters that was already running in the cloud. So that's pretty cool.

But personally, when I try to learn something new, the way I like to do it is to just dive in and start analyzing some data. And when you're working with a big data product like Spark, sometimes it's difficult to find an interesting data set to get started with. Fortunately, Databricks is actually preloaded with a bunch of cool data sets, and one that I think is particularly interesting is the Wikipedia clickstream data. So again, if I click on this, it's going to take all of the code that I need to access this data set and clone it into a notebook in my home directory. Let me describe what this data set is all about. This is a data set that was released by the Wikimedia Foundation.
It contains aggregate statistics about all 3.2 billion requests that Wikipedia received during the month of February 2015—so, just a year ago, all of the requests to Wikipedia. And what they've done is aggregate it down into source and destination pairs, so you can actually track the flow of traffic through Wikipedia: how people are clicking from page to page. As an example, let's look at a visualization of what the data looks like for the New York City Wikipedia page. As you can see, a majority of the traffic comes from Google, which is kind of unsurprising. But another major source of traffic is the New York State Wikipedia page. And similarly, if we look at where people go once they're on this page, they click on other related topics, like New York, Manhattan, United States.

So now we understand this data set; let's actually dive in and take a look at it. You'll see the first line of code here actually loads the data set from the Databricks file system. It's already pre-populated, and we've converted it into an efficient format, Parquet, so we can read it pretty quickly. So again, as soon as I hit Shift-Enter, it attached me to my cluster, and now we can take a look and see what the records of this data set look like. We'll do display(clicks), and it's going to run a short Spark job, and you can see an example of the kind of data that we're going to be working with. What this is telling us is that 52 people clicked from the article Valley Parade to the list of accidents and disasters by death toll, which is pretty heavy reading.

So, okay, pretty cool. Now let's try asking some more complicated questions of this data. I'm in particular curious about the flow of traffic within Wikipedia itself. So I'm going to use Markdown to explain what I'm doing as I go along. We'll say: what percent of clicks come from other Wiki pages? This is pretty easy to calculate using DataFrames. We'll start by calculating the total number of clicks in the data set: we'll take clicks, and we'll calculate a sum of all the n values, and then we'll tell Spark to run that job and get the first entry. And then, to calculate the clicks from within Wikipedia, we'll do wikiClicks equals—and we'll take the same code here—except this time we'll apply a filter to remove clicks that are coming from other sources: so, where prev_id is not null, where we actually know the page it's coming from. So now that we've got these two numbers, calculating the percentage is pretty easy: we'll just do wikiClicks divided by allClicks, and then multiply by 100 to make it a percentage.

And as soon as I hit Shift-Enter, it's actually firing off a distributed Spark job, running in the background. You can see that it's actually taken that data set and split it up into 39 pieces. And if we click on View, we can get a better idea of what's actually going on under the covers. It's doing this calculation in two different phases: one phase that calculates the sums for each of the individual partitions, and then it does an exchange, or a shuffle, which collects all the data into one place and calculates the total sum across all of the pages. If we look at the next job that's running, we can see it's doing something very similar, but this time it's also doing a filter to remove all of the clicks that aren't coming from Wikipedia. So, pretty cool.
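For reference, the computation assembled interactively in the demo looks roughly like this as a standalone sketch; the dataset path is an assumption, and the column names (prev_id, n) follow the public Wikipedia clickstream schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()

// Pre-converted Parquet copy of the clickstream data (path illustrative).
val clicks = spark.read.parquet("/databricks-datasets/wikipedia-clickstream")

// Total number of clicks in the data set.
val allClicks = clicks.agg(sum("n")).first.getLong(0)

// Clicks whose referrer is another Wikipedia page (prev_id is known).
val wikiClicks = clicks
  .where(col("prev_id").isNotNull)
  .agg(sum("n")).first.getLong(0)

// Fraction of traffic that stays inside Wikipedia.
val pct = wikiClicks.toDouble / allClicks * 100
println(f"$pct%.1f%% of clicks come from within Wikipedia")
```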
And what we learned here is that actually 33% of the traffic to Wikipedia is coming from Wikipedia itself—pretty cool. A lot of people just clicking on through the web, which I know is something I can get lost doing for a while. But that actually took a while: as you can see here at the bottom, it took 37 seconds to run. And Matei was talking this morning about a bunch of really cool performance improvements that are on the horizon. It could be pretty cool if we could actually play around with those. Typically, using the bleeding-edge versions of Spark requires you to go to the Apache website, download the code, compile it, and deploy your own cluster. But fortunately, in Databricks, it's a little bit easier. I've actually pre-started a cluster running Spark 2.0, so I'll go up to the Clusters menu and detach my notebook, and all I have to do is attach it instead to the Spark 2.0 cluster. And now we can ask the question: how much faster is Spark 2.0? Hopefully faster. So now that I've attached to the cluster, all we have to do is reload the dataset, and then I can take exactly the same code that I was running before, paste it down here, and hit Run. And it looks like it's going faster. If we actually dive in and look at the details, you can see exactly what Matei was talking about in his keynote: whole-stage code gen has actually fused all of the different operators together into one very efficient operation that takes advantage of all of the modern features in CPUs today. And as you can see, that actually took 13 seconds—so, you know, pretty fast. Cool.

So now that we're on a hyper-optimized version of Spark, let's move forward with our analysis. The next thing I want to do is select an interesting set of pages, and we can do this using SQL. Since Spark is a unified platform, we can actually switch back and forth between different programming paradigms based on what's the best tool for any given job. So we'll say select star from clicks, and let's do some filtering here. First of all, I'd like to exclude Google—I want to zone in on just the things that are coming from within Wikipedia—so we'll say where prev_title not like, and then we'll exclude anything that starts with 'other'; that's any of the search engines. We'll also exclude traffic from the main page: prev_title is not equal to 'Main_Page'. And now let's zone in on a specific set of articles. We'll say curr_title in, and let's pick something topical—maybe Donald Trump, T-R-U-M-P. And we probably don't want all of the clicks to the page; that would be pretty hard to visualize. Let's just take the top ones: we'll say order by n, to order by the number of clicks, and we'll limit it to just the top 20. And so you can see here are the top referrers to the Donald Trump page.

So now we've got this, but I actually want to do a cool visualization with it. And to be perfectly honest, I'm actually not a very good JavaScript programmer. So I'm going to do what any good programmer does: I'm going to go to Google, and I'm going to type in 'Spark SQL DataFrame force-directed graph D3'—just some keywords about the type of visualization I'd like to make. And we'll see—oh, that's convenient—there's a link at the top, and it just so happens that there's an example of how to create this kind of graph with Spark SQL. And if this looks familiar, that's because it should.
This is actually a notebook that has been published to the internet. And the coolest part about it being a notebook, instead of just some fragment of code you find on Stack Overflow, is that it's really easy for us to take this and import it into our Databricks workspace. So this looks like a pretty cool visualization that I'd like to use. If I just click Import Notebook here, it's going to give me a URL, and I can copy this URL and head back over to my workspace. In my home directory, I can click Import, and we'll paste the URL, and what Databricks is going to do is go and download that HTML, parse it, and take the code and insert it into my workspace. As you can see, there's actually more than meets the eye here: there's an entire library of code to create this visualization. So now that we've got that, let's just copy it and take it back to our original analysis, where we can use it. I'm going to go here, paste it, and hit Run. And now that library has been compiled and loaded into my cluster.

So now we've got a bunch of results from a SQL query, and we've got this Scala library for doing visualization, and we need to combine the two. Normally, this would take a fair amount of boilerplate to translate the rows into the correct format. But, as Matei was again talking about this morning, there's a pretty cool feature that we debuted in Spark 1.6, and that we're improving a lot in Spark 2.0, called Datasets. Just to visualize exactly what I'm talking about, let's pull up an image here. What you can see is that Datasets are actually a really nice bridge between the semi-structured, relational world and the type-safe, object-oriented world. So if I just take the sample code from here and copy it, I can then take my SQL query from above and insert it here as the set of clicks. So: equals, sql, and all I have to do to translate it into the edge format that this library is expecting is say 'as Edge'. And when I hit Enter, Spark is actually going to do a mapping—it's going to tell me, do these column names line up? Do I know how to map this data? And if it doesn't, it'll provide a helpful error message about how to fix it. In this case, it's telling me I need to tell it which column is the source. So let's provide that mapping: we'll say the previous title is the source, the current title is the destination, and n is the count. Now we hit Shift-Enter, we have our visualization, and Donald Trump is in the center, exactly where he'd want to be.

So let's expand it to a couple more candidates: Hillary, and let's also add Bernie Sanders, and hit Shift-Enter again. And you can see now—let's make this a little bit bigger so you can see the whole thing—we can actually see the interrelations between the candidates, as visualized by clicks on Wikipedia. We've got Hillary over here, and you can see that the United States presidential election, even a year ago, was a major source of traffic for all three of these Wikipedia articles. And among the other candidates, both Bernie and Hillary actually share a bunch of traffic from the Democratic Party presidential candidates page. So, pretty cool that you can see this, even a year ago, in this data set.
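Put together, the query and the typed conversion built up in the demo look roughly like this; the Edge class stands in for the one defined in the imported visualization library, and the path and column names follow the clickstream sketch above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Stand-in for the Edge type expected by the D3 helper library.
case class Edge(src: String, dest: String, count: Long)

val clicks = spark.read.parquet("/databricks-datasets/wikipedia-clickstream")
clicks.createOrReplaceTempView("clicks")

// Roughly the ad hoc SQL from the demo: top referrers to the candidates'
// articles, excluding search engines ('other...') and the main page.
val top = spark.sql("""
  SELECT prev_title, curr_title, n
  FROM clicks
  WHERE prev_title NOT LIKE 'other%'
    AND prev_title <> 'Main_Page'
    AND curr_title IN ('Donald_Trump', 'Hillary_Clinton', 'Bernie_Sanders')
  ORDER BY n DESC
  LIMIT 20
""")

// The Dataset bridge: rename the columns so they line up with Edge's
// fields, then view the rows as typed objects the library can consume.
val edges = top.select(
    col("prev_title").as("src"),
    col("curr_title").as("dest"),
    col("n").as("count")
  ).as[Edge]

edges.show()
```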
So now that I've got a pretty cool visualization, I'd like to share it with some of my friends. And fortunately, collaboration is built in as a first-class feature inside of Databricks Community Edition. So I'm going to go over to settings and add a user. Let's invite my friend Miles here at databricks.com and send him an invitation. And so, in Miles' inbox, he should have gotten an invitation to come and join my workspace. In the Community Edition, you can only share your workspace with three people, but you can actually be invited to an unlimited number of different workspaces, so this is a great way to collaborate on different types of data analyses that you might be doing. If we go back over to this Wikipedia clickstream notebook here, we can see that Miles has actually joined my workbook, and down at the bottom, I think he's already started to do some cool visualizations. And it looks like he's studying hipsters, which are strongly associated with Brooklyn. Cool.

Okay, so that's really nice. And sharing with Miles was great, but there are a lot of people here, and I'd actually like to share this analysis more widely—I'd like to share it with the whole world. And this is, I think, probably the coolest feature of the Community Edition: the ability to take any notebook that you've created and click Publish. What it's done is it's actually sent a static copy of this out to the Internet, and it's giving me a URL that I can share publicly with anybody in the world. So if I copy this over here, on Twitter I can say, "check out the demo from hashtag Spark Summit, Community Edition," and we'll paste that right there, and I will tweet it. So now you all have access to that code, and you can do your own visualization. And if we click on this, just to see exactly what it ends up looking like, it's an exact copy of this, statically published as HTML that you can look at. So, pretty cool. And here comes my favorite part of the demo: the part where we let all of you loose on Databricks Community Edition and see all the cool things you can do. So, thank you very much.

Thanks a lot, Ali and Michael. That was pretty awesome. We're going to be rolling out Community Edition codes to you over the next two days. I'm skeptical about one thing, though—I went to grad school with Michael, and we shared a cube, actually. So, I'm going to be talking to you as a developer. Next up, we have Sean Connolly, who's the VP of Strategy at Hortonworks. Sean's going to be talking to us about bringing Spark to the enterprise. Let's welcome Sean.

Good morning, Spark Summit New York. I'm Sean Connolly. I focus on strategy at Hortonworks, so I spend a good bit of my time in the big data space, if you will. For those of you who don't know me, two things: I'm a long-time Philly guy, so pardon my accent—I live here on the East Coast. I'm a long-suffering Philadelphia Eagles fan. I'm also an open-source addict, as I like to describe myself. I've been in open source since the early days of JBoss, so a little over 12 years: JBoss, Red Hat, SpringSource, and the last four and a half years at Hortonworks. The next eight or nine minutes or so are really about how we view Spark in particular, and how you enable open-source technology to be very widely deployed across mainstream enterprises. I'm sort of in this great position where I get to see innovative open-source tech, but also figure out how to translate it to the enterprise.
And so, while the tech is cool—and I'll get into some of the thinking that we have here today—it really starts with why businesses should care, right? Who cares, right? At the end of the day, Spark, particularly in a broader data architecture, helps those enterprises unlock really enormous potential from their data—any and all form factors of data, whether that data is in motion, at rest, or anywhere in between. So that's what I'd like to focus on here, and then I'll talk a little bit about how, when we think about integrating Spark into these data architectures—whether they're real-time or, like, 10 years of historical analysis—how does that fit, and how can that unlock things?

So the first example I'd like to share is the story of Webtrends. Some of you may have heard the story of Webtrends: over the past few years, they've been early adopters of not only Hadoop and Spark, but other Apache open-source technologies in their architecture. It's a petabyte-scale problem; they've been in the space of serving analytics to their customers for a while. Thirteen billion daily events processed, latencies as low as 40 milliseconds—so this is pretty serious infrastructure. They were able to consolidate their Spark cluster and their Hadoop cluster into one, so they can operate it, secure it, and manage it centrally; they get a lot of economies of scale there. But what I like about their journey—and actually this visual here—is that it has truly been a journey, where they started off with data discovery and web log analysis and single-view use cases, and each use case keeps building on the other as you assemble more and more data, and ultimately the integration of Spark into this architecture enabled them to unlock a new product for the market, right? So it isn't just about doing really cool analytics; they were able to identify and tap into new revenue streams. Their Webtrends Explore offering is really about enabling their customers to do more ad hoc data discovery scenarios and interact with the data, along with the traditional Webtrends experience, if you will, at scale.

Two other examples I'd like to use: one is really a communications company, and that's about monitoring channel changes as you're watching TV, being able to deliver targeted advertising and that kind of stuff—allocating ads in real time. I've talked with a few people here so far, and I've heard this story at other companies a few times, so it's not a new use case. The other use case—and I love these types of use cases—is big data and Spark being applied to railroad companies, right? So this is not just the purview of high-scale, petabyte-scale web-monster-type thinkers. This particular use case I lovingly refer to as the train doctor: the trains have sensors, they're capturing images, and they're really trying to make sure they stay on top of the maintenance of the rail. So there's a lot of data in different form factors—GPS location and other things—that they're bringing in to do more real-time analysis of the rail so they can prevent accidents. I live down in the Philly area; I took New Jersey Transit up here; I would like them to actually use a solution like that as well, right?
It was a little bumpy ride—we had to stop a few times—but that's how this new age of data is changing how people fundamentally think about what they can do, what the art of the possible is. These are the types of use cases that inspire me when I do my job as part of the Hortonworks team and work with our customers on use-case identification, if you will.

So now I want to share with you some of the trends. Like I said, I'm an open-source addict; the pace of innovation in the technology is just phenomenal, and I would argue that the Apache Software Foundation played a key role not only in the age of the web, with the Apache web server, but also in this age of data—there's just a lot of innovation in Apache, Apache Spark being a perfect example of that. The implications for making this approachable by mainstream enterprises: there's clearly the data API, right? You'll learn a lot about things there, and I have some comments on that. Then there's the enterprise readiness and hardening—how do you make it easy to use and consume, and familiar for the mainstream? And then there's more work that clearly needs to be done around data science and analytics, because it is an emerging frontier, right? There's always innovation there, and we have to enable people to get on this innovation bus, if you will. They can't just wait for it to stop; they have to figure out how to time their entrance onto it, because it's going to continue to move.

On the data API side—and you saw some of this in the earlier sessions—it's about providing that surface area for developers. I've spent a long time in the developer ranks, at JBoss and SpringSource, and talking to developers on the big data side, it's really around really innovative analytic apps, less about web and mobile, but the care and feeding of developers is very much the same. How do you give them a rich set of APIs? How is it elegant? How do you remove obstacles to adoption? And then, when you go to deploy it, how does it integrate easily, right? Maybe abstract how it integrates. Spark is a great example of that: it's able to federate data from almost any data source, right? Integration is part and parcel of getting the data into the spot where you can do interesting things with it. And at that point it will be a critical tool in the enterprise toolbox, where it will be a natural way to develop apps.

On the hardening side, clearly there are things around HA and DR. In the Hortonworks realm, we actually have two platforms: a Hadoop-based platform that we integrate Spark with, and a Hortonworks DataFlow offering where things like Apache NiFi and Kafka are part of that architecture. When you use them together, you get that sort of joined-up experience. But these notions of security, encryption, governance, and so on don't go away; you have to address those, particularly for mainstream enterprises. And from a scale perspective, it's not just on-premises: we partner with Microsoft around their HDInsight service, so Spark in the cloud at global scale is important, right? That's why the innovation needs to continue to move forward. I'll close out on some of the thinking there. But again, my developer mentality is: how do you make the analytics development process as agile as possible, right? People use data science, and at times it feels like it's an unapproachable thing, and to me it's really about how you enable agility.
How do you democratize that and make it easy, with better tooling around it? And there's no single definition of a developer, right? You're a Scala developer, a Java developer, a Python developer, an R developer, or you're using higher-level tools and doing more business-oriented development. So you need to make sure that you're addressing the experience across the range, from those who really want to get down and dirty and don't want tools in their way, to those who really want a great out-of-box experience at a higher level. How do you democratize that across all layers, right? So when we think about investing and partnering and integrating, these are some of the things that are top of mind.

So, to close out, and to encourage you to hit up some of the sessions: in the area of geospatial analytics and data science, things like the Apache Zeppelin project, entity resolution functions, and geospatial analytic functions are important as accelerators in that space. I think there's actually a session later on about the Magellan geospatial library and how you think about geospatial—that's important for that train example, but it's also important in the insurance industry and everywhere else. Everything has a GPS location, so how do you make it easy to create those types of applications? To accelerate capabilities for enterprises, how do you make sure it integrates in a familiar way with a lot of the tools and technologies it makes sense for this technology to integrate with? Again, there's a broader modern data architecture that needs to be enabled. And then there's always the notion of continuing to innovate at the core—whether it's integrating with Hadoop or integrating with streaming technologies, how do you enable it at the core to do that really well and continue to move the pace of innovation forward?

With that, I want to close out. I'll just give you one more teaser: on March 1st, one of our partners, the HP Labs folks, who have created some really interesting Spark technology—we'll be talking about that on March 1st. So the innovation train continues—it never stops—so just stay tuned. And with that, I hope you enjoy the conference, and thank you for your time.

Thanks a lot, Sean. Next up we have a talk by somebody at IBM. Anjul Bhambhri is not a first-time speaker at the summits. Today, this morning, she's going to talk about Spark as the analytics operating system.
Anjul is VP of Product Development for Platforms and Analytics at IBM. Welcome, Anjul.

Welcome, everyone; it's a pleasure to be here this morning. A few years ago, IBM celebrated its 100th anniversary, and over this last century many of our engineers and scientists in the labs have invented scores of game-changing technologies: from mainframes to PCs, from Fortran to SQL, and from Deep Blue to Watson. So we are very proud of our storied history. But not every technology that we have bet on and supported was invented in our own labs: we recognize a good thing when we see one, and we get behind it. We do not have a not-invented-here syndrome. Linux is a prime example of that. It was in the year 2000 that IBM announced a $1 billion investment in Linux, and that really accelerated the pace of innovation in open source; CEOs and CIOs took notice, and Linux entered the enterprise worldwide.

Several years ago, we bet on another game-changing technology, which we believe is as game-changing for analytics as Linux has been for operating systems, and that technology is Apache Spark. So today we are all here to celebrate Apache Spark. What started six years ago as Matei Zaharia's PhD thesis at UC Berkeley's AMPLab, as he just shared, now has 1,000 contributors and, at least when I counted, at least half a million lines of code—though with the new announcements that must have gone up—and it is entering the enterprise just like Linux did. We at IBM think this technology is so fundamental that we think of it as the analytics operating system. And why do we think that? There are many reasons. Never before has such a rich set of foundational analytical capabilities all come together in one platform, in one stack. Spark is really the single toolbox for analytics. If you have structured data, you can use Spark SQL; for semi-structured or unstructured data, you can drop down to Spark Core; if you have data coming from the fire hose, there is Spark Streaming; for building models, you have MLlib, to make use of machine learning; and for learning from graphs of data, there is GraphX. The beauty of this is that all of these components work together in a seamless manner. In the past, if you needed these kinds of capabilities, you would have needed at least half a dozen products, each with its own install, configuration, and nuances, and today you just need one foundational platform, which is Apache Spark.

So we at IBM, we love Spark. We are enhancing it: we have the Spark Technology Center that we started in San Francisco last year, and there we have our engineers and committers on Spark who are contributing to SparkR and to Spark SQL; they are fixing bugs and improving the documentation. We are also offering it as part of our products, both on-prem and in the cloud. Spark is part of our BigInsights product, for customers who want to transition Hadoop MapReduce workloads to Spark in a seamless manner, and we also offer it in the cloud as Spark as a service. And we are leveraging the scale and the unified programming model of Spark throughout our product portfolio.

So I'll give you just a glimpse of how we are leveraging this in our portfolio. We already have about 15 IBM products that we shipped last year which are all leveraging Spark, and over a dozen more are in the works in the labs. Just to give you a few examples of how our product portfolio has benefited from the expressiveness and the speed of Spark: our ETL engine and our data-shaping engine we re-platformed in one year, and we reduced the number of lines of code from 40
million to about 5 million—I mean, this was huge for us. For SPSS, by pushing down to Spark and by leveraging MLlib and SystemML—which is something we contributed to open source last year—we are seeing about 3-6x performance improvements in predictive model execution over hundreds of terabytes of data. Our Watson Analytics now makes use of these re-platformed ETL and predictive analytics capabilities, and they have been able to leapfrog to interactive and visual analytics. I also want to take the opportunity to plug a technology that we just announced: it is called Quarks. It's a platform for building end-to-end IoT applications, and it's a lightweight, embeddable framework—so it's embeddable in the edge nodes—and it is tightly integrated with Spark, and we are making it available to the open source community.

So, moving along, let's look at how some of our customers are leveraging Spark. IBM recently acquired the digital assets of The Weather Company, and you probably know them through brands like The Weather Channel or weather.com. It's really a data company: they provide weather data to Apple, Google, Facebook, Samsung, and many others. They serve about 30 billion API calls every day, which is almost 60 times the number of tweets that happen on a daily basis, and they have about 120 million mobile users, so the chances are that if you use your phone to check the weather—anything to do with weather—you're probably going through them. When you look at all of the data that The Weather Company has to handle, generated by these API calls, the mobile sessions, the page views, and the weather observations, it amounts to close to 360 petabytes of data on a daily basis. For this, they needed a platform to serve this data to the users, and this platform obviously ingests and aggregates this data; they needed a way to crunch this data in a repeatable manner; and obviously there are different users working on this platform, so they needed to collaborate and shorten the time it takes to go from data discovery to insights. The data has to be processed and analyzed in batch as well as streaming, so they use a lambda architecture for that, and of course Spark is at the heart of it: they make use of Spark Streaming, Spark Core, and Spark SQL, and for the NoSQL store they make use of Cassandra. With Spark, what they have been able to build is a platform that handles 360 petabytes of data, is linearly scalable, and is cost effective. There is a session by Robbie Strickland today where you can learn more about it; Robbie has been a part of the team at TWC that made this happen.

Now let's move on to Spark in medicine. In our Watson Health portfolio we have a solution called Explorys, and Explorys enables healthcare providers to build data lakes, which they then use to improve the way healthcare is delivered. Explorys is a platform that can collect, combine, and gain insights from data that is coming from a lot of data sources: this could be clinical data, operational or financial data, coming both from sources inside the enterprise as well as from the external network. We call this the healthcare enterprise, because it represents the convergence and standardization of big data that is both inside and outside the enterprise. Explorys has just started using SystemML, and let me tell you a little bit about SystemML. At a very high level, think of it as the SQL for machine
learning, and it also has a machine learning algorithm optimizer. This is something that we had contributed to open source, and we are trying to integrate it with MLlib. So, on top of SystemML, which is running on Spark, they have built risk models, and with these risk models they are now able to predict adverse medical events and alert the doctors in a timely manner.

Another client of ours is a telco ISV, and they are helping their customers, which are large telecoms, improve customer satisfaction rates. Today, when customers interact with telcos—or, for that matter, any business—it could be over voice, it could be email, it could be chats, it could be the web, live chats, text, or even tweets. Often the same customer is using different channels to interact with the business, and if they have to repeat the same information across those channels, it gets frustrating for the customer. So this ISV has built a platform that creates a 360-degree view of each customer: they stitch together all the interactions the customer is having with the business across all the channels, and this they call the customer experience journey. Then they analyze this journey data to extract sentiment and take any necessary actions. Once again, Spark is at the center of this platform. Spark Streaming is bringing the data together into a single processing pipeline; Spark Core is being used to perform text extraction and voice processing—they used to use MapReduce for this before, so by going to Spark Core they are seeing a speedup of about 4x compared to Hadoop MapReduce; MLlib is being used for correlation and sentiment analysis; and then they are using Spark SQL to drive visual dashboards for interactive query and reporting. And the benefit is that customer satisfaction rates are increasing.

So these are just a few examples of how the various elements of the Spark stack are coming together to provide complete solutions that handle large amounts of data, do analytics, and deliver value to the business. For those of you who are already on Spark, I would say welcome to the revolution, and for those of you who have a big data problem and are trying to decide which technologies would help you: try Spark.