All right, so hey everybody, Karl Epiglisi here. I like to keep it informal. I've been with IBM for a few years now; before that I worked for a company where I was director of innovation, so I've always dealt with emerging technologies and trends. My focus at IBM is as the global data science and big data evangelist for our platforms and tools. So I'm going to cover a little bit about what's emerging in the data science space, particularly the team approach and Apache Spark. I'll go in pretty deep on why Apache Spark is important in the ecosystem, open source tools versus the managed tools out in the data science space, and at the end I'll give you a quick demo.

So how many people here actually know Apache Spark pretty well, or have gone deep with it at all? Hands-on coding? Python developers at all? A little bit, okay, cool. And any R developers or data scientists in the room? Okay, good, this will be a good talk for you then. I was worried about either staying too high level or going too deep.

First: I talk to clients all the time. I meet with big clients, and everybody faces the same problem. We live in a digital age, and you know this, but everything we do now, how we live, play, and learn, is digital. The way customers are handled, and the way companies deal with the digital evolution, is different for each company. I work with companies such as the large banks, Bank of America, JPMorgan Chase, and these guys are going deep into data science and transforming their businesses, but their main money is still banking, and everybody is faced with how to innovate and keep up in today's market. Harvard Business Review recently reported that 72% of companies are vulnerable to disruption from digital businesses. Digital businesses are disrupting industries; you see it with the Ubers of the world, but it's happening and it's real, and it's all being done with data science and application development. So that's what I'm going to get into here.

Data-driven organizations, however, are really built around a team; data science is a team sport. At IBM we've focused on enabling four personas, plus the chief data officer. The data scientist is at the core: this is the person that builds your models and algorithms, does discovery work, and figures out new business models. The business analyst has the business expertise; they're the ones identifying opportunities that could become available based on your data. The data engineer is the person you deal with a lot, the DBA type who's got all the data and doesn't want to let go of it. I'm more of a developer type myself, but the data engineer is still relevant: their job is bringing in that data and keeping it secure, all the things developers typically don't want to deal with. And the application developer is the one truly putting this in the hands of your customers and building the applications. Right now, though, there's a real shortage of data scientists, because the data scientist has become the go-to persona for all the problems we're trying to solve, when the reality is that what we're really trying to do is put data science into our applications.
And so a lot of companies have data scientists assigned in silos, doing discovery, data modeling, and development, and then working with the engineers, developers, and data analysts. For them to be truly productive, there's a lot of friction there. So what I want to talk about a little is how that role has become a really difficult job given the tools in this space, why Apache Spark is changing that a bit, and why the application developer will become more relevant in this role.

So really, what is a data scientist? By the way, I forgot to mention: I've got three hoodies here, IBM Spark hoodies. Whoever asks questions first gets a hoodie, and I've only got a medium, a large, and an extra large, one each. So if you have a question, interrupt me.

A data scientist really has three main skills. There's the statistical side: the statistician, who understands algorithms and statistics. There's domain expertise: understanding the domain you're in, the business problem, and the challenge you're trying to address. And then there are the programming skills, the computer science skills. There's no one person that has all three, and that's why data science is a team sport; we see this in big companies. We have some people who are pure mathematicians and are very good data scientists; they may also understand the domain, but they may not be as deep on the coding side of things. Then you have people who are more coders and developers, who understand one of the other two but not both. So data scientists can fall anywhere in this diagram, and that's what makes up a data scientist.

Data science tools. Gartner recently came out with its Magic Quadrant, and I know IBM's up there, so it's a shameless plug, but if you look, they're all big vendors, companies that make a profit; there are no open source tools on it. My point here is that managed platforms are provided, but there's a variety of tools and a variety of approaches being used. IBM obviously has our SPSS product, which is quite mature, but we're also embracing open source, which is really what I want to cover in this talk. Gartner doesn't put open source up against managed platforms; they just don't do that. But if you look at Google Trends (this is not market share), you'll see the interest in the open source community around Python and R has grown considerably. SPSS and SAS are managed platforms with quite a bit of market share, but you'll see there are a ton of people using the open source technology. Scala is low because it's a niche area, but a lot of people develop in Scala and do data science work there.

Any questions so far? All right, yes? Do I know what the market share of open source would be, if you compared them? That's a great question. There's really no data on that that I know of. There was recently a great blog post where one of the analysts compared the Gartner report that just came out in February against all the open source tools, and I think more work will be done on that, but they really break it down by vendor-managed tools, with open source kept separate, so I don't know that for sure.
I would think open source has a big market share too, maybe more, but IBM tells me we have a good market share. I'm sure SAS tells their people they've got good market share too. You told Gartner that, right? Well, Gartner is independent, so you can't really buy those guys, from what I hear. Hey, that counts as a question. Do you want a Spark hoodie with the IBM logo on it? Sure. What size? Probably large. Oh, it's the last large; let me see here. It's not for sale. It's not for sale? You can't sell it. If you sell it, you've got to split it with me. There you go. At any rate, I've got two more; I don't like to hand them all out at once, but I've got more at home I want to get rid of.

So here's the open source ecosystem I talk about all the time, and if you're looking for a job, seriously, I'm making good money: learn big data, learn Apache Spark. You can go either the R path or the Python path. I recommend Python; I particularly like Python better, but I'm the programmer type, and a lot of stats people like R. I did a search, I think on Indeed, on these data science roles, and companies are hiring like crazy. Everywhere I go there's a shortage, and these are the skill sets that are really coming together well; even at IBM, our tool sets are built around these technologies. So, just a heads up.

If you look at Python: is everybody here familiar with Jupyter Notebooks at all? Okay, good, then my demo's going to land well. Jupyter Notebooks is an interface that lets you do Python development, and it works well with Spark through PySpark (we'll get into more depth on what Apache Spark is). Python is obviously the language, and scikit-learn, a machine learning library, along with the machine learning capability in Spark, make a good combination of tools for creating predictive models. On the R side, you've got RStudio and Shiny; Shiny's a great visualization dashboard tool, if you're familiar with it. They all complement each other really well, and I'll give you a demo of what that looks like.

So, Apache Spark. I'm going to be very heavy on Apache Spark here because this is ApacheCon. For folks that don't know, or folks that do: Apache Spark is an in-memory application framework for distributed computing, for iterative analysis on massive data (I almost forgot that part). A lot of people think of Spark as the same as Hadoop, but it's really not. Is anybody familiar with Hadoop, with what Hadoop is? Hadoop is really a two-sided coin: it's got a compute engine, and it's got a distributed file system, and they work together. Spark is really just that compute engine; that's all it is. It's a lot, but it doesn't have the storage layer, and there's a reason for that. We call it our analytic operating system. At IBM we've embraced Spark quite a bit and we're investing in Spark a lot, but at bottom it's an analytic application framework that does computation on the fly and scales out, obviously distributed, with no real limit. I've rambled that one on pretty badly.

So Spark's really hot. It's one of the top active open source projects; if you compare it to Kafka and Storm and Flink and some of the others, there are way more contributions and commits, so there's a lot of activity going on with Spark.
And there's a reason. With MapReduce, writing a simple word count took tons of lines of code, maybe a hundred; with Spark it's really three lines of code. It simplified development against big data so much, and it's so much faster, and that's why there's so much interest in it. I'm going to explain why.

So here's a high-level overview; is anybody familiar with this slide at all? This is Apache Spark (I'm supposed to stay by this microphone, but I like to wander). This architecture shows what Apache Spark is: a set of API layers that sit on top of a core compute engine. The core compute engine is quite sophisticated, much better than MapReduce. It uses a directed acyclic graph, a DAG, that manages all the transformations. There are two types of operations in the compute engine: transformations and actions. For every transformation it keeps building up the graph, and then at the end, when you hit an action, it executes them all together as one computation. That makes it much faster than MapReduce, which spills back and forth to disk. I always tell this story: if you have a file with a million rows in it, and the next line of code says, just sum up the numbers in the first hundred rows, why should it load the whole thing into memory? Spark waits for the action and plans everything together, so it knows it only needs to read the first hundred rows to give you back your result. It's a better way of doing distributed computing, and it's very powerful.

The next thing that's great about Spark is that it's got a variety of APIs that are all part of the same project. Hadoop, by contrast, is an ecosystem of projects: Hive is one project, MapReduce is another, Storm, for streaming, is another. With Spark, you have Spark SQL built in. You have streaming built in, with DStreams as the API. You have machine learning capabilities, MLlib, which is what I'm going to be digging into in this session, and then you've got graph capabilities. And it doesn't matter which data store you use: companies have a variety of data stores, and you don't have to use just HDFS. You can use a variety of data stores, public cloud or on-prem, and Spark works with all of them together. In a lot of jobs that I do, I pull data from a relational data warehouse and supplement it with a file on a distributed file system, or I can put it in S3 or something, bring them together, and do analysis on the fly. So it's quite powerful.

Any questions on the architecture? Where does the data go in Spark? Spark has a cluster, just like Hadoop: you install Spark on a cluster, and it has tasks and executors that run on a bunch of nodes. The data gets pulled into memory, and if there isn't enough memory, it spills over onto disk. Is the data in memory kept just for one thing? Just for the scope of the driver application. When you create a Spark context, it'll be aware of what the application is telling it to do.
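Stepping back to the word-count and lazy-evaluation points for a second, here's a minimal PySpark sketch of what I mean. The file path is made up, but the shape is right: the transformations only build up the DAG, and nothing actually runs until the action at the end.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# The classic word count: roughly three lines in Spark versus ~100 in MapReduce.
counts = (sc.textFile("hdfs:///data/some-text-file.txt")  # transformation: nothing runs yet
            .flatMap(lambda line: line.split())           # transformation
            .map(lambda word: (word, 1))                  # transformation
            .reduceByKey(lambda a, b: a + b))             # transformation

# Work only happens when you hit an action; Spark plans the whole graph
# and computes just enough to answer it.
print(counts.take(10))
```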
So, back to that Spark context: when the application makes a query that pulls data from different sources, Spark creates connectors into those sources, pulls the data into memory, runs the computations or iterates through it, and then gives the result back to the application. Does that make sense? That was a really good question. What size? Medium or extra large? Thank you; I'm going to throw it at you. And so, to that really good question: Spark is part of the big data ecosystem, but for me it's really more of an application development paradigm.

Any other questions? Could you use it against a traditional database as well? Yes. You should get the last one. You don't want it? It's extra large. For us fat guys? Well, anybody bigger want to take the question? No? Just kidding. Now nobody's going to speak; they're all thinking, I'm not bigger. I'll just keep it myself; it's too hot in Miami anyway.

But yes, you can use any data source; there are connectors. Some connectors are optimized: if you have a distributed data source, S3 for example, the connector from Spark to S3 can co-locate the compute with the data. HDFS is like that, and I know our object store is like that. So some connectors are smart enough to put the computation where the data is persisted, if you see what I mean. It depends on the connector. Some connectors, say to relational data warehouses, go through one node, which can mean a bottleneck depending on the size of your data. But Spark's really good for machine learning applications even when you don't have big data. I talk to some of the SPSS people and they say, oh man, I've got a million rows to deal with, and I tell them, for Spark a million rows is not a big deal at all, even if it's all pulled through one node. So it's quite interesting: it's very different from the traditional way of thinking, and it's quite powerful. Good question.

The other thing that works well with Apache Spark is the open source notebook. Years ago, people used pen and paper: they'd have a thought, a formula, they'd write it down, write the numbers down, and record the results. Now it's all done in a browser-based application: you write a line of code to request some information, it comes back, you see it, you visualize it, then you do more computations on it and iterate, thought to computer and back, in notebook style. It's a really interesting way for data scientists to work, and it's very good for developers as well; I'll show you a demo of what that looks like.

Spark also supports multiple programming languages, so you can interact with it in several ways. (Somebody's turning the lights off. I think we're good.) There's Scala: Spark is written in Scala, and the advantage of Scala is that everything new developed for Spark becomes available in Scala first. SQL is on the slide too. Python is another language, and it's growing in usage: if you look year over year, Scala's share is going down, R is new but starting to grow, and Java is decreasing. Java's the harder one; there are more lines of code.
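To put that connector discussion into code, here's a hedged sketch of the multi-source pattern I mentioned earlier: pulling a table from a relational warehouse over JDBC, supplementing it with a flat file on S3, and joining them on the fly. Every URL, credential, and column name here is invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-source-sketch").getOrCreate()

# Pull a table from a relational data warehouse over JDBC.
# (This kind of connector typically funnels through a single node.)
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://warehouse.example.com:5432/sales")  # hypothetical
             .option("dbtable", "customers")
             .option("user", "analyst")
             .option("password", "secret")
             .load())

# Supplement it with a flat file sitting on distributed storage (S3 here);
# this connector can co-locate compute with the data.
events = spark.read.csv("s3a://my-bucket/clickstream/events.csv",
                        header=True, inferSchema=True)

# Join them in memory and analyze; Spark doesn't care where each side lives.
customers.join(events, on="customer_id").groupBy("region").count().show()
```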
So what those language numbers are telling us is that the usage growth is around Python and R, and that's where the data science workload, the data science use cases, are coming to Spark. Then look at Spark library usage: DataFrames have taken off like crazy, and that's again ideal for the data science workload. SQL and streaming are both on the increase, and machine learning is increasing as well. This is from a survey of which APIs are being used in production; I expect next year you'll see machine learning grow even more.

So IBM is betting big on Spark. We've invested in the Spark Technology Center, where we've put, I think, about 100 engineers to do nothing but contribute to Apache Spark. We also use Spark in our portfolio, so we're not contributing just because we like Spark; we're doing it because we're putting Spark into a lot of our own tools and applications. Watson uses Spark, some of our health applications use Spark, even our security applications use Spark under the covers. So we're using Spark a lot and contributing quite a bit. Everything we do, we put out into the open source community, and we also help clients who just want to use Spark themselves. When we created the Spark Technology Center, Ben Horowitz, the big VC guy, said it was like Spark just got blessed by the enterprise rabbi, which is pretty funny. And the STC continues to grow; it really shows we're committed.

If you look at our contributions to Apache Spark, we're contributing heavily around machine learning and Spark SQL. We have a lot of mind share out there. So we've been growing Spark SQL, machine learning, and PySpark, which brings Python capabilities into Spark for distributed work. Those are the areas we've focused on, and R is becoming another hot area where we're doing a lot of contributions. Databricks, the founding company, has most of the committers. They have Matei Zaharia, the guy who invented Spark at AMPLab (which we support). So they do a lot of contributions; we're not going to beat those guys.

But I actually got lucky once: I have a selfie with Matei. I was on a flight from Boston to Tampa, where I live, and all of a sudden this skinny guy sits down right next to me, and I'm thinking, man, he looks familiar. I start talking to him, and he goes, oh yeah, I'm in distributed compute, data science, all that stuff. I'm like, Databricks? He goes, yeah, I'm Matei. The founder, the creator of Spark. I felt like a groupie; I'd done nothing but learn Spark lately, so it was kind of funny. But he's a nice guy, and Databricks is a partner of IBM's, so we work with them on a lot of projects, particularly GPU acceleration with our R&D. We've done quite a bit with those guys.

So we're number two in contributions behind Databricks, but if you look at all the rest, we have more contributions than the next five companies. Actually, these seven companies here make up 70% of all the contributions to Apache Spark 2.0. And the other thing I want to note is that we're heavily invested in machine learning.
We have been doing machine learning for quite a while, so we're contributing a lot in that space, and in SQL, R, and Python as well.

So Spark has been accelerating machine learning, and here are some of the reasons why. Spark is easy: you can code solutions much faster with Spark. There are fewer lines of code and multiple programming languages, so you can get a Python person up to speed quite quickly, and Java people too. Spark is also agile (and I see the typo in the slide; I should have proofread it): you can build pipelines quickly, it's got a unified API library that's accessible and easy to use, and it supports notebooks, which makes it even more usable. And Spark is really fast: you can iterate and train models quicker, and in-memory processing at scale is just much faster, with that lazy evaluation I talked about and an optimized compute engine. I'm telling you, look at the market right now around machine learning: if you're a developer and you want to get into the data science space, start playing with machine learning, because everybody is looking at machine learning right now.

So what is machine learning? I was a programmer, so for me, everything was programmed explicitly. Machine learning is really the use of data so that computers can act without being explicitly programmed; you're learning from the data rather than hand-coding the rules. You probably all know this, but sometimes it's good to spell it out so everyone understands.

Then there's the process of creating a machine learning model. This is the standard, basic way to do it: you bring your data in, you cleanse and transform it, and then you train a model. You pull out some of that data for training and hold some back for test and cross-validation to get an error rate. You pick a predictor field: here's all my data, here's one field, and I want to see how accurately an algorithm can predict it. So you train against, say, 80% of your data (you can pick any split: 70%, 60%), build the model, and then validate and test it to get an error rate. It's quite a basic process, and the reason I'm walking through it is that we've been putting together application development interfaces that let people do this simply, so they can quickly embrace machine learning and build an endpoint they can deploy, which I'll show you. (There's a rough code sketch of this loop in a moment.)

As for data scientists: while it's the sexiest job, it's also one of the toughest jobs right now. I've known a lot of people who get hired and quickly quit for another place, because a lot of the problem is bigger than the data scientist. There are so many tool sets out there that it's really difficult for them to be successful, and their approach is limited by whichever tools the company has embraced, which has been a challenge. You have fragmented, time-consuming, disjointed environments, so everything takes forever to pull together. And then you have analytics silos: different organizations have different data sets, and it's hard to bring them together. So data scientists have a tough job.
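Here's that train/test/validate loop sketched in code. This is a hedged, minimal example of the process just described, using scikit-learn (which came up earlier) with made-up file and column names; it's the concept, not anyone's production pipeline.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Bring your data in (hypothetical file and columns).
# 2. Cleanse/transform: here, just drop rows with missing values.
df = pd.read_csv("customers.csv").dropna()

X = df[["age", "income", "visits"]]  # features
y = df["will_buy"]                   # the predictor field you're trying to predict

# 3. Hold data out: train on 80%, keep 20% for test/validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Train a model against the training split.
model = LogisticRegression().fit(X_train, y_train)

# 5. Validate on the held-out data to get an accuracy and error rate.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.2f}, error rate: {1 - accuracy:.2f}")
```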
So here comes the sales part. Our mission is really to make data and analytics simple and accessible to all. We built this thing called the Watson Data Platform, and within it we built the Data Science Experience, which I'm going to show you a demo of. Data Science Experience is really a collaborative space that lets you learn about data science, particular algorithms, Python libraries; you can create different analytical assets and analyses, and then you can collaborate with other data scientists. It's a pretty cool environment. What it does is bring together three things. Community: a place for tutorials, data sets, and other things for education. Open source: we bring together Spark, Jupyter Notebooks, Python, R, and we have Shiny and RStudio. And then we added some of our IBM capabilities built up over the years: data shaping, a pipeline UI to simplify the process, automatic data preparation, and advanced visualization. So we pulled all of that together into one platform.

What this has allowed us to do, and my pitch to you, is this: if you're a developer, you really can become more embedded in the team with the data scientists, where you can supplement them, build some of these capabilities, and deploy them out into applications without strictly needing the data science team. You kind of become the data scientist. Now, I've had a lot of clients yell at me when I say application developers can automate what data scientists do. No, you can't, not entirely; it depends on how you view it. I think once you learn to do some of this stuff, you are a data scientist. It's like back in the day, if you're old enough to remember, when everybody was a webmaster. Remember? Everybody was a webmaster. Now everybody's a data scientist, so you're a data scientist.

Any questions before I do a quick demo? Thoughts? I've got a hoodie here. What's the pricing for the tool? So we have a couple of ways we price it. For companies, we have an enterprise license: five seats, so much storage, in the cloud, fully managed; you get a 30-node Spark cluster and pay so much per month. We also have individual plans. Right now it's free: you go out to Data Science Experience, you get a little bit of storage, and you can try it out. We're working on a freemium model, so if you want to add more computation and storage, you put a credit card in and it bills by consumption. So we've got a few models. Our big clients, we often just give it to them and say, here, pound on it and tell us what you think, so we can grow the product, because a big part of what we're trying to do is get people using these open source technologies, embracing them, and putting them into their applications. I'm the global sales leader for this, so I work with OM and Dev and drive our go-to-market. Right now we have quite a few clients using it, but we only went to market with it late last year, in November, and just this month we're releasing our Watson Machine Learning, which is what I'm going to demo, along with some other capabilities. So we're expecting that in Q2 and Q3 we're going to land quite a few deals; right now I think I've got about 40 customers on it.
So it's grown pretty quickly. Let me show it to you. Go to, sorry, datascience.ibm.com, and you'll land right here on this landing page. You just sign up, and right now you get a 30-day trial; eventually that's going to stay free for education and light usage, and as you do larger enterprise projects it'll cost you something. I'm just going to sign in. It's a fully managed environment.

When you land in DSX, you'll see the community page, where you can search articles. I wrote a few labs on Spark, so if you want to learn Apache Spark: lab one is in here, where you use PySpark to create some RDDs and do some quick transformations and actions; it'll teach you about Spark. Lab two, right here, is about querying with Spark SQL, and lab three is all about machine learning. I recommend you go through them. If you look at a lab, you'll see it has a short description, but it also has a notebook with the full write-up as well as code showing you how to actually code with Spark. It's a great tutorial; for some of our clients I'll run them on site too, but you don't really need that, you can do it on your own. Let me go back. And you can search different data sets too.

Did you have a question? It's very stable; a lot of people have Spark in production, but everything's programmed, so what do you mean by changing input? Yeah. So the question was: how do you refresh your models as the data changes? You can add your data to the environment and retrain the model on the new data. We have intervals you can define, so it can retrain every day or every couple of hours, and we're working on more capability here; this is part of the value-add from IBM. Today, with plain Spark, you'd have to create a spark-submit job and rerun it every day, scheduling it with cron or something. We've built some applications around that, but we're also working on capability so that as data flows through in real time it can continue to score and improve, and you can define alerts. Once we have a feedback loop that says, hey, we predicted this, but it actually ended up being that value, we can monitor the error rate, and if it goes outside some KPI, we send an alert. This is all part of our model management strategy: companies are building models and deploying these endpoints on the platform, applications are just calling them, and we've got to keep up with that and monitor them. We're working on a whole strategy around that, it's all coming out this year, and I'll show you some of that capability.
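Just to make that retrain-and-monitor idea concrete, here's a minimal, hypothetical sketch in PySpark: reload the latest data, retrain, and alert if the error rate drifts past a KPI. This is not the actual Watson Machine Learning API, just the shape of the loop you'd schedule with spark-submit and cron today; the paths, columns, and threshold are all made up.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

ERROR_RATE_KPI = 0.45  # made-up threshold: alert if the model gets worse than this

def retrain_and_check(spark):
    # Reload the latest data (hypothetical path; assumes a label column
    # and an already-assembled feature vector column).
    df = spark.read.parquet("s3a://my-bucket/training/latest.parquet")
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    model = LogisticRegression(labelCol="label", featuresCol="features").fit(train)

    accuracy = MulticlassClassificationEvaluator(
        labelCol="label", metricName="accuracy").evaluate(model.transform(test))
    error_rate = 1.0 - accuracy

    if error_rate > ERROR_RATE_KPI:
        # Stand-in for a real alerting hook.
        print(f"ALERT: error rate {error_rate:.2f} exceeds KPI {ERROR_RATE_KPI}")
    return model

if __name__ == "__main__":
    spark = SparkSession.builder.appName("retrain-sketch").getOrCreate()
    retrain_and_check(spark)  # in practice, scheduled as a recurring spark-submit job
```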
Do you want the hoodie? It's all extra large, though; it's too big for you. All right. Did you want it? You had a question. All right, what was your question? So, Spark: I'm working with companies like General Motors, where they have 500 petabytes of data and we're able to run Spark jobs against it, so it's pretty mature when it comes to dealing with large volumes of data. And yeah, for sure, that's why you've got to design your Spark cluster for the size of the data you're going to be dealing with. Your cluster manager can manage some of that and spill some to disk, but you can get memory errors if you blow up a small Spark cluster. We've built some optimizations into Data Science Experience so people don't hit that, but yes, that's still part of it. It's not as mature, I guess, as you're saying, as a relational data warehouse, which won't let you run anything that would blow it up; is that what you mean? Yeah, I agree with you there; it's getting there. Catalyst, the optimizer built inside Spark, lets DataFrames be optimized, and if you write your Spark code sticking with DataFrames, that optimizer will keep improving, I think, and you'll see it. But it's not a database; it's more like an application. Just like in C, if you code an application, you can blow it up. It really is still more an application than a database; does that make sense?

All right, so here, let me show you. For some reason, because of my resolution, my bars are different, but everything we do is under projects. If I go to view projects, we've created a space that allows you to create a project and add collaborators. I have a project here I call Focus 5, Machine Learning in Five Minutes, and I've put a bunch of my stuff in it. Once you're in a project, you'll see it's organized by analytic assets: notebooks, which I'll show you in a moment; models and flows, which are really SPSS flows; data assets, which are data files or data connections we've organized together; bookmarks, which are really articles; deployments, which are endpoints built from a model I created that I want to call through an API; and collaborators, the folks I've given access rights to my project (I created it and gave out editor and viewer access). And then settings: here you have some details around the project, the storage used, and associated services; I have a Spark service in the background on our Bluemix platform, as well as our IBM Watson Machine Learning service, and you could even define a service here if you wanted, like an AWS EMR Spark service, so you have some flexibility. Then you have access tokens, which are for security (I don't deal with that a lot; we've got a whole team for security), GitHub integration, and, I don't know what this one is, to be honest, project scope.

But let me show you something real quick with analytic assets. Are you familiar with TensorFlow? I'm trying to think of which notebooks to show you. Here is an example of deep learning with TensorFlow, and this is what a notebook looks like. When you open a notebook, you'll see it's a space where you can put documentation as well as code, in cells; right here I can create a cell with code. Let me show you what this looks like from scratch. If I add a notebook, I can create a notebook, pick the language (Python, R, Scala, or Python 3.5), pick a Spark version, pick my Spark service, and just create the space. Now I've created a Jupyter Notebook, and in the background it has automatically created an instance on a Spark cluster for me, along with some Python libraries that are installed and ready to go.
You can see the navigation along the top here. If I go here, I can load data files in, and I can also define connections. Here is a data file, and I'm going to take this data file and just go ahead and insert a SparkSession DataFrame, and it'll create the code for me. If I run this, I've now taken that data file and it's running against the Spark cluster (you see the asterisk; it'll come back in a minute), and it'll have a DataFrame available; you can see the Spark job running right here. So it's a cool interface, and it lets you iterate with the Spark cluster and your data and do some data science. It's just a Jupyter notebook, for those who know it, but it adds some simple, easy-to-use capabilities.

The other thing that's nice: this is my notebook, but the Spark environment it runs in is accessible. If you type sc and hit tab, you'll see the Spark context is already available; I don't have to create it, I don't have to connect to it. And I have all these Python libraries: scikit-learn, matplotlib, pandas, they're all available. I've even got pip, so I can do a pip install; for example, on one of my other notebooks I did an install of Google's TensorFlow and then did a style transfer, which I can show you. Any questions so far? No? All right, let me show you what the style transfer looks like. Is anybody familiar with style transfer in TensorFlow? All right, cool, this one's fun. Here's a deep learning example using the TensorFlow library. But really my session is about machine learning, and I've got to get to that next; checking the time, I've got 10 minutes.

The other thing I forgot to mention is that you have the capability to share. I could share some of these notebooks. There's also a scheduler (I'd have to go back to the other one to show it), and you can create chat-like collaboration: I can add comments while I'm sharing. So, for example, you create a project, I go in there and create an analysis as a data scientist, say I'm trying to figure something out; somebody else can look at it and say, I need more data sets, and we can collaborate, share, and create checkpoints with the notebooks. It really becomes like an IDE for data scientists and developers.

But here, in TensorFlow: ignore all the code, treat it as imports, essentially. Basically it's using a deep learning algorithm that takes two pixel images it knows nothing about and merges their styles. As you iterate, after 10 iterations you start to see the new image being created, purely from the algorithm; after 20 iterations it gets better, then 30, 40, 50, and after 60 iterations you'll see, now I have the Incredible Hulk in this style. I took this image, applied this style, and made that. A lot of data scientists nowadays are doing really important work like this, like taking a picture of a cat and making it look art deco. Kidding, but it's a neat use case. Let me show you something a little more practical, sorry. The other nice thing about the Data Science Experience is that it really creates a space for you to load data in and create connections, so it kind of organizes your data for you.
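For reference, the cell that the "insert SparkSession DataFrame" option generates looks roughly like the sketch below. I'm approximating from memory, so treat the exact variable and option names as illustrative; the file name is made up.

```python
from pyspark.sql import SparkSession

# In a DSX notebook the Spark session/context already exists;
# getOrCreate() just picks it up rather than building a new one.
spark = SparkSession.builder.getOrCreate()

# Read the uploaded data file into a Spark DataFrame.
df_data_1 = (spark.read
             .format("csv")
             .option("header", "true")
             .option("inferSchema", "true")
             .load("sales_data.csv"))  # hypothetical uploaded file

df_data_1.printSchema()
df_data_1.show(5)
```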
So I encourage all of you to play with it, start learning, and do the tutorials. But here's what I want to show you, under your analytic assets. If you were to code a machine learning model and deploy it yourself, you'd have to write a fair bit of Spark code; it's not a lot, but you have to write some. Here's a notebook that does a prediction of outdoor equipment purchases, from data we collected through an API on a website. You'll see it imports PySpark, and, I'm going to go quickly through this, you can see the schema of the data, the records you have: product line, gender, age, marital status, and profession, and based on that we want to do predictions. So then you'd create an ML model: a few lines of code. We'd split the data into 80, 18, and 2 percent (I don't know why they picked that split), then train on that data, test, and validate, running it against the Spark ML libraries, which are right here, PySpark ML. Then you can come down, print your schema, and get your prediction accuracy and error rate. So it's more accurate than flipping a coin, about 58% accurate, which is not that great, but it's a good way of showing how you do it. And then, with Watson Machine Learning, you can create a deployable endpoint through our services, so you don't have to redeploy the whole application yourself; it simplifies that for you. (I'll show a rough sketch of that kind of notebook code in a minute.)

But let me show you how to do this simply with a GUI, because we've created a way to do it with a GUI. Here I can take, and this is a sales prediction, here's the pipeline on the left. You can just pick some sales data and drop it in here if you want; within the tool you can just drop files in and they become accessible. So here I've got data, and I'm going to prep it; you'll see it's the same data. I can pick an automatic data transformer, or I can pick from the canned transformers; there are quite a few you can play with. Then I do train, and the way this works is you pick a predictor field (I'm not in edit mode, so I'll just talk through it). Once you pick the predictor field, it tells you an approach; this one says multi-class classification. You pick some estimators, which are really nothing more than the available algorithms, and we'll continue to grow that list; this is in beta right now and it uses all Spark ML. Then at the bottom you'll say, I want a 60 percent train, 20 percent test, 20 percent holdout split, so you don't have to code any of that. Then you go to evaluate, it trains, the run takes a couple of minutes, and it comes back and gives you the results. And, oh, it says poor, but again, you get the same error rate; it was the same data. You'll see that; that's the best part. Once you get to where you're happy, you click save, and I have one saved here (trying to go fast).
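Before we get to the deployment piece, here's a hedged sketch of the notebook-style Spark ML code I was describing: the odd 80/18/2 split, a classifier trained to predict the product line, and an accuracy check. The column names and the choice of random forest are my approximations of the demo, not the actual notebook.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("outdoor_sales.csv", header=True, inferSchema=True)  # hypothetical file
df.printSchema()  # product_line, gender, age, marital_status, profession

# The 80/18/2 split from the notebook: train, test, validate.
train, test, validate = df.randomSplit([0.80, 0.18, 0.02], seed=1)

# Index the string columns, assemble a feature vector, train a classifier.
stages = [StringIndexer(inputCol=c, outputCol=c + "_ix")
          for c in ["product_line", "gender", "marital_status", "profession"]]
stages.append(VectorAssembler(
    inputCols=["gender_ix", "age", "marital_status_ix", "profession_ix"],
    outputCol="features"))
stages.append(RandomForestClassifier(labelCol="product_line_ix", featuresCol="features"))

model = Pipeline(stages=stages).fit(train)

accuracy = MulticlassClassificationEvaluator(
    labelCol="product_line_ix", metricName="accuracy").evaluate(model.transform(test))
print(f"accuracy: {accuracy:.2f}")  # the demo model landed around 0.58
```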
Once it's saved, you'll see it says: look, I have an IBM Watson Machine Learning service, product line, with a random forest, a Spark ML classification algorithm, available. And then I can do a one-click deployment; very simple, I just say I want an online deployment, and what it does is create a REST API that you can access from your applications. So once you have that, you have this scored endpoint that your applications can call, and it can do predictions for you. It's quite powerful and easy to use. (I'll sketch what a call to that endpoint looks like in a second.) And then, if you look here, you can just do a simple test. I'll say: a female, age thirty, single, professional. I run that against the predictor, and it says 68% likely to buy personal accessories. Okay, cool. Now let's say she's married; it predicts differently: less likely to buy personal accessories, and likely to buy other stuff. So it really is a cool way to take pure data in here and get predictions out, and this is in beta; it's coming out in a few weeks.

Yes? Right now we're using all the Spark ML algorithms, but we're working on it; there are some projects out there we could plug in so you can pull in your own algorithms, but most people use the canned ones right now. Yeah, that would be great for academia, for sure. And I'll check, because as we add algorithms to Spark ML, we're also including SPSS algorithms that we have available, ones that are well known and usable, so the set of available algorithms will continually expand.

And then you can see right here, I actually have a GUI: this is just a dashboard that shows the service, where you can show some customers and generate predictions. For Alexander, who's a male, thirty-six, single: he'll buy camping equipment. This is all sample data, really, for an outdoor store; not a great use case, but it's kind of cool, and it's all in a simple Node.js application.

So, with my last minute, any questions? No? I've got more hoodies, three in my room; I should have brought them down. I've got to get rid of them. I fly to San Francisco Thursday, I'm doing a briefing there, and then I fly all the way back home to Tampa, Florida, for my wife's birthday Saturday.
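As promised, here's a rough sketch of what calling that scored REST endpoint from an application might look like. The URL, auth handling, and payload fields are all invented for illustration; the real Watson Machine Learning API defines its own request format.

```python
import requests

# Hypothetical scoring endpoint created by the one-click online deployment.
SCORING_URL = "https://example.com/v3/deployments/<deployment-id>/online"
TOKEN = "..."  # in reality, an auth token obtained from the service

payload = {
    "fields": ["GENDER", "AGE", "MARITAL_STATUS", "PROFESSION"],
    "values": [["F", 30, "Single", "Professional"]],
}

resp = requests.post(SCORING_URL, json=payload,
                     headers={"Authorization": f"Bearer {TOKEN}"})

# e.g. a predicted product line plus class probabilities
print(resp.json())
```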
Yes, absolutely, you can share notebooks; there are a couple of ways. Most people share notebooks; I have a ton of notebooks on GitHub myself. One way: you can download them right here as IPython notebooks. You just hit this download button and it saves a standard .ipynb file, which you can then import into any Jupyter notebook, so it's all the open source format. You can also share it as a URL, right here, that people can look at. Now, you'd have to take it from HTML to PDF yourself, but you can send the HTML.

The other thing that I forgot to show you is that there are a lot of cool visualizations. I promised David I'd mention it: he's got a talk tomorrow at eleven on PixieDust, a visualization library, and he's going to go much deeper into visualization libraries in notebooks. But you can see here an example of Brunel; these are quite extensive, and you can build a lot of these visualizations right in Jupyter notebooks. We've actually open sourced the Brunel libraries, which came out of Cognos, and PixieDust is another library we've open sourced for visualizations in Jupyter notebooks. So you can build these visualizations and then share them as HTML pages, which is quite interesting. The nice thing about it is that you have the code that generated the result together with the visualization of the result; that's the power of these notebooks within the Data Science Experience. And all of the computation and analysis happens on the back end, in the Spark cluster.

Did you have another question? Yeah, I'm out of time. Thank you, guys.