Hi, I'm Saveen Reddy. I'm going to show you how Azure Data Lake and Visual Studio work together to make big data easy. Azure Data Lake is made up of three Azure services. The first is Azure HDInsight, a generally available service for running Hadoop clusters in Azure on Windows or Linux. Then there are two new services currently in public preview: Azure Data Lake Analytics, a highly scalable, on-demand query service, and Azure Data Lake Store, a hyperscale, HDFS-compatible store optimized for performance on big data workloads.

Now, what we've heard from customers is that doing big data is challenging. It's very hard to author, monitor, optimize, and debug. We've taken our years of experience building a massive, exabyte-scale big data system that thousands of developers use every day, and we're bringing that to you in the form of Azure Data Lake with two new services, Store and Analytics. In the following demo, we'll focus on how Data Lake Tools for Visual Studio makes it easy to see performance hotspots, and how to trade off cost against query execution time. Let's take a look.

What we're looking at here is Visual Studio 2015, and I've installed Data Lake Tools for Visual Studio, as you can see in the menu. What I have open is an actual big data analytics job that I ran a week or two ago. This job, as you can see here, took about 40 minutes to run, and you can see from its state that it succeeded. Let's scroll down and learn a little more. This job read about 11 terabytes and wrote about 10 terabytes. That isn't particularly large; big data analytics jobs can process terabytes, tens of terabytes, hundreds of terabytes, and even petabytes.

Now, I wrote this job in U-SQL, and I'll show you that in just a second. The key thing is that whatever I wrote, the U-SQL compiler and optimizer split the total work into about 12,000 chunks, which we call vertices. When I submitted the job, I asked to reserve 1,000 units of parallelism, or 1,000 nodes. In short: try to do those 12,000 things 1,000 at a time, if possible. Those 1,000 nodes were spun up for me just as the job started, the job ran, and the nodes were deallocated when it finished. I only paid for as much time as it took to run the job, so I paid for 1,000 nodes for about 40 minutes.

Now, let's take a look at the U-SQL script. I'll scroll down and click on Script. I won't spend too much time here, just point out a few things. First of all, it references an assembly. This is just some C# code we wrote and registered in the U-SQL catalog; it contains helper functions and other pieces of code that we reuse. You can see that U-SQL has C# expressions embedded right in the language, and the rest of the script simply reads some data, transforms it using code from that assembly, and writes it out.

What you're not going to see in this script is anything about clusters or nodes. You write these U-SQL scripts as if you're writing for one machine, and you let Azure Data Lake Analytics take care of parallelization; it happens automatically. All you have to say is how much parallelism you'd like to reserve, which is 1,000 in the case of this job. The script is a purely logical description of how the input data is transformed into the output data. Again, it says nothing about clusters or nodes or anything else.
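The script from the demo isn't reproduced in this transcript, so here's a minimal sketch of what a U-SQL job with that shape could look like. The assembly name, file paths, schema, and helper method below are all hypothetical stand-ins; what's real is the pattern: reference an assembly from the catalog, extract rows, transform them with embedded C# expressions, and output the result, with nothing about clusters or nodes anywhere.

```usql
// Hypothetical U-SQL sketch; names and paths are illustrative, not from the demo.
REFERENCE ASSEMBLY MyHelpers;            // C# helpers registered in the U-SQL catalog

// Read the input. The schema and file path are assumptions for this sketch.
@input =
    EXTRACT userId    string,
            eventTime DateTime,
            payload   string
    FROM "/input/events.tsv"
    USING Extractors.Tsv();

// Transform rows using C# expressions embedded directly in the language,
// including a call into the referenced assembly.
@cleaned =
    SELECT userId.ToLowerInvariant() AS normalizedUser,         // plain C# expression
           eventTime.Date AS eventDate,
           MyHelpers.Parsing.ExtractDomain(payload) AS domain   // hypothetical helper
    FROM @input
    WHERE payload != null;

// Write the result. Nothing here mentions parallelism;
// the degree of parallelism is chosen at job submission time.
OUTPUT @cleaned
    TO "/output/cleaned_events.csv"
    USING Outputters.Csv();
```

The only place parallelism appears is outside the script, as the number of units you reserve when you submit the job.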
That's the logical plan, and the green diagram on the right is what the compiler and optimizer produced: the execution plan, what we sometimes call the physical plan. This is what actually happened. Now, it may not make much sense at first glance. You probably get the idea that there are inputs and outputs being transformed here, but the key thing this tool tries to help you with is optimization: how do you spot where to focus your time? The simplest tool we have for that is this Play button.

As those 12,000 vertices ran, we recorded every time each one started, ran, and stopped. That recording is called a job profile, which we load here. What you're seeing is not a movie; it's a visualization of what actually happened in the job. Just by watching this 30-second animation, which, again, is not a video, you can get an intuitive sense of where time is spent in your job. It wouldn't matter whether the job took a day or 15 minutes; it always plays back in 30 seconds.

Sometimes you want an answer even faster than 30 seconds: where are the hotspots? We have built-in heat maps to help you understand that. For example, if I click on the display selector here, currently set to Progress, I can switch among a number of different ways of slicing this. I'm going to pick execution time. Now the colors tell us something: the redder a stage, the more time is relatively being spent there; the bluer, the less. So we can see where more time is spent, in stages like this one, this one, and this one, wherever it's redder. I'll zoom in, and let's try to explain what it means a little.

There are three stages here; they've been compressed, but the key thing is that of the 12,000 vertices, about 935 ran in this stage, each one took about 50 seconds on average, and the stage read something like 200 million rows. The way to think about this is: with 1,000 units of parallelism, all 935 of those vertices can run at once, so I could have done this entire stage in about 50 seconds.

Let's zoom out and think about that for a second. Should I pay for more parallelism or less? Will it be worth it? Will the job actually get faster, or will I be wasting parallelism? Fortunately, we have some very clever developers who are used to helping people analyze performance for these big data jobs, and we've baked in their expertise. Without you even asking, we've already run these diagnostics, and this job doesn't really have any particular problems. But let's go over something quite interesting about this job, which is this one item called resource usage. It's green, so there are no real problems here, and the question is: can we make the job better?

I'm going to turn off some of the debug information and focus on just two things. The blue line is set at 1,000, because that's the 1,000 units of parallelism I asked for, the 1,000 nodes. The red line shows how many of them are actually being used. Everything under the red line is in use; everything between the red line and the blue line is waste. So we can already see that while the job took around 35 to 40 minutes, a substantial portion of those nodes sat unused during that time. This job is a little wasteful. Could we make it better? Well, we have a usage modeler here. I'm going to click on that, and it has analyzed the job based on what happened, plus some predictions.
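Before looking at the modeler's numbers, here's a rough back-of-envelope model (my framing, not something stated in the demo) for how parallelism trades against time. A stage with $V$ vertices of average duration $t$, run at parallelism $N$, completes in roughly the number of waves times the vertex duration:

$$T_{\text{stage}} \approx \left\lceil \frac{V}{N} \right\rceil \cdot t$$

For the highlighted stage, $V \approx 935$ and $t \approx 50$ seconds, so at $N = 1{,}000$ everything runs in a single wave and the stage finishes in about 50 seconds, while at $N = 100$ it would take about $\lceil 935/100 \rceil \times 50 = 500$ seconds. What you pay, on the other hand, scales roughly with $N \times T_{\text{job}}$, the reserved node-time, whether or not the nodes are busy. That's exactly the tradeoff the usage modeler quantifies.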
It's not completely perfect, but it'll give you a really good idea of where to start optimizing and experimenting. What it's telling me here is that if I had picked 2,425 units of parallelism, this job would have completed in 1,400 seconds. That's the fastest it thinks the job can run, but notice there's a tremendous amount of waste: the area above the blue line. Let's play with the number and see if we can do better. What we want is for the blue line to be high, so there's less waste, but we know we're going to trade off time for that. Let's pick 500. This is much better; there's much less waste, but the job will now take about 2,100 seconds. How about 100? Now there's very little waste, but the job takes about 8,700 seconds. Roughly speaking, the reserved node-seconds drop from about 2,425 × 1,400, around 3.4 million, to about 100 × 8,700, around 0.9 million, so you pay roughly a quarter of the compute for about six times the wall-clock time. It's tools like this that we're baking in to help you understand whether you want to pay more and whether it's worth it.

So I hope you've enjoyed this demo, and I hope it shows you how we've made big data optimization easy for developers. I hope you're as excited about Azure Data Lake as we are after watching that video. We have some resources for you. You can go to Channel 9 and watch a series called Azure Data Lake and a show called Data Exposed; there are many great videos there for you. You can go to GitHub at github.com/Azure/AzureDataLake, where you'll find links to everything you need. If you want to give us feedback, suggestions, or feature ideas, go to aka.ms/adlfeedback and enter your feedback on our Azure UserVoice site. And finally, of course, install Visual Studio and Data Lake Tools for Visual Studio. Thanks for your time. I appreciate it, and I hope you have a great time getting into big data.