Hello everyone, welcome to Bringing Big Data Analytics through Apache Spark to .NET. I'm Bridget Murtagh, and I'm a program manager here at Microsoft on the .NET team.

So let's start off with: what is Apache Spark? Big data means that there's an increasing volume, velocity, and variety of data. Take, for instance, a factory. There can be thousands of Internet of Things sensors in a factory, each producing petabytes of data. Now, while it's great to have that much data, so we can understand how our factory is performing and find ways to improve the equipment, how can we actually process it all when we have that much? And more than that, how can we process it all quickly and efficiently? Well, welcome to the world of Apache Spark.

So what is Apache Spark? Apache Spark is a general-purpose distributed processing engine for analytics over large datasets, typically terabytes or petabytes of data. Put a little more simply, Apache Spark is a great tool we can use to analyze a large amount of data in a quick and easy-to-understand way, so we don't have to be data science experts to understand or use it.

There are quite a few different things we can do with Apache Spark that are all super interesting and exciting, but just to touch on a few of them: one is Spark SQL, which means analyzing data that's structured in some way, maybe data from a CSV or from a database. There's also Spark Streaming, which means analyzing data in real time as it's being produced. So in our factory example, it means analyzing data live as it's coming from those IoT sensors, so we can detect if there's maybe a malfunction and address it right away. There are also machine learning capabilities with Apache Spark, so you can combine the powers of big data and ML to scale and have faster, more efficient training and prediction with machine learning algorithms.

To understand how Apache Spark works, there are only three main components we really need to look at. The first one is the driver. The driver consists of the user's program: for instance, if you wrote a C# console app, that would be part of the driver. The driver also contains a Spark session. What the Spark session does is take that user's program, for instance that C# console app, and divide it into smaller pieces known as tasks. Those tasks are divided amongst our second component, the executors, or worker nodes. The executors, or workers, live on something known as a cluster. Each executor takes one small task, one small piece of our user's program, and finishes executing it. The third component of our architecture is the cluster manager, which helps with dividing up the tasks and allocating resources amongst our driver and executors.

So how can I use Apache Spark? It sounds super great and super useful, so how can I get started with it? There are different APIs that are popular with Spark, written in languages like Scala, Python, Java, and R. But up until this point, there weren't any .NET APIs for Spark. So what if I wanted to use Apache Spark combined with my pre-existing .NET knowledge, or an extensive code base and business logic? Well, we now have an awesome tool we can all use, and it's known as .NET for Apache Spark. .NET for Apache Spark is a free, open-source, and cross-platform big data analytics framework.
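Put in code, a driver program can be as small as this sketch (assuming the Microsoft.Spark NuGet package is installed; the app name and row count are just illustrative):

```csharp
// A minimal driver program: the console app is the driver, and the
// SparkSession it builds is what splits work into tasks for the executors.
using System;
using Microsoft.Spark.Sql;

class HelloSpark
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("HelloSpark")   // illustrative app name
            .GetOrCreate();

        DataFrame df = spark.Range(0, 1000);   // a trivial 1000-row DataFrame
        Console.WriteLine(df.Count());         // the count itself runs on the executors

        spark.Stop();
    }
}
```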
It allows us to reuse the knowledge, skills, and code we already have as .NET developers. So anywhere you have an extensive C# or F# code base, you can now go ahead and introduce big data analytics within it. .NET for Apache Spark is also designed for high performance. The overall goal for .NET for Apache Spark is to give .NET developers a first-class experience when working with Apache Spark.

We've had several customers express a lot of interest and actually see success with .NET for Apache Spark. One of them is the Microsoft Search, Assistance, and Intelligence team, who are working towards modernizing workspaces in Office 365. Their job is to work with different ML models on top of substrate data to infuse intelligence into Office 365 products. Their data resides in ADLS and in turn gets fed into their models. The reason they were looking towards .NET for Apache Spark was that a lot of their business logic, such as their different featurizers and tokenizers, was written in C#, so it would be ideal to be able to do big data analytics while staying within the .NET ecosystem. So far, their experience has been extremely promising and stable, and they really love the vibrant open-source big data analytics ecosystem within the .NET community. The scale of their jobs has been about 50 terabytes, so quite a bit of data, and they've really started seeing success with it.

So now that we've seen a little bit about what .NET for Apache Spark is and why it's such an exciting new solution for us, let's take a look at a few different scenarios we can complete and some really exciting applications we can build using .NET for Apache Spark.

One of the most fundamental big data apps is batch processing. So what is batch processing, or what is batch data? Batch data means we're working with data that's already been stored. For instance, we could be doing something called log processing, which means looking at and gaining insights from logs from maybe a website, a server, or a network of some sort, so we can understand what actions our users are taking or which pages of our website are the most popular. We can also do data warehousing, which means taking in data from a variety of different sources, maybe all stored in Azure Storage, and then performing large-scale analysis on it to gain meaningful insights.

As an example today, we're going to take a look at some GitHub projects data. You can see in this snippet of the data that our projects data includes the URL of each project, the author, a description, what language it's in, things like that. We want to know, on average, how many forks each language has, and the number of times each project has been forked is represented by that column H there.

So let's go ahead and take a look at our first coding example with .NET for Apache Spark. I'm going to open up Visual Studio 2019 here. We can see that I'm just dealing with a C# console application I've already created, and I've already installed the Microsoft.Spark NuGet package. There it is. Cool. We can see that it's installed because I have these jar files over here in the Solution Explorer. At the top, I'm using the Microsoft.Spark.Sql namespace and its Functions, because, as I mentioned, Spark SQL helps us work with structured data. If I'm reading in GitHub projects data, that data does have some sort of pattern or structure to it.
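As a rough sketch, the project setup being described looks something like this, with a hypothetical project name and the package added via the NuGet tooling (for example, dotnet add package Microsoft.Spark):

```csharp
// Skeleton of the batch app: a plain C# console project with the
// Microsoft.Spark NuGet package installed.
using Microsoft.Spark.Sql;                    // SparkSession, DataFrame
using static Microsoft.Spark.Sql.Functions;   // Avg, Desc, other Spark SQL functions

namespace GitHubSparkBatch   // hypothetical project name
{
    class Program
    {
        static void Main(string[] args)
        {
            // The pipeline walked through below goes here.
        }
    }
}
```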
So I want to use Spark SQL. To start off, in my Main method here, the way we start any .NET for Apache Spark app is by creating a Spark session; that's what divides our program into the smaller tasks that get distributed amongst the executors. You can see I've created a Spark session called spark. I went ahead and built the session, and I just called my app GitHub and Spark Batch, a pretty appropriate name.

After doing that, the next step we typically want in our apps is to actually read in our data. I have our data stored in a CSV, and we want to read that CSV into an object called a DataFrame. A DataFrame is the basic object we store our data in when working with structured data in Spark. If I open up this region here, we can see I'm working with a DataFrame I've called projectsDf, to stand for my projects DataFrame. I call the Read method, and then I also call Schema, which means I'm describing whatever pattern my data has. For instance, I know my data has an ID column, a URL, and an owner ID, and I also know the type of data stored in each, whether it's an int or a string or something like that. It's a rather long schema, because I do have quite a few columns. Then I can call .Csv, since I know my data is stored in a comma-separated values file. Another popular method that's good to use is .Show, which allows us to actually print that DataFrame to the screen.

So I continue on. Another popular step when working with batch data is data prep, or data preparation. That means we're cleaning up our data: if there are some null or missing values, or some extra values we don't need, maybe a few extra columns, we can remove those so our data is easier to read and easier to work with. One of my first steps of data prep was using the DataFrame's Na (null-handling) functions. With those, I chose to drop any rows that have missing or null values, so that when I perform calculations later, I'm not accidentally trying to calculate on a missing value. Also as part of my data prep, I chose to drop a couple of columns that I don't think will be important for my final calculations: the ID, the URL, and the owner ID columns.

After doing some data prep, we can actually perform the functionality we wanted, which was finding, on average, which languages have been forked from the most often. The first thing I wanted to do was group my projects by language. What I've done here is create a new DataFrame that represents my grouped data, and I called the GroupBy method, which allows me to choose which column of my data I want to organize, or group, by. I chose the language column. Then, whenever I do a GroupBy, I also need to call aggregate, or the .Agg method. Aggregate allows me to perform some sort of function across every row, every entry, of my data. In my case, I took the average of the forked_from column. So essentially, I'm grouping by language and then finding, on average, how many times each language has been forked.

Then finally, I don't want to just display my data as is; I want to do one final step to make it a little easier to understand and read. So I've chosen here to order my DataFrame in descending order.
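Putting those steps together, the batch pipeline might look like this sketch; the schema, file name, and column names are assumptions based on the dataset described, so the on-screen demo may differ slightly:

```csharp
// Build the session that splits the program into tasks for the executors.
SparkSession spark = SparkSession
    .Builder()
    .AppName("GitHub and Spark Batch")
    .GetOrCreate();

// Read the CSV into a DataFrame, describing the pattern the data has.
DataFrame projectsDf = spark
    .Read()
    .Schema("id INT, url STRING, owner_id INT, name STRING, " +
            "descriptor STRING, language STRING, created_at STRING, " +
            "forked_from INT, deleted INT, updated_at STRING")   // assumed schema
    .Csv("projects.csv");                                        // assumed file name
projectsDf.Show();

// Data prep: drop columns we don't need, then drop rows with null values.
DataFrame cleanedProjects = projectsDf
    .Drop("id", "url", "owner_id")
    .Na()
    .Drop();
cleanedProjects.Show();

// Group by language, average the forked_from column, most-forked first.
DataFrame groupedDf = cleanedProjects
    .GroupBy("language")
    .Agg(Avg(cleanedProjects["forked_from"]))
    .OrderBy(Desc("avg(forked_from)"));
groupedDf.Show();
```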
That way, I have the top-forked languages at the top of my DataFrame, so I can see those first. Then a final good step is to stop our Spark session, so I went ahead and called spark.Stop, just to clean up resources and make sure everything finishes executing correctly.

Okay, so now I have a few steps here that I'll need to be able to build and run my program. One of the steps in working with a .NET for Apache Spark app is making sure we have one of our environment variables set correctly. There's a DOTNET_ASSEMBLY_SEARCH_PATHS variable, and we want to set it to my app specifically, so in this case the batch project, then the bin\Debug folder, and then whatever version of .NET Core app you're using.

One other thing you can check is the level of logging in your output. If I open this up, there's a file called log4j.properties, and here I've set the logging level, so whatever is going to be output to my console, to the error level. Rather than displaying extraneous warning, info, or debug messages, I'm only going to display messages that are actually errors, which helps make sure my console output isn't too confusing or crowded.

So now it can actually be time to build and run our program. Fortunately, I've already done that for us here to save some time. I'm going to open up the terminal (not that one; that one will be later). Okay. Here, I moved into my batch directory using cd batch, and then I built my project using dotnet build. We can see that the build succeeded.

Now let's see how we actually run a .NET for Apache Spark app. We use something called spark-submit, or the spark-submit command, and every time we use spark-submit, there are a few components to it. We say spark-submit, then we reference the .NET runner; specifically, we reference one of those jar files. In our case, we're using Apache Spark version 2.4 and .NET for Apache Spark version 0.4.0. And then we also want the path to our app's DLL, so it can actually run correctly.

So after running spark-submit, let's see how our program did. First, we're expecting to see the DataFrame with our CSV data in it, just the raw GitHub projects data. Let's take a look. All right, we can see it here. It has all the columns we expected, but we can also see that the data is kind of overlapping with itself; there are a few too many columns, so updated_at is all the way over here instead of continuing to the right. And there are also a lot of null and missing values. So it seems like it was definitely a good idea to do that data prep. If I scroll down, we can now look at our data prep result, and this DataFrame is a lot easier to read and understand. The data doesn't overlap with itself, and all the data actually exists; there are no longer all those null values, so it's a lot easier to work with. Next, we can see the output from calculating the average number of times each language has been forked. Let's take a look. We can see we have a column here for language, a column here for the average number of times it's been forked, and it all looks correct.
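For reference, the build-and-run steps described here take roughly the following shape; the paths, .NET Core version, and app DLL name are placeholders that will vary with your setup, and the runner class name matches the 0.4-era documentation:

```
REM Point the workers at the app's build output (Windows cmd syntax; use export on Linux/macOS).
set DOTNET_ASSEMBLY_SEARCH_PATHS=C:\<path-to>\batch\bin\Debug\<netcoreapp-version>

cd batch
dotnet build

REM Run through spark-submit, referencing the .NET runner jar and the app's DLL.
spark-submit ^
  --class org.apache.spark.deploy.DotnetRunner ^
  --master local ^
  bin\Debug\<netcoreapp-version>\microsoft-spark-2.4.x-0.4.0.jar ^
  dotnet bin\Debug\<netcoreapp-version>\mySparkBatchApp.dll
```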
And we can see that it did sort in descending order, because the languages at the top have been forked, on average, more often than the languages down here. It's also worth noting that for all of those DataFrames, Show only displays the top 20 rows. This is really useful, because when we're working with terabytes and petabytes of data, we don't want to be stuck trying to show entire DataFrames and have it take forever or become too confusing and crowded on the console.

Okay, so we have successfully run our first .NET for Apache Spark app. Let's go back to our presentation. We've already done that demo, so let's move on to our next scenario, which is combining machine learning with big data. When we combine machine learning with big data, it means we want to scale the training and prediction of machine learning algorithms. One great framework we can use for the machine learning side is ML.NET, a free, cross-platform, open-source machine learning framework.

In the example we'll be looking at, we're going to perform sentiment analysis, which means that, given a piece of text, we want to determine whether it represents something positive or something negative. In our case, we're going to analyze a set of online reviews, and we want to know which are positive and which are negative. If we were given a review such as "I love .NET for Apache Spark," that would be considered positive, and we might see either a true or a 1, depending on whether we're using a Boolean to represent positive or negative sentiment. If we saw a statement like "I hate running inefficient big data queries," that would be considered a negative sentiment.

So let's take a look at our sentiment analysis demo, where we combine ML.NET and .NET for Apache Spark. Okay, I've opened up Visual Studio 2019 once again. In this case, when I look at the NuGet packages I've installed, I haven't only installed Microsoft.Spark; I've also installed Microsoft.ML, which is the NuGet package we need to use ML.NET. And at the top here, I have using statements related to both ML.NET and .NET for Apache Spark: we can see Microsoft.ML and Microsoft.ML.Data, and we can also see Microsoft.Spark.Sql.

If I scroll down, we can see that, just like in the batch example, we start off by creating a Spark session for our program; I've just given my app a different name compared to the batch app. The next part is also similar to our batch example, since we're still technically working with batch data; we're just taking it a step further by also performing machine learning. In our case, we want to read our review data into a DataFrame. I have some Yelp reviews in a yelp.csv file, and I've also set a few options for my DataFrame. For instance, I know my data has a header: the two columns in my data, the review text and whether it's a positive or negative review, are labeled, and I don't want Spark to treat that header as part of the data, because it could throw off my results. Then I just called Show, so I can see my raw review data as is, before we go ahead and actually predict using ML.NET.
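A header-aware read like the one just described might look like this sketch; yelp.csv is the file from the demo, while the variable names are assumptions:

```csharp
// Inside Main, after creating the Spark session: read the reviews,
// telling Spark the first row is a header rather than data.
DataFrame reviews = spark
    .Read()
    .Option("header", true)
    .Csv("yelp.csv");
reviews.Show();   // raw review data, before any ML.NET predictions
```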
So now it's time for the fun part, where we can actually start combining machine learning with big data. How would we actually start calling the ML.NET code? We can do that using something called a UDF, or user-defined function. UDFs are a popular solution for performing some sort of function on, let's say, each row in our DataFrame. If I open this up, we can see that we create a new UDF by calling the Udf method and then Register. Within the angle brackets, I have string to represent the input I'm working with, which is the text of the reviews, and Boolean to represent my output, which is a true or false for positive or negative sentiment. I've decided to call my UDF MLudf, and what I'm doing within this function is passing the text into a method called Sentiment.

Now you may be asking: what is the Sentiment method? Where do we create it? What do we do within it? If I scroll down here, we can see that Sentiment actually contains our machine learning code, the code that was generated by ML.NET. I got this code, and also trained my sentiment analysis ML.NET model, using something called Model Builder. Model Builder is a really useful UI tool within Visual Studio that helps us train and work with machine learning in a much easier-to-understand way. Just to see what Model Builder looks like: if I right-click on my project and choose to add machine learning, I can see that within here I can choose a scenario, things like issue classification, sentiment analysis, or price prediction. In my case, I would have chosen sentiment analysis. Then I just choose my input file, so my input review dataset to train on, and ML.NET does all of the training for me and generates some really awesome code.

So if I go back here, this was code generated by Model Builder using ML.NET. What it's doing is essentially creating a way to start predicting: it loads the ML model that was trained and created, and then it creates a prediction based on whatever string I pass to it. And down here, I've created classes to represent my review data, since I do need to work with that when I'm using ML.NET.

Okay, so now that I've created a function where I can call the ML.NET code, I want to actually call that function. There's this really neat functionality in .NET for Apache Spark where we can execute SQL queries, so if you're familiar with SQL syntax at all, we can have those SQL queries within our code. In my case, I've selected column 1, which represents my input review text, and then passed column 1, so passed each review, to my ML.NET method. Then I can essentially just print that out, and I called Show so I can see the output of my operations.

Then, just like in the batch app, you'll want to make sure you set your environment variable correctly, to that bin\Debug\netcoreapp folder, and then we can go into our app's directory and build and run it. So let's take a look at how that came out. Okay, you can see here that I built my project and the build succeeded, and then I ran spark-submit using pretty much the same kinds of parameters; in this case, it just had to point to my current app's DLL. So I scroll down. The first DataFrame we see here is just the raw review data: these are all the reviews that were in my Yelp dataset.
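Pulled together, the UDF registration, the SQL query, and a Model Builder-style Sentiment method might look like this sketch; the column name ReviewText, the model file MLModel.zip, and the class shapes are assumptions rather than the exact generated code:

```csharp
// Requires: using Microsoft.ML; using Microsoft.ML.Data;

// Inside Main: register a UDF named "MLudf" that takes a string
// (the review text) and returns a bool (positive or negative sentiment).
spark.Udf().Register<string, bool>("MLudf", text => Sentiment(text));

// Expose the reviews DataFrame to SQL and run the UDF over each review.
reviews.CreateOrReplaceTempView("Reviews");
DataFrame predictions = spark.Sql(
    "SELECT ReviewText, MLudf(ReviewText) FROM Reviews");   // ReviewText is assumed
predictions.Show();

// Sentiment wraps the Model Builder-generated ML.NET prediction code:
// load the trained model and predict on the string passed in.
static bool Sentiment(string text)
{
    var mlContext = new MLContext();
    ITransformer model = mlContext.Model.Load("MLModel.zip", out _);   // assumed file name
    var engine = mlContext.Model.CreatePredictionEngine<Review, ReviewPrediction>(model);
    return engine.Predict(new Review { ReviewText = text }).Prediction;
}

// Classes representing a review and the model's prediction output.
public class Review
{
    [LoadColumn(0)] public string ReviewText;
}

public class ReviewPrediction : Review
{
    [ColumnName("PredictedLabel")] public bool Prediction;
}
```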
And this is the true answer for whether each is a negative or a positive sentiment. We can see that when someone loved something, that's represented with a 1, meaning it was a positive sentiment, and when someone said something was not good, that's a 0, so a negative sentiment. The next DataFrame we're going to see represents the prediction from our ML.NET code, so let's scroll down and see. We can see we're dealing with the same reviews, but now this is the predicted sentiment. It looks like it was pretty accurate: when someone loved something, it was predicted as true, so positive, and again, when something was not good, it was false, so negative sentiment. So you can see we were successfully able to combine .NET for Apache Spark and ML.NET.

Let's go back here. Now that we've done that demo, we have one final quick scenario to go through, and this is structured streaming, or real-time data analysis. In structured streaming, or real-time analysis, we're working with live data: data that's coming in from, say, a sensor, like an IoT factory sensor, a phone, or a network. Structured streaming uses the principle of micro-batch processing. Essentially, it takes our continuous stream of data and divides it into smaller chunks, so maybe every five seconds represents a new batch. It can then perform functionality on each of those smaller batches and append each result to a table that already exists. So if, let's say, another five seconds passes, we'll have another batch: perform functionality on it, append it; another batch, perform functionality on it, append it, and so on for as long as our stream exists.

In the quick demo I'll show you here, we can do live, real-time sentiment analysis, still using .NET for Apache Spark with ML.NET. We'll see that if I type a string into a console, we can determine in real time whether it represents a positive or a negative sentiment. So let's take a quick look at that demo.

Okay, so I still have the Microsoft.ML and Microsoft.Spark NuGet packages installed. I start off creating a Spark session, but instead of reading into a DataFrame from a CSV or stored data, I'm now reading a stream, and I have to set up the host and port information my stream is coming from. Then I still use ML.NET with a UDF, and I can still call the ML.NET code in a very similar way. Finally, as I work with the data and display it, I use something called a streaming query: I call WriteStream and specify that I want to write my stream to the console. And instead of calling spark.Stop, we call query.AwaitTermination.

If I go over here, you can see that I've set up a quick netcat terminal, just an easy way to write to or read from a network connection. For instance, I could write something like "I love Spark". And in my other tab over here, I've already built and run my .NET for Apache Spark app. If I scroll down, you can see I've been working with my different batches here: every time I hit Enter, it's considered a new batch, and the app determines in real time whether my line was a positive or negative sentiment. You can see that when I said "I love Spark," it was considered true, a positive sentiment. So that's awesome; we have real-time streaming working.
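The streaming version might look like this sketch, assuming netcat is listening locally (for example, nc -lk 9999) and that the MLudf UDF has been registered as before:

```csharp
// Requires: using Microsoft.Spark.Sql; using Microsoft.Spark.Sql.Streaming;
//           using static Microsoft.Spark.Sql.Functions;

// Read a stream of lines from a socket (fed here by netcat);
// host and port are assumptions matching a local netcat session.
DataFrame lines = spark
    .ReadStream()
    .Format("socket")
    .Option("host", "localhost")
    .Option("port", 9999)
    .Load();

// Apply the registered sentiment UDF to each incoming line.
DataFrame predictions = lines.Select(
    lines["value"],
    CallUDF("MLudf", lines["value"]));

// Write each micro-batch to the console and block until the stream ends.
StreamingQuery query = predictions
    .WriteStream()
    .Format("console")
    .Start();
query.AwaitTermination();
```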
Okay, so a couple of quick steps for how you can get started with .NET for Apache Spark. If you go to the .NET website, dot.net/spark, you can read even more about .NET for Apache Spark. You can go through a really neat getting started tutorial we have, so you can get up and running with .NET for Apache Spark on your local machine in ten minutes or less, and you can also visit our docs and see some other learning resources we have. You can also visit our GitHub, github.com/dotnet/spark, where you can view some of the documentation, see how things are implemented, and participate in the open-source community around Spark. So thank you so much, and now I guess we'll turn to questions.

Fantastic, so that was an amazing presentation, Bridget. Now help me out, because I am a little dense. Spark isn't like a database, is it? Is it where the data is stored, or is it a medium for transferring data over? Help me out to set the context for these questions.

Sure, yeah. So Spark isn't a database, so it's not where the data is stored. You'll already have your data stored somewhere, like in Azure, maybe in Azure Data Lake Storage or in a blob or something like that. Spark is actually kind of like the framework, or the tools, we can use to start analyzing that data. It allows us to read it in, to process it more quickly, and to make different calls to it so we can gain meaningful insights from it.

Amazing, so it's like the pipe, then, that takes data from any scenario and moves it over. Is that right?

Yeah, yeah, that's a good way to think of it.

Fantastic. And so you mentioned a lot of stuff with .NET; we didn't have .NET-isms for Spark. When did this start? And, because you pointed out tons of cool things, if you were to tell people one thing, look at this first and see why it's powerful, what would you suggest to them?

Okay, so we first started this project towards the beginning of this year, so I would say around April, even though we did have a predecessor to this project a few years ago. But yeah, I would say as of this year is when we actually started having these awesome .NET bindings to Apache Spark. And for something really awesome, I would say check out that landing page, dot.net/spark; you can see that you can really start processing terabytes and petabytes of data at a manageable scale. So you don't have to spend days and weeks and months processing all this data; you can actually start gaining insights from it in a matter of hours.

That's pretty cool. Does .NET for Apache Spark support using F# instead of C#?

Yes, I believe it does, as does the .NET ecosystem in general.

Fantastic. Well, this is amazing, and thank you so much, Bridget. Now, here's a couple of things, though, before we go. I want to remind everybody about, I'm not finding it, but the actual tech challenges that are available. Let me get my handy notes out here, because they're pretty good. And if you have more questions for Bridget, make sure you get them in. Make sure you participate in the technical treasure hunt; it's happening all day. Tons of technical problems that you can solve, maybe even do a little bit of code. If you solve all of them, you will get a ton of wonderful prizes. It's pretty cool: go to .net.com front slash party. Sorry, they're talking in my ear.
You'll be able to see all the cool things, and there's a ton happening today. Make sure you check out the Apache Spark coolness for .NET, which I think is pretty cool. We're going to go to a commercial break here in a second, and after that, we're going to have more serverless with Jeff Hollan, who's going to talk about Durable Functions 2.0: serverless actors, orchestrations, and stateful functions.