Hi, I'm Michael Rys. I'm a principal program manager on the Big Data team, responsible for U-SQL, and it is my pleasure today to give you an introduction to U-SQL for the C# developer.

The first question we ask ourselves is: why U-SQL? There are already a lot of big data processing languages out there, and U-SQL is the new processing language for big data that we are shipping as part of the Azure Data Lake offering, in the Azure Data Lake Analytics job service. So why a new query language? Why don't we use the existing SQL dialects, or one of the existing programming-language-based approaches? Well, let me look at some of the characteristics of big data analytics use cases, and then use those both to assess the current languages and to motivate why we need a new one.

If you look at the use cases for big data analytics, such as botnet analysis in a digital crimes unit (where you analyze web logs with complex algorithms), feature extraction and classification for image processing, or shopping-basket analysis to drive shopping recommendations, all of these are big data analytics and data preparation use cases with essentially the same requirements. They need to process any type of data, structured or unstructured. They need custom algorithms that let them analyze the data in a deep, domain-specific way. And they need to scale that processing to any data size and be efficient while doing so.

If you take these requirements and look at the languages that exist in the ecosystem today, you will notice that there may indeed be a need for a new language. Take the SQL-based languages for big data, such as the various SQL dialects, Hive, and so on.
They are really good at providing an abstraction that makes the system take care of parallelization and scaling, so that you as a programmer don't have to. On the other hand, the extensibility of adding your own user code is very limited, and it is also hard to work with data that is not fully structured. Yes, you can do schema-on-read, but you still have to create metadata objects, which in an enterprise environment often takes the flexibility away from the programmer and moves it into the hands of a database administrator.

If you look at the programming-language approaches for big data, such as MapReduce-style code or frameworks like Spark, they are very extensible: your whole user code is available, so extensibility is native. However, declarativity is bolted on rather than native. Users often have to care about scale and performance themselves: manage the scale-out, know how to shuffle, deal with the parallelism, and so on. And even in those frameworks that add an additional level of abstraction, such as an embedded SQL, the SQL is often a second-class citizen inside a string, so you don't really get tool support, and you often don't get a system that gives you code reuse and sharing across queries, because what you submit is code, not a script.

So we think U-SQL gives us the benefit of both: the declarativity of SQL combined with the extensibility of a programming language, with both offered as equally native experiences in the language. Besides getting the benefit of both, it also makes things easy for you by unifying (that's where the "U" in U-SQL comes from) unstructured and structured data processing; declarative SQL and custom imperative code; and local queries on your local system as well as remote queries into other data sources such as a SQL database. It therefore increases your productivity and agility from day one and from day 100. So U-SQL,
however, is not a completely new language that we devised in just the last couple of months. U-SQL actually has a long history of background languages on which we based it. It starts with SCOPE, our Microsoft-internal big data language, which gives us the SQL-and-C# integration model as well as the optimization and scaling framework in which we execute both SCOPE and U-SQL today. Right now, SCOPE runs hundreds of thousands of jobs each day within Microsoft; it basically runs the Microsoft business in Bing, Xbox, Office 365, and a lot of other teams inside the company. In addition, we also looked at the long history that both we and the community have around SQL, in the guise of T-SQL in SQL Server as well as the ANSI SQL standard, and we have integrated many of the SQL capabilities into U-SQL, such as windowing functions, the metadata model, and so on.

Let's focus in this presentation on the U-SQL extensibility model: specifically, how U-SQL is integrated with the imperative aspects of C#, and how you can extend your U-SQL query language with C# expressions. In particular, we have three user-defined extension points as well as the native integration of C# via C# expressions in SELECT clauses. Today we will focus on C# expressions and user-defined functions; user-defined operators and user-defined aggregates are covered in some of the resources you will find at the end of the presentation.

So now let's actually take a look at U-SQL. Let me show you how to use it to do some quick data preparation and analytics on some Twitter data that I loaded into my Azure Data Lake account. What you see here is Visual Studio, where we have installed the Azure Data Lake Tools extension, which gives me the Data Lake integration and the ability to write U-SQL scripts from within Visual
Studio. I also have down here my set of comma-separated-value files containing the tweet data of me and some of my colleagues and friends: both our own tweets and all the tweets that mention or retweet me. Let's take a look at one of those files. I just open the file using the stream previewer here, and as you can see, the file contains basically four columns: a date, a time, the Twitter handle of the person tweeting, plus the tweet itself, which may contain mentions of and references to other people as well.

So now let's do some analytics and data prep, so that somebody can later do more in-depth social analytics on my tweets. For that I open a new project, and in particular I choose one of the U-SQL templates. There are a couple of different ones; I'm going to go with the U-SQL Project template, which is the actual place where I write my U-SQL scripts. This opens a U-SQL solution over here, which basically contains a U-SQL file (the script itself) and, behind it, a .cs file, which allows us to put code-behind: C# code that gets deployed together with the script to run my queries. I also have a Submit button up here, plus my U-SQL account name and my default database and default schema, although those aren't relevant at the moment because I'm only operating on files.

So let's write my first U-SQL script. The first step is that I need to extract a schema from the files I want to operate on. We saw it was a tab-delimited CSV file, so let's add the date of type string, and as you can see, as I type, Visual Studio gives me IntelliSense for the U-SQL language and offers me the built-in C# types as the way to schematize my data.
So I say date string, time string, author string, and my tweet string. I do that FROM a location, and I need to EXTRACT it USING a so-called extractor; there is already a built-in extractor for CSV data that I'm simply referring to here. Of course I need to extract this from a file, so I go back to my file explorer, right-click to get my path name, and insert it into the location. But I don't want to do this for just one file; I want to do it for all the files inside the directory. So I can put in a C#-style wildcard pattern that basically extends the path over all the files in that directory.

That's the first part. The second part, of course, is that I do some processing on it (I will get to that in a minute), and at the end I have to write the result back into a file, because we are in a batch service here. So I say OUTPUT @t TO, let's say, "samples/output/tweetanalysis.csv", USING the default built-in outputter for CSV data. This is actually already a valid U-SQL script. It doesn't do much; it just concatenates all the input files into a single output file. So let's now do some more interesting analytics on it.

One thing the tweets contain is mentions. Let me write some U-SQL that gets me the counts for all the authors, plus the counts for all the mentions, into my result. The first thing I do is extract all the mentions. I call this expression @m, and I start a SELECT statement over the rowset: I say SELECT ... FROM @t, and then I have to say what I'm selecting. So I type tweet followed by a dot, and as you can see, IntelliSense gives me access to all the C# functions available on the string data type. In this case I first want to split the tweet into words.
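For orientation, the skeleton of the script at this point (extract, then output) might look like the following sketch. The column names, the file-set pattern, and the paths are my reconstruction of what is shown on screen, not an exact copy of the demo:

```sql
// Schematize all CSV files in the folder. The {*} file-set pattern
// extends the path over every matching file in the directory
// (reconstructed; the demo's actual folder name may differ).
@t =
    EXTRACT date string,
            time string,
            author string,
            tweet string
    FROM "/Samples/Data/Tweets/{*}.csv"
    USING Extractors.Csv();

// Batch service: write the (so far unmodified) rowset back out
// as a single CSV file using the default built-in outputter.
OUTPUT @t
TO "/Samples/Output/TweetAnalysis.csv"
USING Outputters.Csv();
```

Submitting just these two statements already runs as a valid job; it simply concatenates the inputs into one file.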
So I say Split with a whitespace character. Now I only want the mentions, and mentions are all the words that start with an @ sign, so I add a LINQ Where here with a lambda expression that basically checks that the word starts with "@". This is now a column, so I have to give it a name (let's say m, for mention), and I have to wrap it into a type that U-SQL understands as an array; we have a built-in SQL.ARRAY<string> type for that. And off I go.

Now this @m contains only the mentions; I need more. I need a resulting @t that is not the original @t, so I have to do a union. What I want to do first is pivot my arrays into individual rows. I do this by redefining @m as the following query: I select author from the old @m, and CROSS APPLY a construct that applies the EXPLODE expression to each row in the incoming rowset; EXPLODE pivots an array into one row per item in the array. The input column was called m, so I say EXPLODE(m), give the result an alias as well (t, for table, or whatever), and author, which is the column name I use. Now I union this, in my next statement, into my @t: I say SELECT author FROM my original @t, and UNION that with SELECT author FROM my @m. That puts them together, but I actually want to preserve which kind of author each row is, so I add a category: the first ones get "author" AS category, and the second kind get "mention" (whoops, typing correctly) AS category.

This gives me the union of the two, and now I can do an interesting GROUP BY, for example. That's my resulting @t here, and what I do is SELECT author and category (and if I first add the FROM @t down here, I actually get IntelliSense up here as well), then COUNT(*) AS tweetcount FROM @t. Now I need a GROUP BY here.
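Pulled together, the rowset pipeline being narrated here, including the aggregation completed in the next step, might look roughly like this sketch. The identifiers are reconstructed from the narration, and I've written UNION ALL, which keeps duplicate rows so that the counts come out right:

```sql
// Extract the @-mentions of each tweet with an inline C# expression.
@m =
    SELECT new SQL.ARRAY<string>(
               tweet.Split(' ').Where(w => w.StartsWith("@"))) AS m
    FROM @t;

// Pivot each array of mentions into one row per mention.
@m =
    SELECT author
    FROM @m
         CROSS APPLY EXPLODE(m) AS t(author);

// Union the tweet authors and the mentioned users,
// preserving which kind of "author" each row is.
@t =
    SELECT author, "author" AS category FROM @t
    UNION ALL
    SELECT author, "mention" AS category FROM @m;

// Count rows per author and category.
@t =
    SELECT author, category, COUNT(*) AS tweetcount
    FROM @t
    GROUP BY author, category;
```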
So I say GROUP BY author, category, and voilà: I have basically already done a simple analysis inside my script. There's a squiggly mark here because I need to use the uppercase AS, not the lowercase as, which in U-SQL is the C# type cast. Now let's just add an ORDER BY on my output so I can look at it (tweetcount descending), and I basically have my first cut at it. You will notice that up here I very easily integrated my C# expression into my code, without a lot of wrapping; it just flows naturally into the language. In a minute I'll show you how to do the same with user code, using the code-behind.

Now I can go and submit this script to my Azure Data Lake, and it will execute. The tool shows me here a nice job view of how the job is prepared, queued, run, and finalized, and it shows me job graphs and so on. While it's running, let me go back to the script and show you how I can refactor my code. Let's assume I don't really like having a lot of C# code floating around in my script, because it doesn't tell me at a glance what it's doing. So I use the code-behind capability to wrap that expression into a C# function that gets deployed together with the script but is managed separately in my .cs file.

I copy the expression and go over to my code-behind file, where a namespace is already pre-populated. I add System.Linq, because I'm going to use the Where clause, and instead of USqlApplication1, let's call the namespace TweetAnalysis. Then I add a public class (let's call it Udfs, for user-defined functions), and in it a static public function. Let's call it GetMentions; it returns a string, or actually an array of strings, so SqlArray<string>, and it takes a string called tweet. Now it has to return something. And what does it return?
Oops. It actually returns this expression here. This part I have to fix up, because what I pasted used the U-SQL type name, and I need the C# type name here. And with that I have my function GetMentions inside this namespace and class. So I can just go back and replace all of the inline code with a reference to my namespace: TweetAnalysis.Udfs.GetMentions(tweet). And off we go; this does exactly the same thing, and I can submit this script as well.

Once the query is running, you will see the job graph of the actual execution tree appearing in the job graph window over here. The green boxes are work units that have already completed, the blue ones are the ones currently running; up here you see the indication of the input, the seven CSV files I had, and down here it shows the output that will be produced. While the script is running, I can show you which script actually got submitted to the service. If I click the Script button down below, it opens the submitted script, and you will notice that, because I have a code-behind file, the tooling registers an assembly on my behalf and references it, so that the code inside the code-behind file becomes available within my U-SQL script. At the end, of course, there is a footer that drops the automatically generated assembly again.

Now, this means I can also do all of that myself, for example if I want to share the assembly with other people or want to have it available in my U-SQL metadata store. I go to my solution over here and add a new project, this time a Class Library (For U-SQL Application). I add it, and then I go up here inside my script.
I basically copy the code, go down to my class inside the C# project, and paste it over. Then I can register the assembly. Let's assume I register it inside my account, in the master database, under the assembly name TweetAnalysis. I submit this job, and it goes and registers the assembly. Now I go back to my script over here; the only thing left to do is to say, at the beginning of the script, REFERENCE ASSEMBLY TweetAnalysis (since we are in the master database context), and then I delete the code-behind file, because I don't need it anymore. Once the assembly has been registered I can submit the job, and I get exactly the same behavior, except that now everybody can use the code that I have written and shared with my users inside the database.

So as you can see, it was very easy to combine the expressive power and the parallelization abstraction of SQL with all the capabilities of a procedural language like C#, to run custom code and do your own processing specific to the data you are analyzing.

Now that you have seen how easy it is to extend U-SQL with user code, let me mention some additional interesting features that U-SQL provides. One, as hinted at in the presentation, is that you can operate over a set of files with patterns. You can also create tables that are partitioned based on the data. You can have federated queries that not only query the data in your Azure Data Lake files and tables, but also query Azure SQL Database, Azure SQL Data Warehouse, and SQL Server instances running in an Azure VM. We have the standard relational metadata objects, such as views, table-valued functions, and procedures, that allow you to manage your U-SQL code and modularize it as well.
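Before moving on: to make the code-behind refactoring and the share-via-assembly workflow from the demo concrete, here is a sketch of what they might look like. The class, method, and assembly names are the ones narrated; the DLL path and the use of the DDL form are my assumptions (the Visual Studio tooling can also register the assembly for you):

```csharp
// Code-behind / class-library code, compiled into the shared assembly.
using System.Linq;
using Microsoft.Analytics.Types.Sql;  // SqlArray<T>, the C# counterpart of SQL.ARRAY<T>

namespace TweetAnalysis
{
    public class Udfs
    {
        // Returns the @-mentions contained in a tweet as a U-SQL array.
        public static SqlArray<string> GetMentions(string tweet)
        {
            return new SqlArray<string>(
                tweet.Split(' ').Where(w => w.StartsWith("@")));
        }
    }
}
```

```sql
// One-time registration, run in the master database context.
// The DLL path is a hypothetical example.
CREATE ASSEMBLY IF NOT EXISTS TweetAnalysis
FROM @"/Assemblies/TweetAnalysis.dll";

// In the analysis script: reference the shared assembly and call
// the UDF instead of the inline C# expression.
REFERENCE ASSEMBLY TweetAnalysis;

@m =
    SELECT TweetAnalysis.Udfs.GetMentions(tweet) AS m
    FROM @t;
```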
We also offer you the whole range of SQL windowing functions and expressions, so you can do not just data preparation but also some forms of U-SQL-based analytics, and we offer complex types such as maps and arrays, part of which you have seen during the demonstration as well.

I think you'll agree with me that U-SQL gives you an easy way of processing large amounts of data stored inside the Azure Data Lake. It natively unifies the declarativity of SQL, the abstraction that means you don't have to care about scaling out and parallelizing, with the ability to extend it with your custom code using C#. It gives you a unified way of querying structured and unstructured data, the ability to query both local and remote data sources, and it will increase your productivity and agility from day one forward.

So I would like to encourage you to go and get an Azure Data Lake account, join the public preview at www.azure.com/datalake, and once you use it, give us some feedback via our survey or at any of the feedback locations that we offer. You can find additional resources about Azure Data Lake on, of course, the Azure Data Lake website; there are several Channel 9 videos available that dive deeper into some of these aspects; and U-SQL resources are available on the Visual Studio blog, in two entries that cover in more detail what I presented today, as well as in the U-SQL reference documentation and on the community and team websites. You can also reach me at my email alias, at our feedback email alias, or on Twitter at @MikeDoesBigData. Here are additional resources you can follow up on about the Connect event and other interesting topics (U-SQL, Azure Data Lake, Visual Studio, and more) on Channel 9, MSDN, and elsewhere.

I would like to thank you for your attention, and I hope to see you on
the forums and in other places, talking U-SQL and exchanging our experiences processing big data.