Hey everyone, I hope you can hear me. I'm pretty jet-lagged, so excuse me if I start yawning in the middle. My name is Rita. I'm traveling from San Francisco, and my colleagues and I are visiting Singapore to work with customers and startups in this area.

So what do I do? I'm an open source engineer. I work a lot with startups and open source communities, mainly to understand the pain points you have and to unblock any issues you run into when working with open source platforms on Azure or Windows. So if you're working with an open source platform and having issues with it, please let me know afterwards; I'd love to hear what your problems are and see if our team can work with you to get you unblocked.

Cool. Since I'm an open source engineer, I want to get a sense of the room: how many of you have tried Azure HDInsight? Okay. How about Spark in general; have you tried running it on your laptop? Great, awesome. And has anybody tried running incremental processing with Spark? Okay, awesome. This is going to be a good talk, then; I was afraid the reaction would be "been there, done that."

So let me tell you a little about the project our team has been working on. We partnered with the UN to look at leveraging social media to identify areas in the Middle East where people may need aid, for example after attacks, or when people need help. The UN is basically leveraging people's tweets and Facebook posts to identify issues before they become really major. This is where we helped the UN crawl the web and use people's tweets to analyze: is this a happy tweet? Is it a sad tweet? Should we be concerned about this? And then aggregate that.

And this is where Spark comes in. Across all the tweets, we use natural language processing to analyze whether each tweet is negative or positive, and then we aggregate that. We also figure out whether, within a geographic area, people are tweeting about a particular keyword. If that's trending, the bubble on the map gets bigger, and that's when the UN treats it as an alert and goes to see whether it's a real concern.

At a high level, as I mentioned earlier, we developed this in different modules, but essentially there's a natural language processor that extracts the actual tweet. In a little bit I'll show you what a tweet looks like; I don't understand it myself, because it's in another language, but we also have a translation engine in the back. Basically, we take the raw data and put it in storage. Most importantly for today's talk, we run the natural language processing, parse out the keywords, and then do analytics on the keywords: aggregating, counting the occurrences of each keyword, pulling out information like location (lat/long), and then putting the results in blob storage.

This is all great, and we get this data at a consistent pace. But as anyone who's used Spark knows, this is a pretty expensive operation. In our case we couldn't really leverage Spark Streaming, because we don't have constant data coming in; if we did, streaming would have been way more efficient.
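To make that cost concrete, here's a minimal PySpark sketch of what the full-batch version of this aggregation could look like. The column names (keyword, lat, lon, sentiment), the paths, and the crude ten-degree tile bucketing are my assumptions for illustration, not the project's actual code:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("tweet-analytics").getOrCreate()

# Every parsed tweet ever collected -- the whole data set, on every run.
tweets = spark.read.json("input/parsed_tweets/")

# Bucket lat/long into coarse map tiles and aggregate per keyword and tile.
per_tile = (tweets
    .withColumn("tile_x", F.floor((F.col("lon") + 180) / 10))
    .withColumn("tile_y", F.floor((F.col("lat") + 90) / 10))
    .groupBy("keyword", "tile_x", "tile_y")
    .agg(F.count("*").alias("count"),            # drives the bubble diameter
         F.avg("sentiment").alias("sentiment"))) # drives the bubble colour

per_tile.write.mode("overwrite").json("output/summary")
```

The problem is the first read: as the tweet history grows, this job rescans and re-aggregates everything, which is exactly what the incremental approach below avoids.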
On our Spark cluster, it would be very expensive to constantly reprocess the entire batch. Imagine a lot of tweets: if you always have to re-aggregate the entire data set, that's very, very expensive. So what we've done instead is process the data on a daily or hourly basis, since it doesn't come in that often. We summarize the data into a partial summary, and then every time a small block of delta arrives, we process just that block and merge it with the results we already have. That gives us the new summary. The next time new tweets arrive, we process just the new slice, and the same thing happens again and again.

For today's demo, I'm going to show you how this works on my laptop, purely with Spark, no Azure, no Microsoft; I'm using a Mac. But I also want to show you that it's really, really easy to spin up your own HDInsight cluster. The reason is that, I don't know about you, but I don't like maintaining a six-node Spark cluster just for simple tests. So normally I create the cluster and just run my job on it, and I don't have to worry about it. I'll show that in a little bit.

This is just to show you how I created my cluster, because creating the whole cluster takes about 20 minutes and I don't want to make you wait. Basically, in Azure you say you want to create an HDInsight cluster, and you can specify what type of cluster you want, whether that's Spark or Storm or the others you've seen before. You can also specify which version you want to run, and whether to run on Linux or Windows, depending on your requirements.

Okay, let's go to the demo. By the way, any questions so far? All right. As I mentioned earlier, this is an example of the data set. Like I said, I don't understand this language, but we run it through our translation service, which translates it into something we can understand and aggregate on. From the raw tweet we then extract a bunch of fields: the keywords, which section it falls under, the location and time, the lat/long information, and so on. So imagine this kind of data coming in every hour or so.

Wow, this is really tiny, but this is what the results look like in the end. Remember the map we saw earlier, with the tiles of data? Essentially, we take this data and organize it by keyword; each keyword is then broken up by day, and within each day, each JSON file is a set of tiles. Each tile gets a count and a negative/positive score. The color of the bubble you saw, red for negative, comes from that score, and the count determines the diameter of the circle.
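Here's a minimal sketch of that partial-summary merge, assuming a summary schema of (keyword, tile, count, sentiment); the paths and column names are placeholders, not our actual code. The key points are that only the small delta gets the expensive pass, and that sentiment is re-averaged weighted by count, so merging summaries gives the same result as a full re-run:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("incremental-merge").getOrCreate()

# The running summary from all previous runs: one row per (keyword, tile).
previous = spark.read.json("previous_jobs/latest")

# Only the new slice of parsed tweets since the last run.
delta = spark.read.json("input/new_slice")

# Summarize just the delta -- the only expensive work in each run.
delta_summary = (delta
    .groupBy("keyword", "tile")
    .agg(F.count("*").alias("count"),
         F.avg("sentiment").alias("sentiment")))

cols = ["keyword", "tile", "count", "sentiment"]

# Merge the two partial summaries and re-aggregate. Sentiment is a
# count-weighted average, so the merged summary stays mathematically correct.
merged = (previous.select(cols).union(delta_summary.select(cols))
    .groupBy("keyword", "tile")
    .agg(F.sum("count").alias("count"),
         (F.sum(F.col("sentiment") * F.col("count")) / F.sum("count"))
             .alias("sentiment")))

merged.write.mode("overwrite").json("previous_jobs/2016-11-08T0410")
```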
[Audience question about the folder structure] That's a great question. The UN actually helped us with that. If you look at this folder structure, they wanted the data organized into groups; you can see men, people, women, youth. It's their classification for how they want to categorize the tweets. It's arbitrary to me, but it makes sense to them. To your point, though, the most important one is keyword, because as you can see here, you get things like "attack," "Benghazi," "bomb": keywords they often see in real situations, and that's what they want to look at.

Okay, so without further ado: this is running locally, and I'm telling it to run incrementally. As I mentioned earlier, we've created this "previous jobs" folder. Everything in here is a summary of all the jobs I've ever run, and every time I get a new slice of data, we merge it with that previous summary. Let's take a random keyword, say this one; I'll just copy it. Remember the count is 2 right now, so the next time I run the incremental job, this should go up. In the demo I'm feeding in the same data set again, which obviously wouldn't happen in real life.

So I'm just going to run this. What it's doing is loading the previous summary, summarizing the current set, and merging the two data sets together. The reason I picked this specific topic is that it's something you only encounter when you don't have constant data streaming but you still care about cost.

Okay, cool. Now you can see there's a new folder with a new timestamp: November 8th, 4:10. If I pick one of these files, you can see this is the data set I showed you previously, and now the count is at 3. It's a simple way to see that it gets aggregated over time.

Great, so this is all wonderful: I'm a great developer, it works on my machine. But what if I want to run this in production? Cool. Here is the cluster I created previously; remember, I showed you a screenshot of how to create it. In that screenshot, after clicking around, I specified that I want a Spark cluster with this version of Spark, running on Linux VMs, and as you can see I have six nodes: two head nodes and four workers. From here I can launch the dashboard, which you've seen before, where you can get to all your Hadoop features. Here it's telling me my current Spark information, and over here is a summary of all the past Spark jobs I've run on this cluster. Again, what's nice is that this is all open source; Microsoft didn't write this. If you've worked with this stuff before, you can run it on your laptop, you can run it anywhere; it just happens that Microsoft enabled it on Azure. You can see the jobs and their summaries, whether each one went through successfully, and while a job is running you can also see how many tasks are running and which node each is running on; remember, we have six nodes.
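For the curious, the folder bookkeeping in the demo can be as simple as something like the sketch below. The `previous_jobs` layout and timestamp format are just what you saw on screen, and `merge_summaries` is a hypothetical helper standing in for the merge step sketched earlier:

```python
import os
from datetime import datetime

SUMMARY_ROOT = "previous_jobs"   # assumed layout: previous_jobs/<timestamp>/

def latest_summary_dir(root=SUMMARY_ROOT):
    """Most recent summary folder, or None on the very first run."""
    runs = sorted(os.listdir(root)) if os.path.isdir(root) else []
    return os.path.join(root, runs[-1]) if runs else None

def new_summary_dir(root=SUMMARY_ROOT):
    """Each incremental run writes to its own timestamped folder."""
    stamp = datetime.utcnow().strftime("%Y-%m-%dT%H%M")
    return os.path.join(root, stamp)

# Typical run: merge the newest summary with the fresh slice, write to a new
# folder, and only delete the old folder once the new one exists (as in the
# demo):
# merged = merge_summaries(latest_summary_dir(), "input/new_slice")
# merged.write.json(new_summary_dir())
```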
Okay, I'm just going to quickly show you how that works. Now that my cluster is up and running, I've SSH'd into one of the head nodes. Once I'm in that node, I did a git clone to get my code here, and then here's a script where I specify my storage account information, because we have input and output files, plus the containers those files go in, and, right here, the part that tells Spark to run this incrementally. Does that make sense? Okay.

Once that's done, I'm just going to run the script. But before I do, as you can see, this summary is from yesterday; I know this is US time, but this is a run I did yesterday. And if you look here, wow, that's tiny, but you can see it, right? It's exactly the same concept and the same data set as the files I had locally. The count is 8 right now. So I'll run it, and it will run incrementally; the only difference is that now it's running in the cloud, and I don't have to worry about my laptop going crazy.

Once it's done, you'll see two folders under "previous"; it deletes the old one once the new one exists. It's still running... cool. Now you can see a folder for today, and if we go in here and pick one of these data sets, cool: this is the new aggregated summary with the latest data set.

So I think that was pretty much all I wanted to show; I wanted to give you a real-world scenario. Here are my Twitter and GitHub accounts if you want to check them out; we'll be releasing the code soon.

[Audience question about the output format] Right, it's all JSON. That's exactly why we keep it in JSON: the front-end JavaScript code can just parse it. And if you saw the result set, it's all tiles. The way we've done it is that each tile also carries the aggregate count of all the tiles below it, so if you zoom into the map it's actually the same count; each JSON file is an aggregate of all its children tiles, if that makes sense. (I'll include a small sketch of this roll-up at the end.)

[Audience question about Spark versus Hadoop] People in this room can probably answer this better than I can; I'm only a user, so I can only speak to my experience. It's definitely the more efficient, newer way of using Hadoop, and it's really good at map and reduce. As you saw earlier, we map all the data and split it into chunks so that all six nodes can do aggregation at the same time. That's what it's good at: parallel processing across the data, and when it's done, aggregating the results together. It's more for big data than machine learning, though there's also programming inside Spark, and you can use other languages and Spark tools. I use PySpark, so as you saw earlier, my scripts just spark-submit a job.

[Audience question about what Azure adds] Imagine Azure is just a laptop sitting in the cloud somewhere. There's really nothing specific to Azure in this case, except one thing: where I store the data. When I run it here locally, everything is on the file system on my machine; when I ran it in Azure, all my files were in Azure blob storage.
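Concretely, the only change between my laptop run and the cluster run, as far as the job is concerned, is the paths: on HDInsight the input and summaries live in Azure blob storage, addressed with the wasb:// scheme. A small sketch, with placeholder account and container names, and assuming the Spark session from the earlier sketches:

```python
# Local run: plain file-system paths.
tweets = spark.read.json("input/new_slice")

# Cluster run: the same code, pointed at blob storage via wasb://
# (placeholder storage account and container names).
tweets = spark.read.json(
    "wasb://tweets@mystorageaccount.blob.core.windows.net/new_slice")

# Output goes back to blob storage the same way.
# merged.write.mode("overwrite").json(
#     "wasb://summaries@mystorageaccount.blob.core.windows.net/2016-11-08T0410")
```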
[Audience question about data volume] Honestly, it just keeps growing. Right now we're not actually running continuously, because this isn't in production yet; the UN is still evaluating it. But even so, we already had to run it every hour, as opposed to constantly, because the data is just too much. I can't give you a number right now; it keeps growing, so whatever I tell you is probably wrong. And I think that's also why we decided to go down this route: at any given time you're only processing the new stuff, whether that's a lot or a little. Before, we had to process everything, and that took hours, and we said, this doesn't make sense; let's only process the fresh data, alert the UN that there could be a problem, and let them get resources and send help.

[Audience question about use cases] Yes, humanitarian aid: things like food shortages or terrorist attacks. They're relying on this to help them understand where the problem could be in a more quantifiable way, as opposed to "I saw some tweets."

[Audience question about the NLP model] I can't answer that right now because I didn't work on that part, but I'll be happy to follow up with you. I believe we're using an open source NLP model; I've been meaning to look at that code but never did. And remember, this isn't in production, so it's not like we're running it all the time.

[Audience question about storage] Remember the architecture diagram I showed earlier: we put the raw data in table storage, but the keywords and the analytics output are all in blob storage; those are the files you saw. The files are for analytics purposes; the raw data, like the location information, we store somewhere else.

There's no good answer for that last one; we just treat all the tweets the same. But I don't know.
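Going back to the earlier question about the tiles: one way to pre-aggregate children tiles into their parents, so the front end sees the same totals at every zoom level, is a roll-up like the sketch below. The (keyword, zoom, tile_x, tile_y, count) schema and the standard web-map convention that a parent tile sits at (x // 2, y // 2) one zoom level up are my assumptions, not necessarily how our code does it:

```python
import pyspark.sql.functions as F

def roll_up(tiles, levels):
    """tiles: DataFrame(keyword, zoom, tile_x, tile_y, count).
    Returns the input plus aggregated rows for each coarser zoom level."""
    out, current = tiles, tiles
    for _ in range(levels):
        # Four children collapse into one parent tile one zoom level up.
        current = (current
            .withColumn("tile_x", F.floor(F.col("tile_x") / 2))
            .withColumn("tile_y", F.floor(F.col("tile_y") / 2))
            .withColumn("zoom", F.col("zoom") - 1)
            .groupBy("keyword", "zoom", "tile_x", "tile_y")
            .agg(F.sum("count").alias("count")))
        out = out.union(current.select(out.columns))
    return out
```

Writing every zoom level out as JSON up front is what lets the front-end map stay a dumb renderer: it never has to re-aggregate, it just reads the tile file for whatever zoom level the user is at.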