So, user interaction and social data tell us a great deal about the real world. By understanding the time-dependent nature of the discussions on social media and the public web, we're able to identify and sometimes even predict important real-world trends. Today, Christopher and I are here to talk about Project Fortis, a project aimed at accelerating humanitarian relief, all powered by Kubernetes. Just a brief introduction of myself: my name is Eric Schlegel. I'm a principal engineer at Microsoft on a team called the Commercial Software Engineering team. My passion, in terms of the projects I like to work on, is how we can use emerging technologies to help accelerate Microsoft's most innovative partners. And I'm Christopher Coole, an engineer on Eric's team. I'm kind of what you'd think of as a chronic early adopter; there's just something about working with undocumented software that attracts me. I'm not sure what it is, but I keep doing it, and I've been doing it for many, many years. You'll find most of my current work on my GitHub site, at the top, or you can reach me on Twitter. As Eric mentioned, we work on the Commercial Software Engineering team, and that's a very unique, special place to be. Our charter is about unlocking and accelerating innovation with all sorts of customers: we get to work with small startups, large enterprises, and humanitarian organizations like the United Nations. This talk will showcase some of that work. But we don't just engage with those customers; we do this as an investment, so there's no charge when we engage. We really do this for the community, to accelerate the body of knowledge out there. The only string attached to our engagements is that we're able to talk about them, that we contribute our knowledge back into the open source community.
So if you ever find yourself with a project that's just a little bit scary, going into uncharted territory, give us a shout; maybe we can find a project we can work on together. The Project Fortis story started a little over a year and a half ago through a collaboration with the United Nations, specifically a team called OCHA. This part of the UN is responsible for providing humanitarian relief for vulnerable people in the midst of humanitarian crises, and for trying to provide that aid as quickly as possible. The targeted scenario they had at the time was the post-Gaddafi situation, the refugee displacement situation in Libya. Their main focus at that point was trying to get more insight around refugee displacement, terrorist attacks, and famine occurring in that area. Their process for assessing the humanitarian situation was to manually collect information related to a specific target set of topics in that area: they were collecting Twitter feeds, Facebook feeds, public news sites, and listening to radio broadcasts. This process was happening every single day, with feet on the ground. Because of the manual nature of the process, the aid plans became imprecise: the turnaround time between collecting this information and being able to assess the impact on the ground was a huge challenge at the time. The other obvious challenge was the number of data sources that had to be manually curated. And areas like Benghazi were quite inaccessible because of the tumultuous situation happening there, so the team was operating out of Cairo, the closest place they could access. So with Project Fortis, the main goal was: how do we accelerate the aid planning process?
Well, step one was to sit down with the operations team in Cairo and figure out what types of data sources they were collecting. The idea was: if we could listen to all those data streams (Twitter, the public web, radio) and collect all that information, we could use a variety of machine learning and NLP technologies to do topical extraction on that data, identify the locations people are talking about, extract the sentiment of the conversation, and extract which organization, person, or event is being mentioned. We would then identify atypical increases of mentions within those channels of information, and provide that level of insight to the aid planners so they can make an appropriate decision about where to send their aid workers. Timing was absolutely critical for our use case, so we wanted to build a data processing pipeline that was as near real time as possible. With Fortis, we got the turnaround time to less than 15 seconds from when an event is actually tweeted to when we've processed it, aggregated it, and made it available on the dashboard. We also wanted to build something repeatable, something adaptable to other scenarios. So we took the same exact Fortis solution and repurposed it, through a collaboration with Amia University and the WHO, to help identify and predict where the Zika virus was spreading. The way we did that was to look at weather forecast data to predict where mosquitoes were migrating, and then infer whether those regions, those areas, were high-risk zones.
Then, layering the social media data we have on top of that weather pattern data, we're able to assess whether people are talking about fever or Zika-like symptoms, and say with a high level of confidence that Zika is in this place. You could also layer in things like interventions, where you have health workers actually in the area trying to remediate the disease. Now, a bit about the functional architecture of Fortis. We built streaming connectors, all based on Spark, collecting real-time streams from things like Twitter, Facebook, even Instagram, and RSS feeds. We also supported bring-your-own-data-source via Kafka, so if you had custom data you could just pump it into Kafka and it would feed into the pipeline. Another really cool thing we did was tap into radio feeds: taking real-time audio streams as they come in, converting them to text, and then running entity extraction on that text. Our Kubernetes cluster runs on Azure, on the Azure Container Service, and we're using Spark Streaming. The data stream comes in, and we distribute that data through Spark RDD partitioning. Then, based on the events people are talking about, we run entity extractors to figure out mentioned places, even down to a street level or corner or a district or administrative boundary, and we tag the event to that place. We then build a time series representation: take a snapshot per minute, per hour, per day, per week, per month, per quarter of the context of the discussion across all these verticals. That way, once we index it by geotiles, we get both a spatial and a time-dependent view on a set of topics and entities.
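Those per-minute/hour/day snapshots amount to counting topic mentions per time bucket. Here is a minimal, illustrative Python sketch of that rollup (hypothetical names; the real pipeline does this with Spark aggregations and tracks more granularities and entity types):

```python
from collections import Counter
from datetime import datetime, timezone

# Illustrative time buckets; the talk also mentions week/month/quarter rollups.
def period_keys(ts: datetime):
    """Return the rollup keys one event contributes to."""
    return {
        "minute": ts.strftime("%Y-%m-%dT%H:%M"),
        "hour": ts.strftime("%Y-%m-%dT%H"),
        "day": ts.strftime("%Y-%m-%d"),
    }

def aggregate(events):
    """Count topic mentions per (granularity, period, topic) bucket."""
    counts = Counter()
    for ts, topic in events:
        for gran, key in period_keys(ts).items():
            counts[(gran, key, topic)] += 1
    return counts

events = [
    (datetime(2017, 12, 6, 10, 15, tzinfo=timezone.utc), "refugees"),
    (datetime(2017, 12, 6, 10, 42, tzinfo=timezone.utc), "refugees"),
    (datetime(2017, 12, 6, 11, 5, tzinfo=timezone.utc), "famine"),
]
agg = aggregate(events)
# Both "refugees" mentions fall into the same hour and day buckets.
```

An atypical-increase detector would then compare a bucket's count against the historical baseline for the same topic and granularity.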
While using a variety of cognitive services for speech analytics and text analytics for sentiment, we used Lucene for entity recognition. So we built a whole bunch of analyzers within Spark, we filter the content based on a geofence that the user defined, and then we push the data into Cassandra. That data is then served on an administrative dashboard and a user dashboard, and the backend APIs are all based on GraphQL. The reason we did that is that we want to give the end user the flexibility to query the data however they want. Our solution was built on Kubernetes. We wanted to run Spark on Kubernetes, and the main driver was increasing the resource utilization of all of our pods. We didn't want a VM stood up just to run a single Spark worker; we wanted to maximize all the nodes we had within our cluster and decrease our operational cost as much as possible. A simplified deployment model as well. One of the great advancements happening in the Spark and Kubernetes space is the Kubernetes-native Spark integration, a fork of Spark with open source contributors that originated from Bloomberg, Palantir, and a bunch of other companies as well. What it is, is a pluggable scheduler backend that talks directly to Kubernetes, schedules pods, and elastically scales out the number of workers depending on how the Spark job is progressing. They provide a shuffle service that they're able to scale out automatically, and a staging server where you're able to containerize your underlying data sources and your underlying Spark files. The great thing about that is that when you submit your Spark job, you specify your Kubernetes endpoint, and Kubernetes acts as your cluster scheduler.
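The user-defined geofence filtering mentioned above can be sketched as a simple bounding-box check. This is a minimal, illustrative Python version (hypothetical names and coordinates; the real analyzers run inside Spark):

```python
# Minimal geofence sketch: keep only events whose coordinates fall inside
# a user-defined bounding box given as (south, west, north, east) degrees.
def in_geofence(lat, lon, fence):
    south, west, north, east = fence
    return south <= lat <= north and west <= lon <= east

# Rough bounding box around Libya (illustrative values only).
libya = (19.5, 9.3, 33.2, 25.2)

events = [
    {"text": "protest reported", "lat": 32.1, "lon": 20.1},  # near Benghazi
    {"text": "unrelated event", "lat": 48.2, "lon": 16.4},   # Vienna
]
kept = [e for e in events if in_geofence(e["lat"], e["lon"], libya)]
```

Only events inside the fence flow onward into Cassandra and the dashboards.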
So the whole concept of the standalone master pod within Spark completely goes away. The other big driver for using Kubernetes for Spark was Helm: being able to have one single convention for deploying our stack, as well as Deis for serving GraphQL and our React endpoint. There's also the streamlined developer experience: kubectl gives us one CLI to monitor our entire cluster, high availability in Kubernetes freed us from having to run ZooKeeper, and it minimized our operational management stack. Then there's the ability to elastically scale out our workers via kubectl and max replicas. It also simplified our DevOps, being able to leverage tools like Prometheus, the Spark UI, and the Kubernetes dashboard to monitor, log, and raise events in certain scenarios. Just to talk through a bit of the deployment model: all the code for Fortis is open source. The way we deploy our Spark job is that once you cut a new release on GitHub, that kicks off a Travis CI build; we then use sbt assembly to generate the fat JAR and post it to Azure Blob Storage. And Chris is also going to talk more about the Kubernetes deployment. Does it come through now? Excellent, cool. So for the Kubernetes part of the conversation: first of all, working on a project like this that's deployed by the United Nations is, I think, a real way for us to make a difference, so it's a point of pride that we get to present here. On the Kubernetes side, we've worked on the deployment quite a bit. We started with the Helm charts that you find on the Kubernetes repo, but we made some interesting modifications to meet our needs. As Eric mentioned, density and resource utilization was one of our goals, so we're deploying two namespaces into a single cluster: one namespace for Spark and one namespace for Cassandra. We modified the templates a little bit on the Spark side; we upgraded to the latest version, 2.2, and we're eyeing 2.3 at the moment.
So we have Spark Streaming and all the latest features available. We also made some changes to be able to kick off jobs as we launch the applications. If you go to our CatalystCode repo, you'll find a chart that has those modifications. From a Spark perspective, we run a single master that we kick off as a deployment, then we have a stateful set of worker nodes, executors, that run the Spark jobs. We also have the Cassandra nodes running in the cluster. With Cassandra, we optimized the chart mostly with an eye on high availability within Kubernetes, and on how things run in Azure. We're also using stateful sets here, because pod identity becomes very interesting when you try to discover services; we'll talk about that in a second. Who was at the keynote this morning? Everybody? Everybody saw Kelsey? Kelsey was talking about spinning up Kubernetes clusters. He made it sound like a big deal; for us, because we use the Azure Container Service, spinning up a cluster is a one-command-line kind of deal. You type a bunch of letters, you wait five minutes, and you have a cluster. So the Azure Container Service is what we use, and we actually use a derivative of that. It's a way to spin up a cluster, but it is your cluster: it's not managed by Microsoft, and you have some control over the configuration. It has, let's say, reasonable defaults for networking, for the distros, and so on. But if you have advanced configuration needs like some of our customers do, for example, they want to run a GPU-enabled VM, or they have a very special distro that they want to run on, a special version of CoreOS, say, we also have what we call the ACS engine. That is a way to provision very advanced configurations of Kubernetes clusters, all the way down to controlling the pod CIDRs, the service networks, and so on.
So you get to customize the address space, which we also did for something we're going to talk about in a second. And a little plug for our employer: we also have AKS, the Azure Container Service with a K, not with a C. This is a fully managed service where Microsoft maintains the operating system and the Kubernetes runtime, and won't even charge you for the masters; you just pay for the agent nodes. So standing up the cluster is the smallest of our worries. We did some extensive research on how to configure the cluster from a networking perspective, and we benchmarked it. The benchmark we ran takes activity data, data about people going on runs and bicycle rides and things like that, and produces heat maps from it. Here on the slide you see heat maps for people running or bicycling in Austin. It's a rather large data set that we use for benchmarking, and we have a configuration that exercises the network and avoids resource contention: a Kubernetes cluster where we run Spark, and virtual machines where we run Cassandra, connected, of course, via the network. The Azure CNI plugin was fairly new during our project; it came out sometime this summer. We wanted to compare it to Calico, which many of our customers were using. We still get a good performance advantage out of the Azure CNI, about 10 to 15 percent, which is what people expect when you compare native networking to Calico. So we were quite happy that this was indeed the result we expected going into the benchmark, and our cluster is now configured to run Azure CNI. We also did quite a bit of work around high availability, specifically with regard to the Cassandra setup.
You're probably familiar with the concept of fault domains, where your cloud provider makes sure that your virtual machines are truly highly available. When we provision for high availability, we typically say, okay, we're going to provision more than one, maybe two or three virtual machines, and we hope that things go right. Now, there's a little bit of trickery to that: if you imagine you have three virtual machines and they all run in the same rack, and that one rack has a problem, somebody trips over the power cord (hopefully not), or the top-of-rack switch goes bad, then all three VMs would be unavailable and your application would essentially be gone. The same is true for containers, in the way we're placing containers in our cluster; I had this conversation very recently with an engineer over at Mesosphere. Just because you're deploying more than one replica into a cluster doesn't necessarily mean that you're highly available. Even if you have three replicas of, let's say, Cassandra running, if all of those replicas run in the same rack and that rack has a problem, then you still have an issue, and your high availability setup wasn't as good as it should have been. So what we did: Cassandra has built-in high availability features where, when you tell Cassandra where the racks are and where the data centers are, it automatically places data across the ring so that even if a single node fails, all the data is still replicated across the ring. We also had some extra setup in the chart to make sure we only have one Cassandra pod running on each node, so we don't have duplication and we truly have the key space replicated around the entire cluster multiple times. The feature in Cassandra is the GossipingPropertyFileSnitch, which I think is a great word, and that's what we configured.
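The effect of that rack-aware configuration can be illustrated with a toy placement function: given nodes labeled with their rack, choose replicas so no two land in the same rack. This is a simplified, hypothetical model of what Cassandra's rack-aware placement achieves, not Cassandra's actual token-ring algorithm:

```python
def place_replicas(nodes, replication_factor):
    """Pick up to `replication_factor` nodes, at most one per rack,
    so a single rack failure cannot take out every replica.
    Toy model of rack-aware placement, not Cassandra's real algorithm."""
    chosen, used_racks = [], set()
    for node, rack in nodes:
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
        if len(chosen) == replication_factor:
            break
    return chosen

nodes = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack2"), ("n4", "rack3")]
replicas = place_replicas(nodes, 3)
# n2 is skipped because rack1 already holds a replica.
```

With naive placement, n1, n2, and n3 could be chosen, and a rack1 failure would take out two of three replicas at once.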
We configured it in a custom container image that we have out on Docker Hub right now. EC2 has its own snitch, and we should probably write our own at some point, but for Azure this actually works quite well. This enabled us to do a multi-data-center setup in Azure for high availability. We can take that container, because it has the data center information and the rack information baked into each Cassandra image, and deploy it into two data centers. So I bring up two data centers, two Kubernetes clusters, and then provision a single Cassandra ring across them with a single line of script. We were quite proud to show that. But not everybody wants two Azure data centers. We have some customers that say, okay, we actually have our own data center that we would like to use for high availability; can you stand up a hybrid Cassandra ring over the on-prem Cassandra that we already have and extend it into the cloud? And we can certainly do that with the networking connectivity features between an on-prem environment and Azure: in this case it would be Azure ExpressRoute connecting the on-prem environment and the cloud environment to form a single network, and then we can bring up Cassandra across the two environments. The last one we thought we'd show is our work in progress. We learned in the keynote this morning that multi-cloud is a big deal, and we were looking at it like, yeah, that's actually what we've been working on. We have been experimenting with a multi-cloud setup for Cassandra: half the cluster in Azure and half the cluster in another cloud, AWS, connected via a VPN connection, and then we bootstrapped our Cassandra cluster just to see how it works. The good news, the very good news, is that I can actually show this, because it did work. Let's see if I can show this. So here we have terminals.
Here we have SSH connections into two different virtual machines. One is running in Azure, the left one; the right one is running in AWS. Just to make sure you believe me, I can go to the metadata service, and this looks very much like an Azure VM running in our West US region. I can do the same over here, and you see this is truly an AWS VM. The two are connected. You can see the IP address over here, and I can say something like ping 72.50.131. Surprisingly enough, even though we're connecting an Azure data center to an AWS data center, we have a latency in the tens of milliseconds, so we feel quite confident it's good enough to run a distributed cluster. If I look here, I actually have Cassandra deployed; it's the same Helm chart that I have on my repo. I have a Cassandra pod running over here, and I also have Cassandra pods running over here. And when I check the status of my cluster, you'll see I do indeed have a two-data-center cluster, one half here in AWS and one half in our Azure data center. So now I have a single distributed Cassandra ring across two data centers. Let's see if this comes back. You may ask whether we did this with Kubernetes Federation. The answer is: not yet, because federation is still a little bit early in its life cycle, and we tried to do something that was stable enough for customers to experiment with. We actually are looking for customers that would like to work with us on setups like that. So if you're interested in exploring multi-cluster, multi-data-center, or multi-cloud configurations, or know somebody who is, we would absolutely love to work with you. And as the parting slide, so to say: everything we do is open, and everything we do is on GitHub.
So everything from the ingestion pipeline with Spark and Cassandra is yours to stand up, complete with the GraphQL-based front end. We have, of course, the multi-data-center and multi-cloud configurations documented, the cross-cloud networking documented, and everything else. We also own a site on Microsoft.com called the developer blog. If you go to microsoft.com/developerblog, you will find lots of information about other projects that we're doing: not just the humanitarian work, but other work we've done in the areas of artificial intelligence, computer vision, blockchain, and machine learning. Pretty much anything that's interesting and a little bit out there right now is stuff that we work on. So, yes, the virtual guide dog: using mixed reality as a virtual guide dog is another project that we think is very interesting and has a very nice humanitarian aspect to it. I think we have five more minutes for questions, so if you have any, please ask away. A question? But of course. Sure, sure. So we used the speech SDK that comes with cognitive services. We built a protocol on top of that where we were doing real-time speech to text, so pretty much the Cortana API, ultimately. We integrated that in real time within Spark; we built a whole Spark Streaming connector around it. The other services we used were text analytics for sentiment analysis, and then computer vision. What we did with that was, for Instagram posts, whether pictures or video, we would do object detection on what is in the video. You can imagine an ISIS flag or an AK-47: being able to tag that image as a piece of content that we would display in Fortis. Sure. So what we did was, we have a copy of OpenStreetMap, and we use Lucene to identify what the mentioned place is. Once we have a mentioned place, we then geotag that event to the shape of that place.
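That geotagging step, together with the zoom-level tile partitioning described next, can be sketched roughly like this (illustrative Python with a toy gazetteer and standard Web Mercator tile math; the real system uses Lucene over an OpenStreetMap index inside the Spark job):

```python
import math

# Toy gazetteer: a hypothetical stand-in for the Lucene-over-OpenStreetMap index.
GAZETTEER = {
    "benghazi": (32.12, 20.07),  # approximate (lat, lon)
    "tripoli": (32.89, 13.19),
}

def geotag(text):
    """Return (place, (lat, lon)) pairs for every gazetteer hit in the text."""
    tokens = text.lower().replace(",", " ").split()
    return [(t, GAZETTEER[t]) for t in tokens if t in GAZETTEER]

def tile_id(lat, lon, zoom):
    """Standard Web Mercator (slippy-map) tile coordinates for a point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return (zoom, x, y)

def tiles_for_event(lat, lon, min_zoom=8, max_zoom=18):
    """Partition one geotagged event across every zoom level's tile,
    so aggregates exist at both coarse and fine spatial granularity."""
    return [tile_id(lat, lon, z) for z in range(min_zoom, max_zoom + 1)]

place, (lat, lon) = geotag("Clashes reported near Benghazi this morning")[0]
tiles = tiles_for_event(lat, lon)  # one tile ID per zoom level, 8 through 18
```

Each (zoom, x, y) tile ID then becomes part of the Cassandra key, which is what makes both the coarse and fine-grained heat map queries cheap.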
So within our Spark job, we then partition that event to all those different tiles from, say, zoom level 8 to 18, so we're basically able to get a time-and-space representation of that event. We store it in Cassandra as a spatial dataset, basically keyed on a tile ID, and that's what drives the heat map. Yeah, so we were using stateful sets for our pods, for the worker nodes. As part of the Helm chart we set up, you can configure min/max replicas based on CPU limits. But with the Kubernetes-native Spark integration, that's already baked into the engine itself and its scheduler service: they look at all the tasks that have to be scheduled, look at what the Kubernetes load is, and figure out exactly whether they have to create any more worker pods. And according to the authors, they're trying to get that merged back upstream into Spark 2.3; I don't know when that's going to be out. Then Kubernetes will be natively integrated with Spark. Great, well, thank you all for attending.