I'm Parker Abercrombie and I'm a software engineer at NASA's Jet Propulsion Laboratory. Today I'm going to be talking about how NASA uses the cloud to enable a virtual reality Mars. The project I'm going to be talking about is called OnSight. OnSight was developed as a collaboration between JPL and Microsoft, and it allows scientists and engineers to work on Mars through the power of virtual reality. The way this works is the user puts on a virtual reality headset — we use the HoloLens device from Microsoft — and they see Mars around them in their office. They can walk around and explore the scene as if they were actually there. The reason we do this, beyond just the fun of it, is that you get a different sense of the nature and the scale of the Martian terrain by looking at it in an immersive 3D experience than you can get by looking at a 2D image on your computer screen and trying to reconstruct the 3D scene in your head.

Our users are scientists who work with the Curiosity Mars rover, and they're mostly geologists. When geologists study a place on Earth, they actually go out in the field, walk around, and look at the rocks. So we're trying to give them as close to that experience as possible for Mars, using the technology we have available today.

To make this experience possible, we needed to create this virtual reality scene of Mars. But the rover moves every day, and we wanted this tool to be useful operationally. So we needed not just to do this once; we needed a way to create these scenes easily and automatically as the rover moves and new imagery is downlinked. The OnSight team built a custom image processing pipeline that takes stereo images from the Curiosity rover and builds a 3D reconstruction of the terrain around the rover. And then we built a cloud architecture that lets us run this automatically, in the cloud, as soon as new data comes down.

So this is a cloud talk today. I'm not going to say much about virtual reality, and I'm not going to get into much detail about image processing. But I am going to talk about how we've used cloud computing and open source technology to run this process automatically when new imagery comes down and to push the results out to our users. By the end of my talk, I hope you'll have learned a little bit about how we've used the cloud to solve our problems, hopefully you'll get some ideas that might apply to your own problem domains, and I hope you'll be excited about space exploration.

The data we work with comes from this instrument: Curiosity. Curiosity is a rover on Mars. She landed in August of 2012, so she's been on Mars for about three years now. To give you a sense of scale, this thing is about the size of a small Jeep, so it's pretty big. Curiosity has a lot of different instruments on board. The ones we work with primarily are the stereo cameras on the rover's mast; you can actually see the two eyes of the stereo Mastcam there. As the rover drives, it takes these pictures and sends them back to Earth. We take them and process them into this.

What we're looking at here is a 3D reconstruction of a scene of Mars along Curiosity's drive. And I want to point out that everything you're seeing here is real. This is all real imagery sent back by the rover; there's no artistic retouching. And this is produced completely automatically, with no human in the loop.
In the front here you can actually see a little bit of the rover's shadow that was captured in the imagery, because the sun was behind the rover when that picture was taken. If anyone follows the Mars Science Laboratory mission, this mountain coming up in the background is Mount Sharp. That's Curiosity's final destination. Right now the rover is exploring the dark dunes that will be coming into the scene in just a second here. There's a band of dark dunes at the base of Mount Sharp, and that's where the rover is today.

So this is where we want to get: we want to take the stereo images and make a scene that looks like this. But before we do that, let's back up a step. This is Mars as seen from orbit. The Curiosity rover is exploring Gale Crater, which is the yellow dot here. I'll zoom in a little bit. The blue path here is Curiosity's traverse, so everywhere you see the blue path is somewhere the rover has driven, or stopped and taken images. What we're looking at here is Mars as seen from above, from an orbiter — that instrument is the Mars Reconnaissance Orbiter. It has a number of cameras and sensors on board that produce these orbital maps of Mars. This imagery is about a quarter meter per pixel, which is fairly good as orbital imagery goes, but if you're going to put it in virtual reality and stand on it, a quarter-meter pixel is pretty big.

So as Curiosity drives, it takes these images, and then we take those and build these 3D scenes. And we need to do this for basically everywhere along the path. Everywhere the rover stops and takes pictures, we consider a scene, and we'll build a reconstruction of that part of Mars. At some point we would like to link these all together into one super scene of all of Mars, but we're a little ways from that today.

I'm not going to get too deep into image processing in this talk, but I'll give you at least the high-level version of what we do. We take the stereo images and, using a stereo correlation algorithm, we derive a range from the camera position to each point on the terrain. From those ranges we can create a point cloud, and from that point cloud we can do a surface reconstruction that gives us a mesh geometry for the scene. Then we can take the images from the rover and paint those back on top of the mesh to get the fully textured mesh. You can see a little bit of that process here: the background is kind of hard to see on the projector, but you see only the wireframe mesh geometry of the scene, and then as you get closer in, you can see how the mesh and the texture start to interact, until you have the final product here next to the rover.

When we're doing this, there are a couple of different types of imagery that we need to combine, because we don't have high-res color imagery of all of Mars, unfortunately, especially in places where the rover is just arriving. The main types of images we work with are black-and-white images from the rover's navigation cameras, which give fairly low-resolution grayscale imagery, and color Mastcam images from the high-resolution science camera. Where we don't have either of those types of imagery, we fall back to orbital imagery from the Mars Reconnaissance Orbiter. So for every part of the mesh we'll have some type of imagery, and some is better than others. There's an element of sensor fusion in how to stitch these things together so that you're using the best possible data for each part of the mesh. Sir?
It's an in-house pipeline. Yeah. So there's an element of sensor fusion in combining these things into a good-looking final product. For the purposes of my talk today, I'm going to treat that as kind of a magic black box: images come in, magic happens, and the textured mesh comes out the other end. Once we have that textured mesh, we can load it into the OnSight software running on the HoloLens and look at it in virtual reality.

We've been running this pipeline for about a year now. We've processed several hundred scenes all along the rover's drive. The size of the scenes varies quite a bit depending on how much input imagery is available. In a place the rover has been exploring for a while, where we have a lot of images, we might have several thousand. In a place we've just arrived, we may only have a handful of images. A typical scene is about a thousand images, or about five gigabytes of data. Over the course of our processing we'll crunch that into a couple hundred megabytes of mesh, and that process takes a couple of hours running on a single node.

When we started this project, we developed the software just on our development workstations, and we did the simple thing you would expect: input files in one directory, output files in another. When we went to run it, we'd copy the appropriate input files to directory A, hit go, wait a couple of hours, then grab the results from directory B and put them where they actually needed to be. That worked pretty well for development, but it obviously doesn't scale when you move into operations. That's what led us to port this into the cloud.

At a high level, we have a batch processing problem. Data comes in; we need to detect when that data is available, pull it down, and run a long, resource-intensive task that produces some output; and we need to put that output somewhere our users can get to it. We want to be able to see what's happening as the system does all of this, start new builds, stop builds that are running, and see the status of what's currently running. And we have a bursty workload: we get downlink from Mars on a roughly daily basis. Downlink happens, we need to spring into action, grab the new images, and build the scene, and then we go back to idle until the next downlink. We don't want a lot of expensive computing resources sitting around, doing nothing, and costing money.

So we built a system that runs in Amazon's cloud to solve this problem, and here are some of the Amazon services and open source technologies we're using. We use the Jenkins continuous integration system, both to compile our image processing code and run tests, and also to actually run the image processing jobs, so we can treat them as if they were compilation jobs. We use Ansible to deploy and configure our Linux servers. We use the LoopBack Node.js framework to expose a data API on top of our database. And we use AngularJS and Bootstrap to create a dashboard view of the system. I'll go into a little more detail on how we're using all of these.

Here's the high-level schematic view of the system. Things really start on Mars, where Curiosity sends data down to Earth. It goes into the mission data system and is cataloged there, and that's where we pick it up.
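To make the shape of that pick-up step concrete, here's a minimal sketch, in Python, of the kind of poll-and-dispatch loop the build manager performs. This is illustrative, not our actual code: the queue URL is made up, and the catalog query is a stand-in for the mission data system's real interface.

```python
import json
import time

import boto3  # AWS SDK for Python

sqs = boto3.client("sqs")

# Illustrative queue URL, not our real one.
BUILD_REQUEST_QUEUE = "https://sqs.us-west-2.amazonaws.com/123456789012/build-requests"


def fetch_new_image_records():
    """Stand-in for querying the mission data system's catalog for images
    downlinked since we last checked; the real interface isn't public."""
    return []


def index_into_database(records):
    """Stand-in for recording new source images in the build database."""


while True:
    records = fetch_new_image_records()
    if records:
        index_into_database(records)
        # Ask the build cluster to (re)build the affected scene.
        sqs.send_message(
            QueueUrl=BUILD_REQUEST_QUEUE,
            MessageBody=json.dumps({"scene": records[0]["site"]}),
        )
    time.sleep(300)  # downlink is roughly daily, so a lazy poll is fine
```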
Now, our system falls into three main pieces. We have a build manager, whose job is to find out when new data is available, index that data into our database, and request that builds start. We have a build cluster, with a master node and a fleet of worker nodes that actually do the terrain builds. And we have a distribution system, which stores the results and pushes them out to our end users.

So the first thing that happens is data comes down from Mars to Earth. The build manager periodically polls for new data, and as soon as it finds some, it indexes that data into our database. Then it requests that the build cluster start a new build. The build master, Jenkins, finds a worker to run the build, creating a new one if necessary. That worker pulls the data it needs from the mission data system, runs the job, and pushes the results out to our S3 bucket. Then it notifies the build manager that a new build is completed, and our users, the next time they launch OnSight, will pull down the new data and see the latest location on Mars.

Here's how Amazon's web services fall out in our infrastructure. I imagine most people in this audience are familiar with these, but I'll very briefly introduce them in case anyone's not. In the middle column here, these are all EC2 instances — virtual computers running in the cloud. The build manager and Jenkins are medium-sized instances, and these are long-running; they're pretty much always available. The workers are the beefy machines that actually do most of the image processing, and these we treat as disposable resources: we create and destroy them all the time based on our workload. We use SQS, the Simple Queue Service, to communicate between the build cluster and the build manager, and within the build cluster itself; we have a couple of queues set up for different types of communication. We use RDS, Amazon's relational-database-as-a-service — kind of a veneer over a relational database — to store information about available source data and completed terrain builds. We use S3 to store the results of our builds. And CloudFront is Amazon's content distribution network: it takes the files we've put in S3 and pushes them out to data centers that are geographically closer to our end users, so they get faster download times.

This being a Linux conference, I'll also talk about operating systems. Our build manager and Jenkins run on Ubuntu Linux, and our worker nodes actually run on Windows, so it's a heterogeneous system with both Linux and Windows machines. To communicate between those machines, Jenkins helps us; Jenkins is pretty good at sending jobs out to the Windows machines. For other data, our usual solution is to use a simple queue if it's a small amount of data, or to push results up to S3 and pull them down on the other end. As long as it's not a terribly huge amount of data and it's all within Amazon's data center, it's easy and it works.

So I'm going to go through each part of the system in a little more detail, and I'll start with the build manager. The build manager's job is to discover when new data is available, orchestrate the terrain reconstruction jobs, and present a dashboard view of the system. It's also the interface between the rest of the system and the database, so it exposes a REST API.
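Before getting into that API, one aside on those disposable workers: the create-and-destroy decision boils down to logic like the following sketch, written here in Python with boto3. Our real implementation lives in Jenkins scripts, which I'll describe later, and the AMI ID is made up.

```python
import boto3

ec2 = boto3.resource("ec2")

WORKER_AMI = "ami-0123456789abcdef0"  # made-up ID for a pre-baked worker image


def rebalance(queue_depth, running_workers, idle_instance_ids):
    # One big build runs per node, so we want roughly as many workers
    # as there are builds waiting, within limits we control elsewhere.
    if queue_depth > running_workers:
        ec2.create_instances(
            ImageId=WORKER_AMI,
            InstanceType="g2.2xlarge",
            MinCount=1,
            MaxCount=1,
        )
    elif queue_depth == 0 and idle_instance_ids:
        # Nothing queued: shut idle workers down so they stop costing money.
        ec2.instances.filter(InstanceIds=idle_instance_ids).terminate()
```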
For that REST API, we use LoopBack, a Node.js framework, to build a REST layer on top of our database, and the build manager exposes it to the rest of the system, so nothing else in the system needs to know anything about the database. We also have a dashboard view that we built using AngularJS and Bootstrap, and I'll demo that now.

This is our build manager dashboard, created with AngularJS and Bootstrap. At the top we have some summary statistics about the number of scenes we've built and the number of builds of those scenes. I can see the status of the cluster: at the time I recorded this, there were two nodes running, three allocated, and one build in the queue that hadn't been allocated to a node yet. Moving down, this chart shows the trend of different builds over the past month; as I mouse over, I can see a little bit of information about each of those builds. The table on the side shows the status of the build queue. When I recorded this, there were two builds running — you probably can't read it, but this is how long they'd been running.

Moving down, I have a table of recent build failures. I'm happy to say there aren't any in the last two months. This table tells me a little bit about each of these builds. I can click through to view the log and see why a build failed, and we built a feature to acknowledge failures, so we can track whether a failure was something we had actually looked at and determined was not a problem, or fixed, or something that still needed attention.

It also shows me all of the builds that have completed in the system. Each row in the table represents a reconstruction of one place on Mars and gives me some information about that build: where it was centered, whether it succeeded. I can enable or disable a build, which means that if it looks like something broke, I'll disable it so our users don't see it. Fortunately that doesn't happen very often, but it's handy when it does. We generate preview products as part of the reconstruction process, so I can click into any of these and preview what was actually built and see how it looks without loading it into a 3D viewer. In addition to the panorama previews, we also generate fly-through movies that show a little bit of the 3D reconstruction. In these video previews it's just a camera that pans around; in the actual experience you can walk around.

The question, repeated for the recording, is what happens if you walk toward an area where you have no imagery. We have imagery everywhere, but it may be orbital imagery, so it'll just be low resolution.

To store the data about source images and completed builds, we use Amazon's Relational Database Service. The reason we use RDS is that we get automatic snapshots; we're already running in Amazon's cloud, and automatic snapshots that can be very easily restored are a nice thing to have. It doesn't really take any maintenance. As the back end, we use a MySQL database. We did consider a NoSQL solution when we were setting up the system and decided against it, because databases aren't really our problem in this application: the amount of data we're tracking is relatively small and is easily handled by a single traditional relational database.

On top of our database layer, we built a REST API, as I mentioned earlier, using LoopBack. LoopBack makes it easy to add a standard REST API and features on top of a data source, and it plugs into different database back ends.
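To give a feel for what that buys us, here's a hypothetical client-side sketch. The host, model, and field names are invented for illustration, but LoopBack really does expose this style of filterable REST endpoint over whatever database sits behind it.

```python
import json

import requests

# Invented host, model, and field names, for illustration only.
BASE = "https://build-manager.example.com/api"

# LoopBack-style filter: the ten most recently completed successful builds.
filter_spec = {"where": {"succeeded": True}, "order": "completedAt DESC", "limit": 10}
builds = requests.get(f"{BASE}/builds", params={"filter": json.dumps(filter_spec)}).json()
for build in builds:
    print(build["sceneId"], build["completedAt"])

# Disable a build that looks broken so users don't see it.
requests.patch(f"{BASE}/builds/42", json={"enabled": False})
```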
This layering gives us a little bit of database independence: if we decide to swap MySQL for a different technology later, the changes are isolated to this one layer. Other parts of our system don't have to query tables in MySQL; they can just use these REST endpoints to get and post terrain builds, or search with filters. LoopBack also lets you add logic in JavaScript as hooks that run when your tables are updated, so you can add business logic at that level.

Another Amazon service we use in the build manager is CloudWatch log management. This is a way of pushing logs from your servers up into Amazon's cloud so you can see and process them from the AWS console. I think this is nice: if something's going on with my server, I can log into the AWS console and look at the logs, instead of having to SSH into that box and then trying to remember where the log file is. Once the logs are in AWS, they can be filtered, they can be pushed to the Elasticsearch service if you want to do more in-depth analysis, and you can set up alarms based on keywords that appear in your logs, such as errors. This has worked out pretty well for us. It's easy to set up on Linux; it's just a service that you install, and some scripts from Amazon. I know it can be done on Windows too, but I've not done it myself on Windows, so I can't speak to that.

The next piece of the system is the build cluster. The build cluster has a master node running Jenkins. Jenkins is a continuous integration system, similar to tools such as Bamboo and CruiseControl. The typical use case for Jenkins is to have it monitor your source control repo, and when code is committed, Jenkins checks it out, builds it, and makes the results of that build available. It turns out that compiling code is actually not that much different from reconstructing Mars terrain: you have some input files that you run some executable on, and it produces some output that you need to put somewhere. So we use Jenkins both to compile our code for continuous integration and to actually run the image processing jobs. What Jenkins gives us is the ability to manage a fleet of worker nodes and configure scripts that run on those nodes; Jenkins then handles keeping track of the build queue and parceling the work out to the nodes in the cluster.

This is the Jenkins interface. If you've used Jenkins before, I think this will look very familiar. Over on the side we have a view of the build queue; at the time I took this, there was one staging build in the queue. Down below we have the status of the build cluster. The master node was running a job called manage-nodes, which I'll talk about in a second. We have a couple of nodes offline, and that third one is running, I think, a production build; there are a couple more down below. The table on the right shows the job descriptions in Jenkins. Each of these is basically a script that does a certain thing. The top one, build-scene-staging, does a staging build for a certain scene. There's one that builds a production scene, one that builds the preview products, and a couple of other miscellaneous ones. We also have Jenkins jobs configured that can create and destroy instances in EC2 and manage the fleet of worker nodes. If I want to see information about a particular build in Jenkins, I can click through to see details on that build, and I can view the console output, either after the build is completed or while it's running, if I want to see what stage of the process it's at.
Jenkins tells me that this build was started 59 minutes ago, that it's been executing on this host, and it gives me an estimate of when the build will finish. It also tells me what version of the code this was running: this was the production code at that particular Git hash.

So Jenkins out of the box gives you the ability to keep track of worker nodes and parcel work out to those nodes. What it doesn't do is dynamically create and destroy cloud instances based on workload, so we extended Jenkins with some custom scripts to make that happen. The way we do this is we have a periodic task called manage-nodes that runs every couple of minutes on the Jenkins master. It looks at the size of the work queue and the size of the build cluster, and if those two are too far out of whack, it tries to equalize them. If it sees that there's more work in the queue than there are nodes available to do it, it spins up a new node and adds it to the cluster. If the work queue is empty and there are a bunch of idle resources sitting around, it shuts those nodes down. So the system scales, within parameters that we control, to the amount of work actually needed at the time.

There's a little Jenkins-specific trick that we use in this system. Our main job is a large, monolithic image processing pipeline that takes a lot of resources and runs on beefy computers, so we only want to run one instance of this task on a node at a time; we don't want the Jenkins master to ever try to run two of these on the same node at once. But we do find it convenient in our Jenkins scripts to call out to sub-jobs that don't take a lot of resources. The way we set this up so the Jenkins master does what we want is with Jenkins' concept of execution slots on each node: you can allocate more slots to nodes that are more powerful. We set all of our nodes to have seven execution slots and reserve three slots for the lighter-weight jobs that the main one calls out to, so the Jenkins master never tries to allocate two big jobs to the same node.

To summarize our use of Jenkins: we have periodic tasks that manage the scale of the worker cluster, bringing nodes online and offline depending on how much work is actually happening at the time. We use Groovy scripts to automate Jenkins, and we also call out to some command-line scripts that use the Jenkins REST APIs to view the size of the build cluster and the status of the system, and to submit jobs. And we use tags to mark different types of nodes: we maintain separate development and production environments, and we tag some of our workers as reserved for production and others as reserved for development, so we know there will always be production nodes available when we need them.

The next part of the system is the worker nodes themselves. These are the nodes that actually run the image processing code, and they're a bit of a different beast, because they're GPU-enabled EC2 instances running Windows Server 2012. Our image processing pipeline was implemented in .NET and runs on Windows, so when we made the move to the cloud, we chose to keep it in that environment. The way we manage these is we take a machine image for Windows Server 2012 and install all of the dependent software we need onto that image.
This is not our image processing code itself, but all the dependencies of that code: the right version of .NET, the right version of the Jenkins JARs, the right versions of the different image processing tools we call out to, and a handful of other things. Then we bake that into our own AMI, and as our needs change, we revision and update it. We expect that to happen infrequently, because it's kind of a pain to re-bake, so we try to only put things on this machine image that we expect to rev fairly infrequently.

These are GPU-enabled instances, and in Amazon's cloud there are two offerings in the GPU line, both in the G2 family. The g2.2xlarge has one GPU, eight virtual CPUs, and 15 gigs of RAM, and the g2.8xlarge is basically four of those put together — four GPUs, 32 virtual CPUs, and 60 gigs of RAM — really kind of a beast of a machine. We use both of these for different things.

The life cycle of a worker is: it comes online, registers itself with the Jenkins master, and waits around until new work comes in. When work is available, the node first pulls our code from Git and builds the version of the pipeline it's going to run. Then it pulls the source data it needs from the mission data system, runs the image processing pipeline, and, assuming everything goes successfully, pushes the results to the OnSight S3 bucket and notifies the build manager that a new build is available. At that point it goes back to waiting for more work, and the Jenkins master will either give it another job or shut it down if there's none.

In setting up this system, there were a couple of things we learned. One thing we like to do is keep the workers as dumb as possible. We create and destroy these things all the time, and we have to install a lot of software on them, which is a manual process. So we bake as much as we can — anything we expect not to change frequently — into the Amazon machine image, and we keep the rest of our code in Git or other places the worker can pull from. The idea is to keep the workers in a state where they can be brought up very quickly and pull down the resources they need. Baking things into the AMI is necessary for us to get the workers to spin up quickly, but it's a little bit of a pain to manage. The other thing we found with these machines is that using the GPU in the cloud can be a little troublesome. We had to fiddle around with it quite a bit to make it work: we would write code that ran fine on our desktops and then didn't work in the cloud. We did get it to work, though, so it does work.

There's another Amazon offering that we use. EC2 has two types of instances. On-demand instances are the normal ones: you pay by the hour at a fixed rate. But there's also a spot instance market. Spot instances are a little different: you bid what you're willing to pay, and if there's excess capacity in the system and no one outbids you, you get your instance at that price. Which is great, because you can get instances at much less than market value. But there's a catch, of course, and the catch is that your instance can be terminated at any time. Actually, for a lot of work that's okay. Obviously we don't use this for anything time-sensitive or production-critical, but for a lot of our work it's all right.
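Requesting spot capacity is a single API call. Here's a sketch with boto3, where the bid price and AMI ID are made up for illustration:

```python
import boto3

ec2 = boto3.client("ec2")

# Bid near the on-demand price of the *smaller* instance type; the price
# and AMI ID here are illustrative.
response = ec2.request_spot_instances(
    SpotPrice="0.65",  # our maximum bid, in dollars per hour
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # the pre-baked worker image
        "InstanceType": "g2.8xlarge",        # four GPUs, hopefully at a discount
    },
)
print(response["SpotInstanceRequests"][0]["State"])
```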
When downlink arrives and we need to build the new scene for the place the rover just arrived, as quickly as possible, spot instances are a bad fit. But when we get new imagery for where the rover was a couple of days ago, or when we're running builds in our staging or development environments, those aren't really that time-critical, and if one of the instances gets terminated, we'll just try again in a couple of hours.

The question is whether, if a spot instance is terminated, we can resume from the middle or have to start over from the beginning. The answer is that you can restart from the middle if you program it that way; in our case we chose to keep things simple and just restart from the beginning. In practice our instances don't get shut down that often — having said that, there were actually two that were killed in the last day, but other than that, it's rare. What we use this for is bidding on the g2.8xlarge instances at the price of the g2.2xlarge instances. Usually we're able to get four times the processing power for the same price as the small instances, so spot instances have worked out very well for us.

Now I'll get into storage and distribution. This is by far the simplest part of our system, and it's really as simple as: we put our builds into S3, then we use CloudFront to push those results out to data centers geographically closer to our end users. We have users at institutions all around the country; at the moment we're restricted to North America. When the OnSight software on the HoloLens needs to load terrain, it simply makes HTTP calls to CloudFront URLs, and those are secured with signed cookies.

I'll talk a little now about how we deploy our system. We use Ansible. This is an IT automation tool many people are probably familiar with; it's similar to tools such as Chef and SaltStack. What Ansible gives us is the ability to capture the configuration of our Linux machines in code that we can check into our version control system. The goal of this infrastructure is for things to be set up so that I never have to SSH into my servers: I'd like to be able to create them, destroy them, and deploy them automatically from scripts in source control, and never SSH in. I'd be exaggerating if I said that's always true — sometimes we cheat and SSH in and manually configure things — but that's the goal, and Ansible gets us a lot closer to it.

Ansible organizes things into playbooks. It's a very powerful tool, and actually there's a session in this room later today that I think is about deploying web apps with Ansible, so if anyone's interested, check that out. I'll show a couple of snippets from the playbook we use to deploy our Jenkins instance. These are written in YAML. The first step is to create a new user named jenkins, and we run that command with sudo. Then we need to ensure that the user's .ssh directory exists with the proper permissions, so we create that directory, again using sudo, with the proper mode. And then we need to copy some configuration files up, so we point Ansible at the config file template, where we want to put it on the remote node, and the mode we want. Ansible provides pretty good tools for templating these config files and swapping out variables, so we have one set of variables we deploy for production and another for staging and development, and we separate those environments with these variable substitutions.
When it comes time to actually deploy a new one of these instances, it's as simple as creating a new EC2 instance, which we can do either through the AWS console or through an Ansible playbook, and then pointing Ansible at that instance with the -i flag, along with the playbook to run and the environment. The final argument is the vault password file. The way Ansible deals with secret keys and credentials that need to be deployed to the remote machine is that it stores them in a file called the vault, which is encrypted and stored with your project — because if you check your project into source control, you don't want unencrypted passwords and keys in there. When you run Ansible, you provide the password to unlock the vault so that it can pull out the pieces it needs.

One time when I gave this talk at JPL, almost immediately after my talk we accidentally terminated our build manager instance; it was complete user error on our part. So I can speak from experience that it really is as simple as these two commands to recreate that instance. On that occasion we also learned about EC2 termination protection, which is a good thing to turn on for your critical nodes.

So we use Ansible to configure all of our Linux machines. We're not yet using it for Windows; I'd like to be in the near future, but we haven't had time to set that up. For our worker images, we manually configure a machine image that serves as the base of our worker node. I'd like to build those with Ansible instead of doing it manually, because any manual process is error-prone, especially if I'm doing it. So my goal for the near future is to use Ansible to automatically provision everything that needs to go onto an Amazon machine image and then bake that into the base machine image.

To summarize: things start on Mars, where the Curiosity rover captures new imagery, and that imagery is downlinked to the mission data system on Earth. The build manager periodically polls that system for new data. When new data is available, it's indexed into a MySQL database running on Amazon's Relational Database Service, and the build manager requests that the Jenkins master allocate a node to do the work of running that terrain reconstruction. Jenkins finds a node to do the work, spinning up a new one if necessary. That node pulls the source files it needs from the mission data system, runs the terrain reconstruction, pushes the results out to the OnSight S3 bucket, and notifies the build manager that a new build is available.

The question is about what we can use for a terrain build; I'm afraid I don't know off the top of my head. That's a good question. The way we run the pipeline right now is on a single node at a time: one job runs on one node. The reason we've done that is we chose to optimize vertically before scaling out horizontally. For this phase of the project we had time targets for a build — about three hours — which we were able to hit by vertically optimizing and pushing some parts of the pipeline onto the GPU. So to keep things simple, we've kept it running on a single node for the moment. That said, I do not expect that to be the case going forward. That's correct. Yeah, you're welcome.
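To pull that summary together in one place: the job each worker runs is shaped roughly like the sketch below. Everything here — the repository URL, executable name, bucket, and queue — is invented for illustration, and the real pipeline is a .NET application orchestrated by Jenkins, not a Python script.

```python
import json
import subprocess

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

SCENE = "site-57"                  # illustrative scene ID
RESULT_BUCKET = "onsight-terrain"  # illustrative bucket name
NOTIFY_QUEUE = "https://sqs.us-west-2.amazonaws.com/123456789012/build-events"

# 1. Get the right version of the pipeline code.
subprocess.run(["git", "clone", "--branch", "production",
                "https://example.com/onsight/pipeline.git", "pipeline"], check=True)

# 2. Pull source images from the mission data system (interface not shown),
#    then run the reconstruction. This is the multi-hour, GPU-heavy step.
subprocess.run(["pipeline/build_scene.exe", "--scene", SCENE,
                "--out", "results/mesh.dat"], check=True)

# 3. Push the finished mesh to S3 and tell the build manager about it.
key = f"scenes/{SCENE}/mesh.dat"
s3.upload_file("results/mesh.dat", RESULT_BUCKET, key)
sqs.send_message(QueueUrl=NOTIFY_QUEUE,
                 MessageBody=json.dumps({"scene": SCENE, "key": key}))
```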
Okay, so once the results are in the build manager's database and in the OnSight S3 bucket, the next time our users launch the OnSight software running on the HoloLens, they'll pull down the terrain and see the latest place on Mars, next to where the rover currently is.

What I've described here is very much a snapshot of the system as we've built it and as it's running now. I'm not going to say it's the best possible solution to this problem, and we do expect to improve it going forward. One thing we'd like to work more on is improving our auto-scaling ability. I described how we use Jenkins to manage the size of our worker cluster and scale it up and down based on the size of the queue. It may occur to you that most cloud providers offer similar features that create instances based on some kind of workload metric. We made the conscious decision not to use the built-in scaling features from a cloud provider, because when we were designing the system we wanted to maintain some level of cloud independence, so we ended up rolling some of this ourselves. It's worked out pretty well for us, but having gone through it, I think we would actually rather not be writing and maintaining that code ourselves. So in the future we'll probably be looking at whether we can swap out some of our auto-scaling logic for something provided by a cloud service provider.

As I mentioned before, I want to start using Ansible to manage the Windows worker AMIs more closely; that's one of the fiddly, manual bits of the deployment process right now. And to the question here: right now the image processing pipeline is this monolithic thing that goes all the way from stereo images from the rover to a 3D terrain reconstruction. Obviously there's a lot that happens between those two endpoints, and we'd like to split it up into more modular services that do different things and can be scaled out more easily, because going forward we'll need to beat the three-hour build time by a lot, looking toward 2020, when the Mars 2020 rover will land. So I do expect a lot more of the work to be smaller services that are scaled out horizontally, probably all still using the GPU, and then we'll need to look into how to distribute the source images to the particular workers that need them. Right now we're in a kind of easy parallelization world, where each worker can just pull everything it needs and it works out fine.

I'd like to acknowledge the great work of all of my colleagues on the OnSight team and our partners at Microsoft. I hope you've enjoyed this talk and learned something. If you have any questions for me during the conference, feel free to grab me or say hi in the hallway. If you have questions afterward, feel free to email me at parker.abercrombie@jpl.nasa.gov, and if you'd like more information about the OnSight project, please see our website, opslab.jpl.nasa.gov. I'd love to hear what you liked and what you didn't like via the Google form at the bottom. With that, we have some time for questions. Sir?

Mm-hmm. Okay. So the question is why we're using Windows for the worker instances. To be honest, that decision was made before I joined the project, so I don't have a completely historically valid answer for you. In general, though, at the time the decision was made, they felt that was what fit the problem best. Sir?
So the question is why we can't just throw more workers at it to reduce the three-hour window, and yes, we can, and that's probably what we will do. We'll need to rewrite parts of our pipeline to run in parallel on multiple workers. Up until now, by design, we run on a single machine, and we've focused on running as fast as possible on that machine. We do parallelize extensively across cores on one machine, but we made the intentional choice not to scale out to multiple machines yet. Going forward, I think we'll do exactly what you have in mind. There's one up here, and then...

Okay, so the first question was how CloudFront scales — I believe you mean geographically. I'd recommend looking at the CloudFront documentation on Amazon for its worldwide reach; within the continental US, which is our region of interest, it has very adequate coverage. And the second question was what kind of GPU problems we ran into. The challenge was that on our Windows machines, for some reason, the GPUs sometimes weren't recognized, possibly due to misconfiguration on our part; I'm not 100% sure. I'm actually talking to some engineers at Amazon next week about exactly what was happening. There was a question back here.

Yes, we are. To be honest, I haven't looked into this in a lot of detail. One open source tool that caught my attention recently — which I have not evaluated, so I can't really speak to it — is Spinnaker, an open source project from Netflix. I saw an interesting presentation by some of their engineers last week, and we'll be looking more into that.

Okay, the question is whether there's anything the open source community — OpenCV and other image processing libraries like that — can do to improve our situation. That's a deep question, and the short answer is yes. My hope would actually be that some of the innovations we've made in this pipeline can be pushed out into the open source community. I don't have enough of the details in my head right now to speak to that in depth; I'd be happy to chat with you offline. Now, in the back.

For archiving, our strategy at the moment is to keep things in Amazon's data systems. Archival hasn't been a big focus of ours yet, but moving forward we'll probably want to sync up with other parts of the Mars mission and leverage whatever technologies they're using. I'm afraid I don't know off the top of my head how they do it. In terms of... I'm afraid I can't answer that.

The question is whether each of the scenes we build is composed entirely of — I think you mean — new imagery from the most recent downlink, as opposed to older imagery from earlier in the mission. The answer is the latter. What we do for a scene is find all of the images taken in that region of Mars. That includes what was most recently downlinked and anything that might have been downlinked in the past. So in theory, the rover could drive in a circle, and you would pick up data from much later in the mission; this has actually happened a couple of times. Deciding which pieces to use — I showed a slide earlier with color imagery and black-and-white imagery next to each other, coming from different sources — is actually a very deep question. If you're doing a mesh reconstruction, you have a set of images, and for any given part of that mesh, multiple images saw that place, so which one are you going to use? We have some heuristics that optimize this for our use case.
There are a lot of different ways you can do this. To give you a couple of examples: we kind of pretend that Mars is a static thing where nothing ever changes, and this is not true — this is a lie. The most obvious thing that changes it is this huge Jeep-sized rover driving over it. You even see, in some of our scenes, rover tracks appear and then disappear, because some images were captured before the rover drove through and some afterward. But you could imagine using your heuristics for which imagery to choose to skew the reconstruction toward the state before the rover drove through, or after, depending on what you're interested in, or any number of other things.

In this application, no, we did not have such a requirement. In other parts of the Mars Science Laboratory data system there are images available to the public that can be searched by various criteria; I can't speak to exactly which technologies they're using. In this application, to finish on your point, we limited the metadata we track to really just the bare minimum of things we need to know about, and thus far the searching has been within the scale we can handle in a relational database. I think there was a question over here.

The question is whether the final data product is available to the public. The answer right now is no, but there are some thoughts in that area that unfortunately I can't share right now. Stay tuned.

Not that I know of; when the decision was made, Amazon was what we chose. Yes, I think that is possible. I think the more interesting question is whether you can send the data from the data system directly to the HoloLens. Getting the data from Mars to Earth is actually a whole other question, and maybe not surprisingly, the bandwidth between Mars and Earth is kind of limited. Yes, sir.

So the question is how soon our users want the data, I think, to paraphrase. And the answer is: now. It's actually a very subtle question you're asking. The way Mars planning works is that there are certain planning meetings that are time-sensitive for planning rover operations, and the way those meetings line up with how data comes down from Mars is very complicated. Ideally, our terrain would always be ready at the beginning of those meetings. That's not always the case. In some cases we may get downlink in the middle of the night, with plenty of time to do the work, in which case no one is calling us. In some cases we may get downlink half an hour before the terrain is supposed to be ready, in which case it won't be ready. So the answer is that the users would like the data as fast as possible, and we're trying to meet that need. Skylar?

The question is whether there's a way to view the data without using the HoloLens. In OnSight today, no, it is a HoloLens-only application. However, I don't expect it to stay that way forever — in fact, not for long. Sir?

That's what we aimed to make it. That's a great question, and it really speaks to why we even did this: we have these 2D images already, so why bother with a 3D reconstruction? The reason is that I think there is something perceptually different about looking at images in 3D versus 2D. Ideally you would send a scientist to Mars and they would look at real rocks for real. We can't do that. The next best thing is to put a scientist in virtual reality, and with the HoloLens you're not tied to one place: you're tracked, so you can move around the scene and walk around it as if you were there.
You can use some of the spatial cues that you use in your everyday life on Earth to understand the scene around you. At the end of the day, geologists are looking at rocks. They're trying to understand how big the rocks are and how they're laid out in relation to the scene around them, and that's very hard to do with a 2D image. There was some work before this project started — it motivated the whole thing — that tried to quantify how well people understand a scene they see in a panoramic image versus a scene they see in virtual reality. I'm not going to go into it today because I don't have the numbers in front of me, but the punchline is that people are dramatically more accurate in understanding a scene when they see it immersively than when they look at it in 2D — even experts who do this in 2D all the time. That's the whole motivation for building this application that lets you look at Mars immersively. Skylar?

The question, I think, is whether, if you walked from one scene to another along the rover's path, you would transition from orbital imagery into the new scene. That's something we would like to support at some point, and unfortunately we don't today.

Well, conveniently, we have a rover on Mars. The question is why we chose Mars as opposed to other planets — sorry to be a little bit snarky there. We have much better data for the surface of Mars than we have for other planets. If you think back to the orbital maps I showed at the beginning of my talk, there's a lot you can see from orbit, and Mars is absolutely beautiful seen from orbit, as are other planets and moons. But for a virtual reality experience where you're trying to put someone on the surface, orbital data just doesn't get you there; it's not high enough resolution. Where we have really good surface data is Mars, where we have robots that collect it for us. At the moment we don't have such data for other places. Earth — yeah, okay, Earth is a counterexample. I'm not sure who was first. Ma'am?

The question is when, in the process of planning the Mars mission, this idea of creating a virtual reality tool came up. I'm not entirely sure, to be honest. I've been at JPL for a year and a half, so I was not at JPL when Curiosity landed. I know that ideas about looking at Mars in virtual reality have been around for years and years. This particular project has been going on for about two years, and this group has been doing work in this area for much longer than that. But when the germ of the idea that is now OnSight was born, I'm afraid I don't know.

The question is whether there's one for the Moon. We are not doing one for the Moon today; I don't know if we will or not. I think you were first, sir.

Sure, so the question is whether there's a reason we chose an augmented reality technology such as the HoloLens over a virtual reality technology such as the Oculus. One thing we get from augmented reality is that we wanted our users to be able to continue using the tools they're used to while they're in the virtual world. I didn't really go into OnSight's user interface, but we actually detect where your computer screen is and then cut that out from Mars: you see Mars everywhere else, and then you see your computer screen. The reason we do that is so you can keep using the tools you usually use to work on data from Mars while you're looking at Mars in virtual reality.
So the question is how accurately the spatial details are maintained in going from 2D images to a 3D reconstruction. This is a deep question, so I'm not going to get too far into it. We aim for centimeter-ish accuracy in our mesh. There are a lot of ways that noise can come into the system: the source images give us range products just by the nature of how stereo correlation works, and the quality of that range falls off as you get further from the rover. Fortunately, the things we're interested in are usually near the rover, so that's not too much of a problem for us. One of our next steps, actually, is going to be more quantitative analysis of our mesh against the input images.

The question is how specific this technology is to Curiosity, and whether it's possible to process data from other missions — I assume you mean Spirit or Opportunity or other Mars rovers. In general, this is not specific to Curiosity; it can be applied to any kind of stereo reconstruction process. In practice, we are currently a little overfit to Curiosity. We have talked about supporting other Mars missions; right now we don't have any plans to do so, because it would take a bit of development effort to make it happen.

The question is whether we use location data about where the rover is to inform the image stitching process, and the answer is yes, we do. I think that's a little more detail than I want to get into today; I'd be happy to chat with you offline. Okay, maybe one more question. Well, we'll see. Okay, well, thank you all very much.