I don't really know how to say this, but I'm going through a little bit of personal stuff right now. I'm normally a very upbeat and positive speaker, and it's going to be a little hard for me today, so I'm going to give this my all. There are a ton of really amazing sessions happening at SCALE, so if you don't want to listen to somebody who can't give 100% today I'd encourage you to go see them, but I really am going to do my best. I was told that I need to use the microphone regardless of whether or not you can hear me, so that it can be recorded. So hopefully — yeah, here we go. That's not the right resolution. Womp womp. Okay, here we go.

All right, so: I read academic papers, and then I steal their research and turn it into Twitter bots with emojis so that millennials can play with it. I am not a particularly good software engineer, I'm not a particularly good machine learning person, and for some reason Amazon gives me the ability to go out and talk to people and show off cool stuff like this. So I'm going to present a lot of research that other people did; all I've done is add a little blip on top of it. I have not done any of the fundamental math, but if you ask me about the fundamental math, it's really, really cool — they do awesome stuff with fractal geometry and all kinds of other things. I'm a software engineer at Amazon Web Services, and the original work this is based on was by Jaeyoung Choi and Kevin Lee of the International Computer Science Institute and UC Berkeley.

One thing I'll say is that there are lots of ways to build models with all kinds of different datasets — you can even create your own. One thing I think you should take from this talk is that by using open datasets, and actually publishing open datasets, you can advance a lot of model building and a lot of feature engineering that you normally can't do.

I build a lot of demos and I write a lot of blogs. Prior to working at AWS I worked at this very small company called 10gen, which became MongoDB and is now publicly traded, and I would encourage you to perhaps purchase those shares. I think Python is a beautiful language — whitespace at the limit is perhaps not the best choice in the world — and JavaScript is essentially a mass psychosis. That said, you know what Stockholm syndrome is? Stockholm syndrome is when you fall in love with the person who takes you hostage. That's kind of how I feel about JavaScript: it's the devil you have to love. And this is my official badge photo. They asked me if I wanted it retaken and I said nope, I think it sums it up perfectly.

So this is WhereML, if you want to try it out. This is a model we built on an open dataset called the Multimedia Commons dataset. We basically took all of the images from Flickr, and all the videos and all this other data, and we created this Creative Commons open dataset that I thought was pretty cool. You can download it, you can use it on S3 — I think it's 18 or so terabytes total.
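If you want to poke at that dataset yourself, here is a minimal sketch of listing it anonymously with boto3 — the bucket name is an assumption based on the Registry of Open Data listing, so double-check it before relying on it:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Public open-data buckets can be read without AWS credentials.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="multimedia-commons", MaxKeys=10)  # assumed bucket name
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```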
But if you want to see what this looks like in action: you tweet the bot a picture — that picture can be any set of arbitrary pixels — and we'll very confidently tell you exactly where that image is, 16% of the time. So this is a picture of Seattle from an airplane; you'll see I'm posting it, and what's coming up is, hopefully, Seattle. There's all kinds of cool stuff we do with reverse geocoding and emojis and the Google Maps Static API. The emojis are mostly for the millennials like me, but they're also really cool, and Unicode is a weird standard.

So what about machine learning — how does that play into everything? How does open data play into everything? Well, with machine learning at AWS we have a pretty long history of building stuff. Has anybody ever bought a TV on Amazon, and then did you get told to buy 10 more TVs? So obviously we're great at this, right? We know exactly how to do machine learning. No. We do have a pretty long history of machine learning, though. One of the cool things I recently worked on was this idea of predictive auto scaling. If you think about the way you traditionally do auto scaling, you do a Fourier transform of the periodic elements of the scaling graph, you extract out those periods, and then you say, okay, cool, now I know what times I need to scale at. But that doesn't work when you have a weird decline in growth over longer periods, or a constant 5% increase — you have to use machine learning, some form of prediction, to find out what the next step in that time series is going to look like. We've now enabled that by default on any auto scaling group you want in AWS, which I thought was pretty cool.

We also fly drones. We use Alexa. Alexa is that device that listens to everything you say so that we can sell you advertising — sorry, that's the Google Home. Alexa is the device that maps phonemes — phonemes, if you're a Noam Chomsky fan, are like the fundamental components of human language — it maps these phonemes into words, those words into phrases, and those phrases into intents. So the intent is "Alexa, tell me what time it is," or something more exciting than that: "Alexa, buy me Long Island City, New York," or "Alexa, buy Whole Foods." That intent gets mapped into an action that can be undertaken by a piece of code. So that's an example of machine learning.

And then the Amazon Go stores — anybody ever been there? In Seattle we have this thing called the Amazon Go store, and what you do is walk in, steal a bunch of stuff, and then it gets charged to your Amazon account later. It's an eventually consistent shopping cart. This is what it looks like. Oh — how do I do this? There's no reason for me to show you this video; it has nothing to do with the talk, I just find it very amusing. Dude walks in, buys the stuff, leaves. There is one terrible mistake that a person makes in this video — I'll outline it for you. Here we go. Okay, so she walks in, she scans her code... did it... come on... here we go. So if you're ever in a store and you pick up a cupcake, under no circumstances should you ever put that cupcake back. That was a conscious choice. Yeah, there you go.
She did it right — always take the cupcake. So this uses a couple of really cool things: sensor fusion, RFID, and Rekognition Video, which parses video files and can tell you, for instance, where different people are in a scene. It tells me very confidently that I'm a celebrity. It told me, when I broke my leg a few years ago and was trying to learn to walk again, that I was hobbling, not walking, which I found personally offensive, and I filed a Sev 2 ticket with the team that runs that service. But very interesting things and very cool sets of services are available to you on AWS. That's kind of the extent of my vendor pitch — well, there's one other service I'm going to talk about, called SageMaker, which I'm absolutely in love with. Even though I work at AWS and have to be in love with it, I would be in love with it and would use it even if I didn't work there, and I don't say that about every service.

Our mission with machine learning at AWS is to take tools that were previously in the hands of these Fortune 10 companies with billions of dollars, or these ivory towers like Caltech with tons of resources to throw at these problems, and put those same tools in the hands of as many people as possible. And if you don't believe that, go read Nature — there's this really cool article about a hobbyist mathematician who found a novel graph coloring result with an EC2 instance and ten minutes of Matplotlib — or not Matplotlib, MATLAB. When you bring a diversity of perspectives, and a diversity of data, to bear on complex issues, you get multiple shots at solving a problem in a unique and novel way, and I have proof of that today.

So the way we think about machine learning, we split it into four tiers. At the bottom tier — most people never touch this — are the instances. These are things like Tesla V100 GPUs. These can play an awesome amount of Minecraft, let me just tell you; that's teraflops of compute available to you. You have the machine learning AMIs, primarily the Ubuntu deep learning AMI — a deep learning AMI that has pre-optimized frameworks like TensorFlow and PyTorch and Keras and all that good stuff. And then Greengrass ML for running inferences on the edge, so you can essentially run Lambda functions deployed on IoT devices — I thought that was pretty cool — and those Lambda functions are machine learning inferences. There's a really cool component here about optimizing those models for those edge devices. Think about how different instruction sets work: you have things like the AVX-512 instructions, you have ARM processors that have a much smaller instruction set than x86, you have different memory access patterns for all the different buses you're going through and different ways of talking to all of this stuff — and most of that is abstracted from you.
However, when you're trying to optimize for inference — most people try to optimize for training, which is not the right approach; I could spend a lot of time talking about why — optimizing for inference is where you're going to spend the bulk of your time. That's where you're going to spend the bulk of your money and your compute: actually serving inferences, serving the predictions that you trained this model for. So why not optimize that model to be as performant as possible on as many different platforms as possible? And if all you need to do is change the topology or the input vectors a little bit so that different instruction sets can take advantage of different sizes of buffers and things, that's a trivial problem to solve, especially if you know what instructions are available — that's essentially what compilers do: they say, let me optimize for this particular target. So we open sourced something called SageMaker Neo, which will take any machine learning model in a common format — TensorFlow, MXNet, or any ONNX-based model — and optimize it for any end platform, whether that's an m5.xlarge or some particular instance type, or an Intel Atom, or a microcontroller. I think that is pretty cool. I'm really, really proud of that work and it's still ongoing; you can go check it out on GitHub and see it yourself.

The second tier is that we want to be the best place to run as many of these frameworks as possible. Don't worry about Gluon — nobody really releases on it anymore — but everything else we want to optimize and build for as much as we can. One of the things we do is try to contribute back to these frameworks. That means adding in tools like Elastic Inference. You don't always need a full GPU to serve inferences — you might need a full GPU to train, but to serve inferences you might only need a portion of a GPU — and rather than over-provisioning with one of these servers... oops, sorry, I don't pay my AWS bill, so I just over-provisioned. But if you did pay your AWS bill, which you should, you might not want to provision one of these p3.16xlarges, because I think they're six dollars an hour or something like that. What you can do instead is use something called Elastic Inference, where you get a network-attached GPU and can provision between 1 and 32 teraflops to serve your inferences. This works really, really well for image-based data that's not hella big, for some bound of hella big. I don't know where that bound starts exactly, and I definitely don't know where it ends, but more than ResNet, less than an HD photograph — somewhere in that range Elastic Inference can do a lot better.

Does anybody still use Hadoop? Yeah, it's dope, right? I think with EMR people just kind of forgot about it and don't really use it anymore, because everybody wants to use the newfangled fancy Lambda serverless transform whatever. But I like writing Java from time to time — just don't tell anyone. And then you have things like SageMaker Ground Truth and Mechanical Turk, which allow you to hire an army of interns to go and label your data for you. Interesting stuff there — it actually trains an automatic classifier to go out and continue labeling that data.
Then, at the top level — this is where I see most of the customers I work with spending their time — you have things like Lex, which is a chatbot service that lets you talk to your friends and family, or not. You have Polly, which lets you synthesize speech. It can do this pretty well, actually — really interesting stuff. One thing I'll advise if you're ever using Polly: has anybody seen It, the movie based on the Stephen King novel? You remember "we all float down here"? There's a voice that sounds exactly like that, and if you use it I will do everything in my power to get your AWS account closed, because that is terrifying. It is the most terrifying voice I have ever heard in my life, and it's synthesized. Imagine if Clippy had that voice. Anyway — then you have Rekognition Image and Video, which I already talked about; Transcribe, which transcribes things and does speaker diarization, which is kind of cool; and Translate, which does neural translation between languages. Does anybody speak German? There's a famously long German word, Donaudampfschiffskapitän, which is a riverboat captain on the Danube, and typically translating compound German words is a difficult thing in translation, but when you use neural translation it becomes a little bit easier. I could talk more about that, but it's not important. Comprehend does entity and sentiment analysis — you can analyze medical documents and extract personally identifiable information. Then you have Forecast and Personalize, which are, interestingly enough, built on top of SageMaker, which I'm going to talk about at length a little later. Lots of customers do this stuff — people like C-SPAN and Zillow and Snapchat and, you know, whatever this eye chart says; we have those customers.

How many people are not familiar with machine learning? For the purposes of the video, that was a hundred percent of the audience, and I'm totally going to launch into a brilliant description of how machine learning works right now. I stole this from another amazing individual: there's somebody named Grant Sanderson, who runs a YouTube channel called 3Blue1Brown, and he has a series on neural networks. If you would like a PhD in neural networks and machine learning, please go check out his whole channel — it is phenomenal. I have some animations from him; we sponsor his channel and we're using these with his permission, but he will explain it far better than I can.

Has anybody ever seen this image before? This is a very common dataset called MNIST. It's a series of handwritten digits — there are 60,000 of them. The way you would write a traditional program to identify what these handwritten digits are is you might define an imperative feature: you would go in and say, hey, I want to take a convolutional kernel across this image and scan for different features, and if these different features exist — if a loopity-loop and a line exist, then I've got a nine; if I've got two loopity-loops, then I know I've got an eight; if I've got a loopity-loop and a line and the line is on top a little bit, that's a six.
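If you wanted to hand-roll that kind of imperative feature detector, the core of it is just sliding a small kernel across the pixels and recording how strongly it fires — a toy sketch, not code from the talk:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel across a grayscale image and record the response
    at every position -- the hand-coded, imperative way of looking for features."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# One hand-written "feature": a vertical-edge detector that fires where dark meets bright.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

digit = np.random.rand(28, 28)             # stand-in for one 28x28 MNIST image
response = convolve2d(digit, vertical_edge)
print(response.shape)                       # (26, 26) map of edge responses
```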
I don't know how to write my own numbers — I haven't used a pen and paper in years. So essentially, that is the old way of programming: imperatively defining what each set of steps should look for. You can do this, and people have done this to great success, but it's really hard, it takes forever, and honestly it doesn't work that well for generalized use cases.

So what do we do now? Now we make the computer do all the hard work. Does anybody remember a programming language called Prolog? Prolog is where you define your input and your output and you say, hey computer, go figure it out for me. That's a dramatic oversimplification of the language, but it's very similar to the way we deal with machine learning: we say, hey, I want to take these images — and they're labeled, so I know each one is a zero through nine — and propagate them through some network. These are 28 by 28 pixel images, so that's 784 pixels; I want to spread all those out, throw them through a network with a couple of different nodes, and determine whether this image corresponds to a zero, one, two, three, four, five, six, seven, eight, or nine. And I do that a lot of times — a lot, a lot, a lot of times — and what I'm doing is calculating the epsilon, the error between the predicted output and the actual output.

Now, you can run into problems here, which I might not remember to talk about later, so I'm going to talk about them now. One of those problems is overfitting, and overfitting is when you basically find a novel lossy compression format — it's like you're making an MP3 of the data — because you're saying: I am going to just memorize this dataset and not generalize. I'm not going to learn anything about the data; I'm just going to find a really neat way to answer all these questions correctly. Which is basically how I got through college. Actually, I dropped out, but that should tell you something.

So let's take a look at how this works. Again, these are not my animations — these are from Grant Sanderson, 3Blue1Brown, his amazing YouTube channel; please check it out. So we have each one of these pixels — painful — each one of these pixels, 28 by 28, split out into 784 pixels, and you're connecting each one of those nodes in that first layer to one of these hidden layers. These hidden layers — you can have as many as you want in the network. You don't typically do a lot of particularly complex things in fully connected networks like this; you're just connecting each node to every subsequent node. However, you can define extremely complex topologies in this kind of graph. What we're doing in these hidden layers is connecting each layer to the next layer. These neurons are essentially modeled after the human visual cortex. So if you think about the way the human visual cortex works — this is not how it really works, but I'm not a biologist, so I can lie and say that it is — there is light coming from some light source.
It goes through my eye, exciting some nerve in my optic nerve attached to my retina; those rods and cones get some increase in signal; that signal coalesces into me detecting a very bored audience, and perhaps I should move on. And so what happens is we take that same concept of all of these neurons connected to each other, and we represent it mathematically — we represent it as a graph. We can find all the edges, we can assign an activation, a bias, and a weight to each of those edges, and then we can iterate over all of them and say: hey, I just sent you an eight, I know it's an eight, but you gave me a one, so I'm going to back-propagate that error and adjust the weights so that next time you might give me an eight. And you just do that process a lot, a lot of times — and that's how machine learning works, and that's the new way we program things. Interesting.

So I can show you an example of this — or not — yes, I can. A neuron: you might think of it as just a static weight, but it's actually not — it's a function. A neuron is taking an input of 784 pixels, or 16 different inputs, and outputting 16 different outputs. Can you imagine writing a function that had 784 arguments? That'd be fun. I can imagine that in Python — I'm sure there's actually a limit in the stack frame or something on how many arguments you can pass. But you can take this and say, okay, well, if I know these things are lighting up, then maybe this is detecting little tiny edges and this is detecting loopity-loops, and when I get little tiny edges and loopity-loops, that makes a nine.

And your question might be: is that what the network is doing? We can all guess — is it learning little edges and little loopity-loops? Anybody want to venture a guess? For the purposes of the video, everyone's like, yeah, that's totally what the network is doing — but you're all wrong. No — sometimes the network does do that, sometimes it doesn't. There is an amount of chaos and randomness in machine learning which I find kind of attractive and enjoyable. You can use different kinds of starting points, different kinds of optimizers, different kinds of functions along each of these edges. People used to use something called sigmoid — don't use that, it's expensive and it sucks. There's this really good one called ReLU that works for basically everything, and you really don't need tanh or any of these other weird activation functions; just use ReLU, it works just fine. Google has done a lot of really interesting research into visualizing what's happening in convolutional neural networks — they published a paper a couple of days ago, actually, where you can introspect into these deep networks: what's happening, what's being visualized, what is the network seeing and learning. I thought that was pretty cool. So sometimes the network does do this, sometimes it doesn't. With a small network like this you can actually dive in and see what the network is doing; with much larger, deeper networks you cannot — the compute capacity and the cost to figure out exactly what's happening at each layer is not really worthwhile.
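For reference, the whole 784-inputs, couple-of-hidden-layers, ten-outputs idea fits in a few lines of Gluon. This is a minimal sketch of the kind of network described above, not the talk's code; layer sizes and hyperparameters are arbitrary:

```python
import mxnet as mx
from mxnet import gluon, autograd

# 28x28 images flattened to 784 inputs -> two small ReLU hidden layers -> 10 digit classes.
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(64, activation="relu"),
        gluon.nn.Dense(64, activation="relu"),
        gluon.nn.Dense(10))
net.initialize(mx.init.Xavier())

mnist = mx.gluon.data.vision.MNIST(train=True).transform_first(
    mx.gluon.data.vision.transforms.ToTensor())
loader = gluon.data.DataLoader(mnist, batch_size=64, shuffle=True)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})

for epoch in range(2):
    for data, label in loader:
        data = data.reshape((-1, 784))        # spread the pixels into one long vector
        with autograd.record():
            loss = loss_fn(net(data), label)  # how wrong was the prediction?
        loss.backward()                       # back-propagate that error
        trainer.step(data.shape[0])           # nudge the weights a little
```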
So how do we do machine learning? We start with a business problem, typically — or you're on twitch.tv/aws and you're coding and somebody throws out an idea, and that's your business problem. And you say, okay, let me frame this as an ML problem, and then let me collect a bunch of data. Let me tell you how this typically works in real life: this part — the business problem and the ML problem framing — doesn't typically happen. People say, "I have a lot of data, what do I do with it?" And then, "Let's hire a data scientist," and then, "Let's hire a machine learning person," and then the data scientist and the machine learning person are employed for about two years and then decide to quit, because there was actually no business justification for it in the first place. That's how most machine learning projects go. In the cases where that doesn't happen, you have some strong executive leadership saying, "Hey, I have this really cool idea, I want to bring machine learning and data science to bear on this problem, let's see if we can get it working."

So you frame the problem, you collect your data, you do a lot of preparation and cleaning. I will actually argue that this preparation and cleaning is vastly more important than the design of the network — I'll tell you why in a little bit. And then you do something that is so vitally important to this entire process that if you don't do it, you should probably find a different job: visualizing and analyzing the data before you use it to train. Why is this important? If I gave you a list of 13,000 numbers and told you to find all of the outliers, how many people could do that very quickly, without code? All the hands went down. Now if I show you a plot of 13,000 numbers and say, hey, draw circles around the outliers — how quickly can you do that? Instantaneously. The visualization and analysis step is so vital and so underrated in every machine learning process. All the customers I work with are always like, "Let's just throw all of our data at this problem." They never think about de-meaning or averaging or taking out some of the statistical anomalies to make their models better. So please do that.

And then, after you do that, you can bring some good old human intuition to bear on the problem, where you highlight different features of the data and say, hey, within this data I want to figure out what is actually going to affect the outcome the most. So — I play Heroes of the Storm. It's a Blizzard game, it's quite fun, I play it obsessively, perhaps too much, and I wanted to train a machine learning network to identify whether or not I was going to win a match of Heroes of the Storm. It was very easy: I just returned "no." But when I decided to actually train several models, what I was able to do was take that good old human noggin, apply it to the feature engineering, and ask: which data points probably don't need to be in this network? The names of the other players — because I have never even looked at the name of another person I'm playing with, so that probably doesn't actually matter. What else doesn't matter? The skin they're wearing — you can apply different models on top of these characters — really probably doesn't matter.
There are some arguments that certain skins have a psychological effect on the players, but that's all nonsense. What matters is their rank, how many actions they're performing per second, and other kinds of signals. So I can go look at all that player history and then predict whether or not I'm going to win — and of course the answer is always no. That's where you bring your human knowledge to bear on the problem and highlight certain features in the dataset.

Then you do the training and parameter tuning. This is relatively easy: you say model.fit, or data.forward, and then you just let it run until you get something that works — or you don't, and then you start over. To evaluate your model, and to prevent that overfitting we talked about a little earlier, you keep some of your data out of the initial training set. You take, I don't know, 10%, 20%, some portion of your data, you throw it to the side, and you make sure your network never sees it during training. Then you take that held-out data and push it through the network, and if the network is doing well — if it has never seen that data and it still tells you the right answers for that labeled data — great. If not, go back to the beginning and do it again.

Eventually, hopefully, the business goals are met — "business goals," very corporate-sounding, right? Eventually the business goals are met and you deploy the model, and then you serve your predictions and you monitor and debug. How many people think monitoring and debugging is about disk space, CPU, and memory? Yeah. Those are great metrics to track, and they mean nothing for machine learning models — that should be on by default anyway; those are your instances. What you're monitoring with machine learning is whether or not the inference was right. If you want an example of a network where nobody monitored whether the inferences were right, try using autocorrect on your phone. If you're not getting some sort of feedback about the predictions, if you haven't built in some mechanism to collect additional data about the inferences you're serving, you will be sorely disappointed over time as your model becomes out of date and you start recommending TVs to people who have already bought one.

And then there are lots of AWS services you can use, but the coolest one is SageMaker. Amazon PR and legal have told me I need to stop saying this, but SageMaker is my favorite service. We're not supposed to have favorites; this one is my favorite. It's really, really dope. It's also modularly designed, so you can use any one component of SageMaker without having to use all the other nonsense. If you already have existing stuff that works well for you, you can keep using that without having to pull in this whole big mess of a service — you can just use the parts that make sense for you. So let's cover what this is. It's really four different services in one. On one side there are Jupyter notebooks. These Jupyter notebooks are great — has anybody ever used one?
Yeah, it's a fun little tool — pip install jupyter. There's a new one called JupyterLab, which is equally cool and more of an IDE, but whatever. There's this great set of notebook instances, but I can also run those locally on my machine; I don't really need to run them in the cloud unless I'm on a plane and want to start some long-running job and might not be connected to the internet the whole time — in which case, hey, spin one up in the cloud, go run it, and connect back to it once I land.

Then you have built-in algorithms. Those aren't really the purpose of my talk today, so I'm going to gloss over them as quickly as possible. And then you have the training. This is really cool — it can do things like hyperparameter optimization, which is where you define a nice Bayesian optimizer that goes out and selects integer parameters, categorical parameters, anything you can write a regex for in your logs, and it iterates over those things: learning rates, optimizers — you can use SGD or Adam or whatever — you can even change different topologies of the network. You say, I want to try out all of this, and I want you to keep iterating and run all these jobs simultaneously until hopefully I get one that works. I think that's pretty cool, and it saves a lot of time. If you want it to run quickly, you just say, hey, run lots and lots of jobs simultaneously — keep in mind that if you run lots and lots of jobs simultaneously you might not always arrive at the best possible network, but who cares?

Then you get into the hosting service. This is great — it's Docker. You take a Docker container and you're like, lol, go run this Docker container, and it does. Lots of cool things there: you can do A/B testing, you can load multiple models onto it, you can shift 30% of the traffic to one model, 50% to another model, and — I forgot how math works — the remaining 20% to yet another model. The Jupyter notebook instances, the algorithms — it's all pretty cool.

One of the really fun parts of this is there's an open-source SDK called the SageMaker Python SDK. What you do is take your existing Keras, TensorFlow, MXNet, PyTorch, Caffe, or Chainer code — Chainer is a great framework if you've never seen it before, I really like it; you can do stuff where you modify the topology of the network as you're training it, which is pretty cool — or scikit-learn, if you just want to use scikit-learn. You can take any of those and say: here's my script, I want you to build a Docker container around it. That Docker container is going to both train on data coming from S3, or coming from wherever, and host inferences — you can do both of those things in the same Docker container, which is pretty cool.
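A rough sketch of what that looks like with the SageMaker Python SDK — the role ARN, script name, S3 paths, and metric regex are placeholders, and the exact argument names vary a bit between SDK versions:

```python
from sagemaker.mxnet import MXNet
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Script mode: SageMaker wraps train.py in a framework container that both trains and serves.
estimator = MXNet(entry_point="train.py",                        # your training script
                  role="arn:aws:iam::123456789012:role/SageMakerRole",
                  instance_count=1,
                  instance_type="ml.p3.2xlarge",
                  framework_version="1.4.1",
                  py_version="py3",
                  hyperparameters={"epochs": 10, "learning-rate": 0.01})

# Optional: let the Bayesian tuner search the learning rate across parallel jobs.
tuner = HyperparameterTuner(estimator,
                            objective_metric_name="validation-accuracy",
                            metric_definitions=[{"Name": "validation-accuracy",
                                                 "Regex": "val_acc=([0-9\\.]+)"}],
                            hyperparameter_ranges={"learning-rate":
                                                   ContinuousParameter(1e-4, 1e-1)},
                            max_jobs=12, max_parallel_jobs=3)

estimator.fit({"train": "s3://my-bucket/train/"})   # or tuner.fit(...) to run the search
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")
```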
Yeah — oh, and you can use GPUs, which you should for certain kinds of image data, but if you don't need them, don't use them. When you deploy these models, it is really, really cool. One thing I will say: there is a small limitation on each container — they're limited to 10,000 transactions per second. If you need more than that, provision a couple of them. I don't know why that limit is there, there's really no logical reason for it, but if you're doing more than 10,000 per second, good for you.

So I want to talk about using different datasets to improve models. We have this thing at AWS, an open data program, with two different sides. We have the Registry of Open Data, which is a GitHub project that looks like this — you can go and submit your own datasets, add your stuff in. And then we have something called the Public Datasets Program, and these are datasets that are, you know, good for humanity — I don't know what the right phrasing or corporate terminology for that is, but things that are helping to solve Alzheimer's or breast cancer, or sustainability initiatives, or Earth imagery. One of the really cool public datasets — which I'm not sure entirely how good for sustainability it is — is that we took all of the public data from the Hubble Space Telescope and made it available for anyone to use. It's really cool. I mean, it was already available for anyone to use, but now you don't have to pay for it.

This Registry of Open Data is pretty awesome; you can go and add your own stuff to it. If you want one action item from all of this: please go and add any of your cool datasets, because that lets other people go out and build models on your data, and overall we kind of democratize access to machine learning that way. We cover the cost of the storage for these publicly available, high-value, cloud-optimized datasets — that's the correct corporate language I was looking for — and it really lowers the cost of storing and working with these large datasets. And if you want to create your own labels, or add labels to the existing datasets, you can use something like SageMaker Ground Truth and an army of interns.

So the Multimedia Commons dataset, which is mostly based on Flickr images, is the dataset I'm going to be talking about today. But there's also a series of sustainability datasets; the — what is it called — the DigitalGlobe set of images. I remember I used to work at NASA Ames, up in Mountain View, California, and one of the things we did there — aside from having a really cool wind tunnel —
was analyzing mosquitoes from space using thermal data, which was really, really cool in, like, 2011. And then you have 35 terabytes of data from the Allen Institute for Brain Science and the University of Washington; all of the OpenStreetMap features dataset, so if you want to build automated tour guides and stuff you can do that; all the data from the Hubble Space Telescope; and then a lot of genomic datasets as well. Really cool stuff — I wish I were employed just to play with these datasets all the time, and I would constantly be building Twitter bots, but I'm not.

So how do we build the model for WhereML? For those of you who weren't here earlier: WhereML is a robot that takes pixel data from an image, identifies the locations in that image, and tells you very confidently exactly where you are, 16% of the time. I totally did not make this originally — I have iterated on it significantly — but we funded this research through something called the AWS Cloud Credits for Research program. If you are an academic, or if you are accustomed to writing grant proposals, I would encourage you to check this out. We do fund a lot of academic research with credits, so if you want to run giant analyses in the cloud and you have some really awesome grant, write it up for us and we'll probably pay you. I get the opportunity to review a lot of those for the West Coast of the United States, and I will say I am blown away at the absolute complexity and — what's the right word here — the absurd level of intelligence our customers have. They are doing fundamental research into things that happen in the first femtoseconds of the Big Bang — really amazing, pushing-the-edge-of-knowledge stuff.

The work I'm talking about today was actually based on work by Weyand et al., who are also based in Southern California, I think, and who work at Google. They released this model called PlaNet, and it was built on, I think, a model topology called GoogLeNet. It's a 40-gigabyte model; they trained it on several hundred million images. Anybody's Google Photos? Yeah — so, there you go, you helped train that. No, I'm teasing, I don't think they use your personal data; I don't think they would get away with that, Google. That model was great, but it was also massive, and there are issues with having massive models. So instead of using GoogLeNet, or TensorFlow, or whatever that framework was, we're going to use Apache MXNet, and we're going to use Google's S2 spherical geometry library. Geometry on spheres — like maps and projections, I'm sure everybody's heard this before — is a non-trivial problem to solve, so use a library that's already solved it for you rather than trying to do the math yourself.

Anybody ever been to China and tried to use GPS, or like Google Maps or something? You'll notice that you're always off to the top right by about half a degree of latitude, because they use a different projection. The way GPS works is you have these crazy little satellites in the air sending time pings, and your phone is receiving those time pings — just looking for them and saying, okay, I know this one is there, that one is there, let me triangulate, find where all these little circles intersect — and then you use public Wi-Fi signals, stored in this big database, and a machine learning network to identify where you are. Really cool stuff.
I can talk more about that later if anybody's interested, but in China it doesn't work, because of the different map projection. So: use the S2 spherical geometry library and get all the right map projections and all the right complex math in three dimensions done for you. Easy, right? It's a really awesome library, and it uses a piece of fractal geometry called a Hilbert curve to build sparse, compact indexes on geographical data, so I'll show you how that works.

What we're going to do is take all 33.9 million of the images with EXIF data — after we extract out all the black images, all the white images, all the single-color images, all the images that are of low value, the ones where the mean pixels are very, very dark or very, very bright, anything that doesn't have a lot of complexity to it. We don't really want those images because they're not going to add anything, and that's why that visualization step is so important: I can plot the mean pixel value of each image across a graph and go, let me draw a little line on my iPad and select out all of these. So then we take the remaining 33.9 million images, we build this cell network using the S2 spherical geometry library, and then we get an EC2 instance and we train, and we train, and we train, and choo choo — no, we keep training until we reach something that we like. We trained for a couple of epochs over, I think, nine days in the first iteration of this; in subsequent iterations we've been able to get the training time down to less than three hours, which is pretty fun.

This is built on ResNet-101. ResNet-101 is a really cool model; I think it was proposed originally by Microsoft Research. The way deep learning came about is there's this person named Geoffrey Hinton, who has worked at a bunch of different universities and done a bunch of amazing, fascinating work. What he did was go out and say, hey, I want to use deep learning to solve problems — except no one believed him for 30 years, and then he turned out to be right, and now he's famous and all his PhD students are famous too, and that's how academia works. One of those students, I think, built this ResNet-101 architecture as a Microsoft Research publication. It is essentially a cheat code in deep neural networks. In deep learning you have an issue where the deeper the network gets, the harder it becomes to propagate the error backwards, and there's lots of cool research about how to solve or skip around that problem. One of the things ResNet does is add little cheat codes at different layers, where the signal is strong enough — you can have this little attention function, you can have multiple different convolutional kernels running simultaneously, such that you can amplify certain signals. So you're able to train just those really big convolutional blocks. Why don't I just show it to you? I don't need to explain this, right? Makes perfect sense, very straightforward. So that's ResNet-101 — I think it has a hundred and one layers or something like that — and you're building all these kernels, building in all these cheat codes, to basically amplify different signals.

So this is what the S2 spherical geometry library makes: we take all of the images with their latitude and longitude data, and we build a CSV — well, if you're good at programming you build a CSV; if you're not good at programming, you just call pickle.
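To make the cell idea concrete, here's a tiny sketch using s2sphere, a pure-Python port of the S2 library — the cell level here is a fixed, illustrative value, whereas the real pipeline sizes cells adaptively based on how many photos land in each one:

```python
import s2sphere  # pip install s2sphere -- pure-Python port of Google's S2 library

def cell_token(lat, lng, level=10):
    """Map a latitude/longitude to the S2 cell (at a chosen level) that contains it.
    Each image's EXIF coordinates get bucketed into a cell like this, and the cells
    become the classes the network learns to predict."""
    point = s2sphere.LatLng.from_degrees(lat, lng)
    return s2sphere.CellId.from_lat_lng(point).parent(level).to_token()

print(cell_token(48.8584, 2.2945))    # Eiffel Tower -> some Paris-area cell
print(cell_token(35.6895, 139.6917))  # Tokyo
```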
And you'll notice that in places that are really well represented in the data we have many, many cells — Paris, Tokyo, London, any city with more than five million people is really well represented. It does this using a Hilbert curve. This is not important, but I think it's cool: it's a piece of fractal geometry that preserves locality across the x-y coordinate plane as you add more and more folds, and there's this really weird thing where the human brain has a resemblance to parts of the Hilbert curve, which is just freaky.

Okay, so there are some pros and cons of this approach — classification instead of regression, instead of trying to define it as a continuous number line. The pros: it is surprisingly precise in cities like Paris, Tokyo, Vegas and other unique landmark-y places. The cons: 7-Eleven signs and McDonald's signs, because I think McDonald's must use the exact same light bulbs everywhere in the world. There are some other cons too. Because we built it on ResNet, you're limited to 224 by 224 pixels. Luckily most of the data is also very small, but you have to make some intelligent decisions about the data, to find the pixels that are of the most value in an image before you use it to train the network — and I did not do that. If you post an image to Twitter, rather than trying to calculate an aspect ratio and crop toward the most important part, I just crop toward the center and then resize it to 224 by 224 pixels. So if anybody wants to improve this, that would be a great way to make the bot better.

So here are some cool statistics. PlaNet was built with DistBelief, used 2.3 million geotagged Flickr images in the test set and 91 million images in the training set, and I think it took 2.5 months on 200 CPU cores to train, which resulted in a 40-gigabyte model. LocationNet is 300 megabytes, which means it fits on most devices, and it also performs better than PlaNet, which I thought was pretty cool. That highlights the importance of using open datasets, and using feature engineering and visualization and analysis, to make your networks better.

So how does the AWS infrastructure work? The Twitter API — does anybody work for Twitter? — the Twitter API exists, and you can use webhooks. There's an Account Activity API within Twitter that runs these webhooks and talks to API Gateway, which talks to a Lambda function; the Lambda function talks to SageMaker, which is running our inference code in a Docker container from ECR, the Elastic Container Registry. I will talk about Twitter with anybody who wants, but it's kind of an interesting series of choices they made about APIs. So this is how the API Gateway part works: you have your Twitter webhook coming in, and then you're doing a proxy invocation of a Lambda function. That Lambda function parses the incoming tweet, verifies the tweet is from a particular person, sends the media URL to the SageMaker endpoint, and then uses the Twitter API to respond.
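A rough sketch of the Lambda in the middle of that pipeline — the endpoint name and the tweet-reply helper are placeholders, and the payload fields follow Twitter's Account Activity webhook format:

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

def handler(event, context):
    """Sketch: parse the webhook, send any attached image URL to the SageMaker
    endpoint, and reply to the tweet with the prediction."""
    body = json.loads(event["body"])
    for tweet in body.get("tweet_create_events", []):
        media = tweet.get("entities", {}).get("media", [])
        if not media:
            continue
        resp = smr.invoke_endpoint(
            EndpointName="whereml-endpoint",                  # placeholder endpoint name
            ContentType="application/json",
            Body=json.dumps({"url": media[0]["media_url_https"]}))
        prediction = json.loads(resp["Body"].read())
        reply_to_tweet(tweet["id_str"], prediction)           # hypothetical Twitter helper
    return {"statusCode": 200}
```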
We do some cool things here, and I can show you some of them, but first I have to talk about security. Security! You have a consumer secret that Twitter will provision for you. It turns out that during the beta you could, for a brief period of time, use any consumer secret and access anybody's webhook — but that was fun. If you want to be a good net citizen, what you do is take that consumer secret, look for the CRC — the challenge-response check, or something like that — and verify it. The way you verify it is you compute an HMAC, which is a hash-based message authentication code or something; you can use SHA-256 — GitHub uses SHA-1, because they use SHA-1 for everything — and you send that back to Twitter to answer the CRC. Then you can verify that each request is actually coming from Twitter by taking the signature header they give you, computing the HMAC over the body of the message, and making sure it all matches up. And don't use == here, because that makes you vulnerable to things like timing attacks and other fun stuff. Security, yay.

Then, after we've validated the request and gotten the information from Twitter, we call invoke_endpoint, which pops us over to SageMaker. The way SageMaker works is we have a Docker container running, and it looks for the model artifacts under /opt/ml/model. You can put as many models as you want on there — well, no, I think you can only put 64 or something, but you can put a lot of models on there — and then you write a little Flask app that responds to /ping with an empty 200 — okay, the best health check that was ever invented — and to /invocations, to get your JSON data out.

The way this works on the machine learning side: we import MXNet, we import good old NumPy, and we load the model in. We bind the model to the CPU, we say it's not being bound for training, we define a data shape of 1 by 3 by 224 by 224 — which is how ResNet works — we set the parameters by loading them in, and we create a namedtuple. Then we load that grids file — which, if anybody remembers, is this thing where, in an ideal world, this would be a CSV or something intelligent, but I just used a pickled Python object, because I coded this live on Twitch and didn't have a lot of time. Then we forward the data through the network. Whenever we want to predict something — the output of this network is really all 15,000 cells — we sort by the probability, grab the top however many we want, and return those.
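Pieced together, the inference side looks roughly like this — a reconstruction from the description above, with the checkpoint prefix and grids filename as placeholders:

```python
import pickle
from collections import namedtuple

import mxnet as mx
import numpy as np

Batch = namedtuple("Batch", ["data"])

# Load the trained symbol + weights (prefix and epoch number are placeholders).
sym, arg_params, aux_params = mx.model.load_checkpoint("locationnet", 0)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(for_training=False, data_shapes=[("data", (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

# The grids file maps each output class index back to an S2 cell's lat/lng.
with open("grids.pkl", "rb") as f:         # placeholder filename
    grids = pickle.load(f)

def predict(img, top_n=5):
    """img: float32 array shaped (1, 3, 224, 224), already center-cropped and resized."""
    mod.forward(Batch([mx.nd.array(img)]))
    prob = mod.get_outputs()[0].asnumpy()[0]   # one probability per S2 cell
    best = np.argsort(prob)[::-1][:top_n]      # sort by probability, keep the top N
    return [(grids[i], float(prob[i])) for i in best]
```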
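And since I glossed over the signature check a minute ago, here is roughly what that CRC dance and webhook verification look like in Python — compare_digest is the whole point of the "don't use ==" advice:

```python
import base64
import hashlib
import hmac

def crc_response(consumer_secret: str, crc_token: str) -> dict:
    """Answer Twitter's challenge-response check: HMAC-SHA256 the token with
    your consumer secret and send it back base64-encoded."""
    digest = hmac.new(consumer_secret.encode(), crc_token.encode(), hashlib.sha256).digest()
    return {"response_token": "sha256=" + base64.b64encode(digest).decode()}

def verify_webhook(consumer_secret: str, signature_header: str, body: bytes) -> bool:
    """Check that an incoming webhook really came from Twitter by recomputing the
    HMAC of the body and comparing it to the signature header. Use compare_digest,
    not ==, so you're not open to timing attacks."""
    expected = "sha256=" + base64.b64encode(
        hmac.new(consumer_secret.encode(), body, hashlib.sha256).digest()).decode()
    return hmac.compare_digest(expected, signature_header)
```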
And this is the coolest part of this project. Are you aware that in Unicode, if you add a particular offset to any two-letter country code, you get the emoji flag of that country? This is something someone shot over to me while I was coding this live — I wish I knew who it was; I've looked through the chat to try to attribute it. This is the dumbest thing I have ever seen in Unicode, and Unicode, for those of you who are not already aware, has a lot of really dumb things — you can do math on integers in Unicode by just adding the bytes together. There's some absurd stuff, but this is absolutely hilarious, and it works for any country code. So, take that.

And then, again, the best health check in the world; and then we say get JSON even if it's not JSON, and we return JSON even if it's not JSON, and that is how the whole thing works. This is the Dockerfile. The only important part here is this reverse geocoder — that's an open-source Python package that maps latitude and longitude coordinates to place names. Please don't post any pictures of Iceland, because all the place names are longer than 280 characters and Twitter doesn't like it when you post things that long, so I get a bunch of errors. People have actually tried to DDoS the bot and the API by posting pictures of places that have really long names. So if you wanted to be a bad person, you could go through the reverse geocoder docs, find whatever the longest city name is, and try to find a picture of it. That would be fun.

Twitter's API is generally available, for some definition of generally available, and the way this works is they send you a webhook whenever some event happens, and you consume that webhook. Prior to that, there was something called the User Streams API. Everybody used it, it worked great, and the world was a happy place. I can understand why they would switch off of that model, because it's expensive to run, and I am all for them making changes in their infrastructure and gracefully deprecating APIs and adjusting over time. That is not what they did. What they did was make it so that the only way to register a webhook — regardless of the number of webhooks you told them about — the only way to subscribe a particular webhook to a particular endpoint was to delete all the other webhooks, so that it would only have one choice, because if it didn't have one choice it would just pick one at random. That was in production, and it is still the case for certain of these endpoints. So — unless you're willing to fork over a ton of money to Twitter — machine learning is actually really fun and easy, and data is really accessible through all these different programs.

What I would say is that by using open data we were able to make a significantly better model than the ones that previously existed. Of course, it's 2019 and this was done in 2015 and 2016, so lots of things have happened since then, and everybody's constantly trying to one-up each other — Google has a really cool paper where they were combining different classifiers and different topologies across the network. Regardless: check out open data. This is really easy to get started with, machine learning is fun, and there are lots of tools. We built this live on twitch.tv, which is where the code was originally written.
I made the mistake of letting Twitch name it; they decided to call it Botty McBotface, so that is the GitHub URL of the project. There is thankfully an alias at github.com/ranman/whereml, which will direct you to the code. Thank you so much for coming to my talk today, I apologize for being late, and if you have any questions we can probably do those off camera, out in the hall. Thank you.

[Audience questions and hallway conversation between sessions — mostly inaudible.]

I'll be starting in just a couple more minutes to see if we have any stragglers. Thanks.

Hello everyone. My name is Nicole Martinelli, and I'm going to share with you this project I'm working on called Resiliency Maps. It starts out in San Francisco, because that's where I'm from — as you probably know, we have a lot of earthquake-type disasters happening. So I want to talk a little bit about what it means to be resiliency prepared: the tools we're using to make these maps, which are all open source; some of the challenges we've been having in terms of open data; and how you can go about making your own maps. I'm a San Francisco native, but by show of hands, who else lives here in California? Oh, lots of people outside of California. Okay, anybody outside the U.S.? No. So we Californians know about earthquakes, floods, firestorms. We're not so good with blizzards, tornadoes or anything that could be considered that kind of severe weather. But the fact is, disasters are really starting to get more geographically agnostic — they're happening everywhere. The idea behind the map is basically that in any given area you can chart assets and hazards with a view to staying safe in an emergency. This is a little more gloom-and-doom than I had intended, because obviously the first statistic is people who died, but I really hope you'll leave here thinking somewhere between que será, será and Apocalypse Now.
Because resiliency planning is really all about thinking ahead: a little bit of awareness and just the right amount of paranoia — you don't want to be completely crazy about it. So I changed what I was going to talk about a little, because I've been tracking what happened in Alabama this week with the tornadoes. Anybody else follow that? Twenty-three people died, and a lot of those deaths happened in an unincorporated area. And so my first thought, because I'm a nerd — has anybody done emergency training? Oh, awesome, lots of CERTs here. Okay. So if you've done that training, part of what they teach you is that you have to have a plan, right? And I was thinking: in San Francisco we get these alerts from the city government; if you live in an unincorporated area, how are you going to get that? It turns out that a lot of these people died despite having 12 whole minutes to get out of the path of this tornado that was going 170 miles an hour. What happened is people didn't listen to the warnings. They actually got them from the National Weather Service on their cell phones, and they didn't go — including the reverend of the local church about 10 miles away, where in 2011 they had built out a shelter, commercial kitchens, showers. They were ready. They did that in 2011, and from 2011 to now, people are like, I'll sit this one out. Even the reverend was at home. So what ends up happening, if you don't have a little bit of foresight — I spent the week watching and thinking about this — is you're stuck with: the Beauregard Volunteer Fire Department has a Facebook page, the church has a Facebook page, the Department of Emergency Management has a web page that they haven't been updating so much. So let's say, like the reverend, you manage to stay safe even though you don't shelter. How do you know where you're going to go? And the interesting thing, too, is that the shelter only about 10 miles from the epicenter of this doesn't take animals. That's something we learned from Hurricane Katrina: people will not evacuate if they don't know they can take their pets. So these are all factors that make a difference in what ends up happening to you.

So I want to do a little thought experiment, and I have a cool prize. We are here today in Pasadena, which, all you Californians will know, is earthquake territory. Earthquakes happen here — as you can see from this, more days than not: it's like 347 out of 365 you're getting a little wakey-shaky. Not a big one, but you're getting something. So I thought I would like to see how prepared the SCALE community is today, and it's a little bit of show-and-tell. I have this really cool gadget, a multi-tool that you can take through the TSA — I know because I've tried it. So let's see what you've all got in your backpacks. Starting with the easy stuff: who has an extra phone battery? Has anybody got hand sanitizer? Okay. Tissues? Okay. Anything approaching a first aid kit, like the random band-aid? Okay, nice, good. Let's see — oh, look, wait. So, all right. Wait — no, it is show-and-tell. What have you got there? A first aid trauma kit. So this is someone who's done NERT training, or CERT training — and I see you've actually got the CERT on your bag, so I just kind of feel like you automatically win. But let's see, what else have you got in there? Yeah. Anybody got food on them? Water? Okay. Oh, gosh.
I almost want to like exclude you because then like you automatically win. Oh, that's great. That's awesome. OK, well, that's exactly, that's why we do it. Because we're all volunteers. And the whole idea is that it goes kind of in neighborhoods. And I'm the co-coordinator of my neighborhood in Selman. We'll get to that in a bit. Did anybody pick up some great swag, like a t-shirt? OK. So I kind of love going to prepper sites because it's just not my mentality. I'm a little bit less like, oh my gosh. But if you really want to fall down a rabbit hole, look up something like 100 ways to use a bandana because they live for that stuff. So if you've got a swag t-shirt, you could use that to cover your mouth if the emergency were a fire. You could use it to wave for help, eye mask. You could filter water. I mean, there are all kinds of things. Oh, somebody's got a bandana back there. There you go. Exactly, or a scarf. I kind of, coming from San Francisco, you always have your layers, right? I see you've got a scarf as well. You can use them as bandages in a pinch. All kinds of good stuff. So, and again, the idea is, here's how prepared we are as a scale community. And you've got to imagine, let's just say the emergency is an earthquake. Fire, obviously, we're going to have to evacuate. But in a lot of situations, the conference center ends up being the assembly point, right? So in some ways, we're really good. But you've got to imagine, we're going to be here for 24 hours, folks, probably, who's got a hotel with an electronic card key. I'm not going anywhere. I'm definitely better here than trying to go back and get in the 14th floor of my hotel with an electronic card key, right? That may not work. So the idea that we're all here, well, luckily, he's here. But we're all stuck here for 24 hours. And in a conference situation, you've probably already got more stuff than you would normally. I've been known, you leave your house, go to Jimmy. You've got your phone, right? Oh, here's another word. This is my last test. Who has cash? Yeah, so again, that's one of the things I tell you in an emergency, because ATM folks not going to be working, OK? So that's definitely something you think about. And small bills, I think you clearly, hands down, this is our most prepared person. Come get your prize. Oh, well, I think then maybe you had a lot of stuff in the back. No, no, yeah. Exactly, so it's a mini clip tool. I can actually pass it around if you want to take a look. It's kind of a good thing to have. OK, so again, taking a step back and talking about basic preparation for any kind of emergency, if you look into this stuff, there's a great website called 72Hours.org. Then they give you all the standard information about what you should have in your go kit. One of the things that if you take a cert training class, is that they also tell you to keep a map. And part of the training that they do, especially in a place like San Francisco, they teach you how to look more at your surroundings. So we'll get to this in a minute, but San Francisco and lots of parts of California have buildings that collapse really easily, and you need to know which ones they are, right? You also need to keep an eye out on the construction sites, the mechanic shops, any of those kind of things that could be potential hazards, as well as do you live near a large school with an open parking lot that could be an asset? There's fewer things that are going to fall on you, right? 
But there really aren't any maps that give you this information. And I will tell you, kind of embarrassingly, in my first NERT kit was a San Francisco tourist board map. And it wasn't even to scale, and it didn't even include Hunters Point. And obviously, like I said, we all go through various stages of this, maybe unless you are the EMT, right? You're in it. We all learn as we're sort of going along, right? So this is an overview of our toolkit. Does anyone here already contribute to OpenStreetMap? Oh, nice. Awesome. Or any people who work in GIS? Awesome, nice. I forgot to ask, and I think I know the answer to the question, any ham radio operators? Two, that's pretty good. Okay. So this is just an overview of everything that we're using, and I will get into these a little more as we go along, but I just thought you'd want to see a snapshot. So this is what San Francisco looked like in 1989. This is almost exactly 30 years ago, and the USGS is predicting that in the next 30 years we're gonna have another one of the same magnitude. And we'll talk more about these buildings, but the ones that pancake are called soft story. And if you think about it too, in 1989 this was way before San Francisco's big tech boom. We've had a huge population explosion. We've got hundreds of thousands more people in the Bay Area, and because the affordability has gone down, we have a lot fewer EMT personnel, any way you look at it, a lot fewer hospital personnel, anybody who can afford to live in San Francisco. So our NERT, they call it NERT here, it's CERT in the rest of the country, we gotta be special, but they teach you basically that you're gonna be on your own for like eight to ten hours, because part of it is the geography here, that no one's gonna be able to get in from Oakland. BART shuts down, 400,000 people are stuck, right? So that's kind of the gloomy part. And when I first took the NERT training, I lived in a neighborhood that looked like this. So I was really, again, not quite to the level of the reverend who didn't go to the shelter in his own church, but that's Golden Gate Park. That church has a huge empty parking lot. Behind it is a grocery store, which is great in an emergency, right? Also with a big empty parking lot. There's a public library nearby, which is a natural shelter. Not all of the fire stations in San Francisco will be open in an emergency, but as luck would have it, in this neighborhood the battalion was near my house. I knew where the police station was. I didn't really think about it too much. So I did the certification and kind of forgot about it. Then I moved to a neighborhood that looked like this. It's basically all cranes all the time. It's South of Market. And I recertified with NERT, and to be honest, I completely panicked. I realized my tourist map is not gonna cut it. I have no idea where I'm gonna go. Talking a little bit about open data, you'd think that San Francisco would be just like the epicenter of open data, and to some extent it is. There's so much of it that the San Francisco Health Department has actually mashed up a bunch of data sets from the open data portal. And they've come up with this Climate Community Resiliency Indicator System. So this takes 36 data sets that they've weighted in terms of physical danger and the community and all these kinds of things. And they've come up with this map. So the darker green ones are the more resilient and the lighter ones are the least resilient, and it's one to five.
And I basically realized that I went from a four to a two. Okay, so South of Market's right there. And you can also see that if you're looking at it like, I need to be safe, I need to go to Castro/Upper Market, right? I need to think about going to those neighborhoods. But if I'm an emergency citizen responder and I need to think about which places are gonna need more help, all I have to do is look at, what is it, Civic Center, the Financial District, Chinatown, easy, right? But again, you're thinking about where you need to go in an emergency. And actually I live on the border of that. So Castro/Upper Market would be like my closest, safest neighborhood. So when I realized that my panic was somewhat justified, I took it back to NERT and I pitched using OpenStreetMap. Now here, I'll do a little bit less preaching to the choir than I normally would. But of course, even in NERT, I wasn't the first person to think of this, right? People have been using Google Maps to make these maps. And what had happened was you would get a really committed group of volunteers in one neighborhood who would spend, in one case, they spent three years working on this map. They would go there every weekend. And then when Google started charging for its API, they had nothing. The other thing about volunteers and crowdsourcing is that in general you have this problem: people move away. People find other things to do. So I pitched OSM and I pitched it really hard. I had been volunteering since the Nepal quake in 2015. I really felt in sync with other sort of earthquake-prone places. And I thought this is gonna be great, easy, right? But every time I got pushback from the other NERTs, I made a tutorial. So, is it user-friendly? Well, okay, here's how you make your first edit. Can I do it with my mobile phone? I don't know, I'll try it. If I wanted to try to use OpenStreetMap but I don't wanna use a computer, can I do it? Sure. So the other interesting thing about OSM for this is this last bit, the fact that you can add features to the map and do mass imports. So if you're working with Google or Apple or whatever, you can't just decide that you need a feature and add it. And with OpenStreetMap, you can. This is just a little quick view of my neighborhood. One of the other things that OSM does great that's important in a place like San Francisco is there's really good visualization of construction sites. So that's the feature that I'm looking at right there. But you can see there's a car mechanic, there's a collision center, lots of things that could explode, refrigeration supplies, any place with a hazard diamond, right? Which is another one of those things that we teach you in these classes to look for. Storage, that could be full of stuff that could go up. So you can kind of see what you would end up looking at in this. And as far as pitching the features, we've made inroads on a couple of things that are really interesting. So that little device in the corner, it's a police and fire call box, and there are 2,000 of those in San Francisco, they're almost every two blocks, and they run on telegraph technology. That's how long they've been there. Now, they're the only things that are still expected to work in an earthquake. The interesting thing about these, again, you think about geography as very specific, but the interesting thing that you learn is that a lot of these things are common. So when I started to look at, does any other place have these? Turns out they do.
So after Hurricane Sandy, Mayor Bloomberg was forced to keep the ones in New York, because you like redundancy in an emergency. You need that, right? So they have them in New York, they have them in places on the East Coast, like New Jersey, I believe, and in places in Europe. So this is a feature that we actually pitched and got accepted by the OSM community. So if you live in a place that has these, you can tag them now and upload them. The other wrinkle about the open data. Yeah, you can go on OpenStreetMap and look at all the fire boxes that we've mapped so far in San Francisco. Well, I will, but I also think the great thing about OSM is that anybody can do it. So it's not dependent on me, and we've got a fairly strong NERT community, but we don't have as many people in it as we'd like to. The other thing about NERT, and this idea that with disasters, until something bad happens, no one's doing anything about it, is that after the California wildfires, the NERT classes for the entire year are booked. You couldn't get into one in San Francisco if you wanted to. So that's a whole group of people that aren't gonna be trained. So the interesting thing about the open data is that there's a lot of like, who's got the data set, kind of thing. So these things are actually the responsibility of the Department of Technology, and I personally have not been able to get a map of them anywhere. Now, because there are 2,000, I think the other thing is if you get enough committed volunteers, you just do it, right? So, soft story. This is another interesting OSM story and open data story. These are the buildings that typically pancake in an emergency, and the definition is, let's see: in which one or more stories have windows, wide doors, large unobstructed commercial spaces, garages, or openings in places where normally a shear wall would be required. So there are still 4,000 of these in San Francisco after the '89 earthquake. Now half of them have been retrofitted, but the problem with retrofitting is they are earthquake safe, which means hopefully they'll be standing afterwards, but there's no way that they ever meet the standards of modern construction. So if I'm trying to figure out where to go, or do damage assessment, I definitely want to know where they are. So, I didn't think there was a tag for this in OpenStreetMap, because I thought it was a San Francisco thing, and then I ran into a slightly interesting problem. The OSM people know that it was started in 2004 by a British guy, which means that we have a lot of potato-potahto spelling issues. So I don't even know how I found it, but it's actually spelled storey, with an E-Y, and the World Bank had done a project in Kathmandu where they had already tagged 13,000 of these buildings. Now, they didn't actually propose it to the community officially, so we went through and did that, and now, as long as you get the spelling right, you can tag these as well. On the open data part of this, this is the Department of Building Inspection. So the soft-story list of these 4,000 buildings isn't available through the city's open data portal; it's owned by the Department of Building Inspection, and they put it on this lovely Google Fusion Table, which you can see is going to expire. I don't know, any Google Fusion Tables fans, right? I'm sad that it's dying, but as of December this year, Google Fusion Tables is going to die. I have no idea what they're going to do with this, but you can see the buildings, most of them are in San Francisco.
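If you want to pull the features we've been tagging back out of OSM programmatically, here's a minimal sketch against the public Overpass API. The emergency=fire_hydrant and emergency=assembly_point tags are established OSM tags; the exact keys for the call boxes and the soft-storey buildings went through the proposal process described above, so treat the tag choices and the bounding box here as assumptions and check the OSM wiki first.

```python
# Minimal sketch: pull emergency-related features for a bounding box out of
# OpenStreetMap via the public Overpass API. The bounding box is a rough,
# hypothetical San Francisco neighborhood; adjust the tags to whatever the
# OSM wiki says is current before relying on this.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# (south, west, north, east)
BBOX = "37.770,-122.420,37.785,-122.400"

query = f"""
[out:json][timeout:60];
(
  node["emergency"="fire_hydrant"]({BBOX});
  node["emergency"="assembly_point"]({BBOX});
);
out body;
"""

response = requests.post(OVERPASS_URL, data={"data": query})
response.raise_for_status()

for element in response.json()["elements"]:
    tags = element.get("tags", {})
    print(element["id"], tags.get("emergency"), element["lat"], element["lon"])
```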
So the other interesting thing about this is that you can get these as points and not polygons. If you download the whole thing, you can kind of do a little bit of magic with the block and lot number that you get from another database and get the polygons for these buildings (there's a sketch of that join below), but at that point you probably still need a volunteer to go out and check, because the thing with soft-story buildings is, once you kind of get your eyes around what they look like, you can spot them, and if you take NERT or CERT training they teach you how to look for them, but really only an engineer can be 100% sure. So it's probably better if you do a mass import and then verify. But you can see basically there's no neighborhood in San Francisco where you're not going to get these. To your question about how you're gonna make a custom map of these, we've been working on different visualizations, and this is a prototype that we did with QGIS. So again, welcome to my neighborhood. And you can see those two red buildings up in the corner, that's construction, and I apologize again, the colors are a little alarming on this, but it was something we did very quickly, and I live right in between those two construction buildings. So as we said before, if you're trying to get from there to, say, the Castro, all of those little green buildings are soft stories. So your best bet would probably be to walk along Market, but if you were trying to get to some of those darker green areas on that side of town, that pink is an elevated freeway. I'm not going under that. And then even if you get under it, look at how many soft stories there are. So your best bet is to either angle up, or if you're ready to deploy and help people out, you just go straight up. But again, this is sort of what you'd be able to do with your own neighborhood. So we've run a couple of mapathons, and the first one was with the NERTs in San Francisco, and this is Field Papers, which answers the question: yes, you can map with pencil and paper. So we got about 25 people out and we gave them just a few data targets. They were looking for soft stories, which we have the database so we could verify. They were looking for auto mechanics, and again, anything, a restaurant, a commercial kitchen, anything that can explode. And fire hydrants, stuff like that. We gave them five or six data targets, and as you can see, you can draw with pen or pencil directly on it. So you make a grid with Field Papers of the area that you want to map, print it out, give people the sheets. They mark it up and come back, and this is probably the only time I've ever done anything useful with a QR code, because it overlays it on OSM. So then you can just upload those edits there. The other one we did was a more traditional one using Mapillary, and we did it with Maptime. So again, more people who know maps, so it was straight-up computer work, and this is the Pasadena Convention Center. I don't know if the OSM folks have ever noticed that Mapillary has these street-side views there. So basically, here we are at the Pasadena Convention Center, and things seem to have more coverage here, so that's what you'll see. So all of those little green dots are photos that people have taken here, and I'll show you what it looks like if you go into one, but that's the icon there, and you can go in and either edit or check stuff. So in this case, here we are again in Pasadena. We've clicked on that little yellow icon, and I don't know enough about Pasadena.
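The block-and-lot "magic" mentioned a moment ago is basically a table join. Here's a minimal sketch with GeoPandas, assuming you've exported the soft-story list as a CSV and have a parcel polygon layer; the file names and column names are placeholders, so adjust them to whatever the Department of Building Inspection export and your parcel data set actually use.

```python
# Minimal sketch: join soft-story points (block/lot keys) to parcel polygons.
# File and column names are assumptions for illustration only.
import geopandas as gpd
import pandas as pd

soft_story = pd.read_csv("soft_story_list.csv")      # assumed to have a "block_lot" column
parcels = gpd.read_file("sf_parcels.geojson")        # assumed to have a "blklot" column

soft_story_polygons = parcels.merge(
    soft_story,
    left_on="blklot",
    right_on="block_lot",
    how="inner",
)

# The result is a polygon layer of probable soft-story buildings that a
# volunteer can still go out and verify on foot, as recommended above.
soft_story_polygons.to_file("soft_story_polygons.geojson", driver="GeoJSON")
```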
Does anybody know what these yellow fire hydrants are for? Specifically? So one of the things that, well, so one of the things in San Francisco is that certain fire hydrants, they have blue tops and white bases have potable water in them. So I don't know if Pasadena's got some, I know. And they do have to open them too but that's a whole other. So you could put, if you couldn't get a database of fire hydrants and you were just gonna have people doing them, you could do that. So you could put a point and you can see it's under those trees but because you've got all those photos you could look at some other ones and verify that that's where the point is. And the reason that you might wanna do this is because again, for San Francisco, the fire hydrants, that's not on the open data either. That belongs to the water commission. They don't release that data. They don't want people knowing, even though the fire department will teach you which ones have potable water and what tool you need to open it, they don't want people knowing which ones they are. So the only database that you can get that's public in San Francisco, fire hydrants are the ones that have been graffitied because the firefighters can't read them then so they have to go and clean them up. Now every fire hydrant in San Francisco might have been tagged at some point or another. So maybe if you were gonna start at a mass import that would still be a pretty good way to go but that's some of the stuff that you're up against in terms of the open data. So on one hand there's a lot of it but on another hand, like in terms of its accessibility, sometimes it's a toss up of whether you'd be better off just getting a lot of people out with the mobile phones or doing more mapathons. So the whole idea is really that you'd be able to make a map, print it and download it wherever you go with an eye to your safety and I did think about that a little bit more when I was coming out here and I printed out a very poor map of the area which sort of gives you an idea of like, it's got the hotel and the convention center and I can see that there's a park on it. I was using map hub. So again, it's another open source tool that you can do that. But really, we're trying to think more broadly about other things, anything that can cause disruption to your community. So San Francisco, as you know, we don't really do hot, right? We don't do heat waves like our air conditioning is when it gets over 70, you open a window and wait for the fog to come in. And nobody's gonna make a movie about the rock coming to save you because you're sweating in San Francisco. But we have had a couple of instances in the last four or five years where people die because there's no air conditioning and they don't know where to go, right? So again, the good side of our data mask up is that they're all kinds of maps like this again from the health department which will tell you right away which ones are the most vulnerable. So again, if you get a lot of people interested in map making and thinking about these things, the last heat wave we had, the NERC volunteers were deployed to what they call cooling centers which is a very fancy way of saying some place with air conditioning. And there are public libraries, some government buildings, the public swimming pools but you could get a lot more people out. It'd be very simple to make a map because the other map they have is like the percentage of people over the age of 65. 
So again, you're looking at people who are older and living in more vulnerable areas. It'd be really easy to get a bunch of NERTs together and say, let's go into these neighborhoods and talk to these people and make sure they know where to go, right? But you've got to make that shift and get people thinking about maps and making them, and also just how to get out of situations with whatever they've got, you know? So I think that is the overview. This is how to reach me. I'm happy to answer any questions you have. That was kind of a whirlwind. So if we have any questions, I can pass you the mic. Hi, though, this looks like a wonderful project. The exposure I've had to community maps before is for publicly accessible fruit trees or gardens. And I was wondering if, well, to preface that, a lot of those maps are kind of terrible. They're very ad hoc and they vary regionally. Is this a project that's nationwide? Hopefully. I mean, our idea was just to really try to come up with a process from beginning to end where you could make any kind of map you want that's basically about this theme using OpenStreetMap. So yes. Okay, and then my real question was, would you welcome porting that data from various food maps or fruit tree maps, that kind of a thing, into the system? Or does that not apply? It's an interesting question. I guess it depends on what you mean by importing them into the system. I mean, I think you could use OpenStreetMap as a base layer to make that kind of map with another one of these tools. But there is a process that you could try if you wanted to import fruit trees. And I believe OSM has trees. I believe that trees are an item. So you could probably just try to specify fruit tree, and then another tag, like you could specify what kind of fruit tree, but I'm sure you could probably do that within OSM. It wouldn't necessarily have to be a resiliency project. I mean, we're really focused on avoiding disasters. So fruit trees, again, it's a resource, but it might make for a noisy map if it's got fire hydrants and stuff on there. Okay, thank you very much. Thank you. Well, I have a question. Yeah. So what's to stop you from putting fire hydrants on the map? Like, is the water commission gonna be like, no, you can't have these, I'm gonna erase them now? Nothing, and that's part of the reason why in our mapathon we had people tagging them. So we didn't have people tagging them as potable water, because I know that's a problem, but you could tag all the fire hydrants you want, you could put in all the call boxes you want. I mean, any data that you can walk around and map is, to me, public. They can't keep you from doing that. They can keep, you know, you can't convince any bureaucrat that they need to give up a data set that they don't want public. I mean, that's what it comes down to really. Have you tried the Freedom of Information Act? I could FOIA that, yeah. You know, I guess part of it, my perspective, is that it's almost always easier to just do boots on the ground for this stuff, depending on how big the data set is. The fire hydrants are definitely big enough to make it worth probably a FOIA. The call boxes, you know, there are 2,000, I know how many there are. You could probably do that yourself. I mean, with a group of people. So thank you, yeah, I live in LA. I've been through several of the earthquakes and such, so it's good to know that. So I work for a fairly large government agency, and so I'm curious for you, what is like the unobtainable data set?
I mean, because most of this is public, is it government data that you're probably looking for? So what are the ones that are just really hard to get to, and, you know, what your experience is with that? Yeah, that's a great question. So starting with, when I started, with the fire department I just thought I could get a list of the fire hydrants and the call boxes and all those things from the fire department. The fire department has a GIS office with one person in it, and they make maps that are PDFs. They referred me to the Department of Emergency Management, and I don't think the Department of Emergency Management really sees themselves as public-facing. So again, I'd really be interested in hearing your perspective too about what that divide is, because the San Francisco Department of Emergency Management's got all kinds of data, all kinds of maps, but I don't think they really feel like they need to share it with the rest of us. The other interesting thing about the soft-story situation is I would love to do an entire California map of those, but every single city and county in this state has a different department that is responsible for them. So I don't think in LA, for example, which has a lot more soft-story buildings than San Francisco, that that public data set is available. You can look at parts of it through the LA Times, for example, but if I just wanted to try to do a mass import of that to OSM, I don't think that's available, and I think that's really a shame, I mean, worse than a shame, but let's call it a shame, yeah. That's it, yeah, so that's a really great question, and the question is, you're dealing with a crowdsourced project, what if people mess with it? What if people put things on it that aren't true? So far, we have trained most of the people that are doing this, and when we did the NERT one with the pen and paper, nobody was dealing with OSM, so they couldn't mess up the tags. When we did the Maptime one, everybody was on laptops; we had a couple of people who were new to OSM, but we gave them the preset tags so they couldn't unintentionally mess it up. Where I think we've gotten into some problems is with projects in OpenStreetMap that are paid for by companies. There's one that, I think in San Francisco this is our problem, autonomous cars, right? So for autonomous driving, there's a small group of developers who, I'm not gonna say who I think they work for, it's public, but I don't wanna say somebody and then be wrong, there's a small group of developers in Europe and their job is to map what they're calling the throughways. So it's basically, if you have a drive-through at a Burger King or something, they're mapping that as a throughway, and from our emergency perspective, any open parking lot could be an emergency assembly point. So you get into these problems with OpenStreetMap, like how many tags can you put on this thing, right? Is it a throughway or an emergency assembly point? And that's the kind of thing I think we're gonna have to do more work on, because again, it's diplomacy, but I haven't come across anybody defacing the map intentionally for these purposes. Of course, people do deface OSM, but we haven't really come across that yet, thanks. Yeah. Under the assumption that you want these maps to be usable after a disaster. Yeah. Have you looked at what would be required to get network access to the maps that are out there?
So our thinking is you're gonna print them and keep a paper copy, which is what the emergency responders tell everybody to do. It's in your go bag, it's in your car, you're not gonna be downloading and printing it on the fly. The idea is that for at least 72 hours, they imagine that you're gonna be out of juice, no power. So each NERT is gonna have the ones that they're relevant to them? Yeah, because again, you work in one place, you live in another place, maybe you're in Pasadena. I mean, I really was thinking about it. I, you don't know the area, you don't know where you would go, right? You print it out. So we are looking into that with a couple of, and we're trying to make it as easy as possible because the whole thing is, you know, NERT is a 501 3C and we could get an Esri license, but I haven't solved any problem if I have an Esri license because then I'm gonna be responsible for making all those maps. Thank you. Hi, he's right here, yay. Hi. Hi, for almost any application, it would be handy to have census data. Are there any open street maps that have any census data in them? That's a really great question, I don't know. I don't know the answer to that. I could find out, but I don't know. You've mentioned the mapathon and you showed us a couple examples of, you know, how you broke it up in tiles and how people went out and picked it up to four data points or so. Can you give a little more context to just how the mapathon came about, what were some of the, how many people were there and how extensive it was and maybe just a little bit more of, like, So which one did you, we've done two so far about. Are you more curious about the pen and paper or the computer? What pen and paper, like, how it was all organized Yeah, so I did a write-up of that on our blog if you want to read the whole rundown, but it was really fun because we picked an area, we were working with nerds, so you kind of know their training, they know how to spot buildings, they know how to look for things that could be potential disasters and we picked a neighborhood around the fire department training section, which is, again, south of market. There's a lot to be done there, let's just say, like a lot happening. So they came, it was two hours long, they had a list of five or six targets, that's all they were looking for, but knowing the nerds, we did give them a write-in option because I knew, knowing the neighborhood, there was gonna be stuff that they were gonna find, but like, wait, I saw, you know, this chimney. I mean, in addition to the soft-story buildings, we have these unreinforced masonry buildings, which you also have in Pasadena, there's all kinds of, you know, warehouses that are made out of tin and all these crazy things. So I knew there would be write-ins, but we gave them the option, like, you wanna do that, we'll try to upload your edits. So we sent them out with, we made a grid of the neighborhood, we toggled them so they're not overlapping each other, we gave them 45 minutes and we said, go, it was just like a treasure hunt, find as many of these targets as you can, mark up the map, here's a key to stuff, you know, so they have shorthands of, you know, they're not writing soft-story all over the map, you just, you know, an abbreviation, they came back and we started just to show them how to do it, uploading the edits. 
So we did just a couple, again, we didn't upload everything that they did, because there were, I wanna say, a hundred changes, like a hundred changes in a couple hours, but they were just out there for 45 minutes, pen and paper, looking at stuff, making notes on that. The stuff we're looking at doing is identifying staging areas, and we don't want that to be public for security purposes, we don't want the public coming to our staging area. Correct. When is the power gonna be back up? My water is brown, what do we do about it? Is there a way to keep some of this information private that we collect? So, okay, that's a two-part question, and we have that same problem with NERT in San Francisco. So you wanna think of OSM as your base, right? And that's gonna have the fire hydrants on it and all that information that's public. But what you wanna do, and part of the reason we're looking at QGIS is to do exactly that, is so that you can make maps of what you need. It's a different tool. You can have the base layer of all of this information that you want, but you can add things, because again, yeah, that information is not public. The other thing that we've been thinking about too are the shelters. So we decided to tag any place that's a common-sense gathering place, like an empty parking lot; the tag in OSM is emergency assembly point. Now, San Francisco Open Data, they do have a data set of shelters, but as you know, during an emergency, you don't know which ones are gonna be open. Now personally, if I've got a paper map, right, I've printed this out, I'd rather know which could be a potential shelter. You know, if I've got two potential shelters, one of them takes pets, I need to get there with my dog or whatever, I need to know, I'd at least try it. If I don't know that there are any, right, I won't be able to go there. But again, things like that would probably be something you put on your own QGIS map. So it's just for you, because we don't put those on the NERT map either. Yeah. All of us here are gonna have this much, we have more of a plan than people who haven't gone to this talk, right? I hope so. Thank you. But like, if I wanna tell my family members, suppose you do have an internet connection during an earthquake or something, this is what you should be checking to see all this GIS information or something. What should I be telling them to check? I wouldn't assume you'd be able to check any GIS information on the internet. To me, everything goes analog as much as you can. You need a paper map. I mean, if it's a place you live in or a place you work in, or you haven't recently changed your job or something, okay. But otherwise you need to go paper. Just tell them to print something out and pin it up. You know, we've actually seen NERTs bring in those really old California map books, the ones you get at thrift stores, with the binding, that we used to keep in the car to drive around with. Thank you. People keep those. In a pinch, that's better than nothing, right? And again, in a place like Southern California you're driving a lot, because in San Francisco, like, you're basically walking. Oh, I know what my other question was: is anybody here wearing open-toed shoes today? I know it's March, but like, it's California, right? No? All right, very sensible. Because that's like another classic, right? High heels. High heels, yeah, impractical shoes. Yeah, because there's gonna be glass everywhere and you're gonna be walking.
So the idea is that now, especially in California, if you're looking at North America, most of these buildings are gonna stay up, but the glass is not. So there's just gonna be glass everywhere, which means you're not gonna be driving, you're not gonna be bicycling, or using any kind of vehicle. You're gonna be walking. So in this kind of situation, I would imagine if you live around here or LA, you've got a pair of shoes in your car. But to answer your broader question, there's a really great site called 72hours.org that has all this stuff about what you should have in your house. And it's great because it goes into things like, if you drink, have a bottle of wine, make sure you have board games, a pack of cards, a musical instrument. Like, a lot of NERTs put harmonicas in their go bags because it's boring. You're sitting around a lot, and that's all good stuff to have. But that's something you need to start thinking about, yeah. Other questions? Really quick, we actually have an OpenStreetMap project where I'm from, where we are mapping sidewalks. And we've had a couple of successful mapathons where we show them, it's like a 15-minute tutorial and then they're off and running. But I want to go back to the question about untrue stuff on the maps. I was under the impression that anyone can go make edits, but there's also kind of like a chain of command where people verify those edits. So I don't know if you mentioned that, so I wanted to make sure that was mentioned. Yeah, so on OpenStreetMap, there is a possibility to check an edit and ask for someone else to review it. And in fact, it gets reviewed quite often; even if you don't check those things, people will spot it. It's like on Wikipedia, where you have editors that are passionate about a topic and they basically take ownership of a page. At the same level, there is someone who becomes the editor of that block and they monitor everything that goes on around it. And if they see something that they don't like, even if sometimes it's a legitimate edit, they're like, ah, this gets discarded. But the main principle with Resiliency Maps is the fact that we rely so much on print, that basically the time when you print the map is the time where you make sure that all the features that you have added and that you need to have are actually real and true. You know your neighborhood, you know your block, and you correct things if needed, before you put that map for that trimester, for that quarter, inside your go bag. Any other questions? Sir? [Inaudible question about how to figure out evacuation routes and where shelters are.] Depending on where you live too, you may be able to get text alerts, although in my experience they don't really tell you where to go. They just tell you, don't go, you know, avoid this area, and hopefully we'll be able to get cellphone reception. That's not really a given, so yeah. But definitely also on the communication side, if you get the chance, look into the ham radio talks. I was really impressed with what they're doing with the ham radio internet stuff. I recently got my ham license and I feel like I've crossed some kind of nerd Rubicon, but. Yeah, definitely. But I think it's interesting because again, how are we gonna communicate? We're so used to being able to contact each other instantly. All right, let's have another round of applause. Thank you. Thank you very much.
The next talk will be, I think, is in 40 minutes. Testing. Hello, everybody. Thank you for coming to, were the penultimate open data talk? Ultimate, penultimate? All day tomorrow. But I think this is the last of the open data track or pen, oh, two more talks. All right, I'm behind. Well, thank you for middle of today. So this is governing via API, open source collaboration in city government. We're gonna be talking about the mobility data specification, scooters, open government, and how two jurisdictions can work together in delivering better services for our residents. So before I get started on the issue today, which is scooters, I sort of want to ask a question that I think a lot of people in this room have an answer for, which is what is code? So in one sense of the answer, the answer to the code is computer code. It's Python, it's JavaScript, it's SQL, it's all the stuff we do on computers. On the other hand, there's also this thing we call the law, the municipal code, in fact, which is the laws of the city of Los Angeles, the laws of the city of Santa Monica, laws of the state of California are all also encoded. So a little bit about me and my background sitting at the intersection of computers code and government code. I'm a senior data scientist at the city of Los Angeles. Before joining the city, I was at the Center for Data Science and Public Policy and a number of associated organizations. I've been an active member in my civic technology community. It's shy hack night and hack for LA. Now that I've been, since I moved back to here, you can stalk me on the internet on my website and GitHub, see all the fun stuff I've worked on. And my name is Hunter. And my name is Kagan. I'm the data officer at the city of Santa Monica. Been working there for quite some time, doing open data and open source work. And I also organize the Monday night hack night, every Monday night in Santa Monica for hack for LA. So come check us out if you're on the West side. Also stalk me online. So this is Los Angeles. Recently, a recent photo, I finally updated this to reflect our recent skyline and snowpack. Some things you may not know, we have about four million residents who live in the city of LA. Who in this room lives in the city of LA? Who in this room lives in the county of LA? Okay, so as you can know, there's a number of different jurisdictions. We service about 65,000 miles of streets every year. We handle 700, actually that's probably closer to 850,000 service requests a year. We have 46 different departments, everything from the airport, which is Los Angeles International Airport and the zoo, the port of Los Angeles, the zoo, sanitation, transportation, public works, what you name it. And then our neighbor, neighbor to the beach side. This is Santa Monica, a little bit smaller, a little bit sleepier, 93,000 residents. So, you know, a little bit smaller, 8.3 square miles. So you saw they had 6,000 street miles. We only have 8.3 square miles, but we do have a regional bus system that serves 16 million plus riders per year throughout the county. We have a police department, fire department. So we're full service city in Santa Monica. And we have the, I think it's the second or third largest budget of a city in Los Angeles County right behind Los Angeles themselves. So we're a fairly well resourced and regionally connected city though, on a much smaller scale. 
So typically when governments share information and when I work with folks like Kayin or we're trying to coordinate across, like the line at the city boundary is literally, it's a line, there's a wall there. You might hear that we do like public meetings and the public meetings for LA are totally different than the public meetings for Santa Monica and Long Beach. We'll all have our own Twitter, Facebook or websites. And one of the things that's really been happening is a shift from these human centered methods where we're really trying to communicate with humans which are super good and important. And there's a lot of ways we can do them better. This talk is not going to be about them. This talk is going to be about machine centered methods like open data, APIs and like metrics of government and why it's important to do standardization in this field. So I'm going to skip this. The big use case has been this mobility data specification that we've been putting together with our collaborators, City of Santa Monica. And the reason why is you've probably seen there's been a big change in the mobility market in Los Angeles, Uber and Lyft happened. Like this was not a driver of vehicle trips five years ago. And it now is a fairly massive one. And we have incredibly little data about that. Thank you, California Public Utilities Commission. But there's this massive change in how mobility is being delivered in the city. The city of LA, we've launched an electric car sharing service. If you live in the city, check it out. It's mostly in center city right now. It's expanding into the valley shortly. So blue LA, you can rent an electric car by the hour. So this is the station in Westlake. And then there was this explosion of scooters, dockless bikes, e-bikes, the wheels. Yeah, there's now like a scooter, moped, and hybrid. There's a lot of dockless devices that operate in the public right of way. And then we also have future disruptions coming. This is Elon Musk who wants to build a tunnel over both under our jurisdictions, see how that goes. And we know something else is happening with the mobility market. We know that we're gonna start seeing drones. We're gonna start seeing autonomous vehicles. And this is a different model than what has historically been presented in our departments of transportation and requires a different skill set. So the question I got tasked with by our department of transportation was some cities chose to do pretty small. If you're the city of Chicago, the entire city of Chicago, which is as big as, almost as big as the city of LA, only permitted 1,500 total scooters and dockless bikes. So that's one way to solve the problem. Like a lot of these responded to this by saying, okay, we're just gonna run a real small pilot and just observe what happens in like one neighborhood, keep it really tiny. Our city council was not very interested in that. I think neither was Santa Monica's as well. We're interested in saying like, look, these things are good in a sense. Like people might not be driving. They can get around easily. Like we don't wanna restrict innovation. There's all sorts of things. But the technical question that sort of eventually gets distilled from us from our transportation and mobility folks is, how do you support hundreds of thousands of devices with different types in hundreds of different cities potentially? There's 88 alone in LA County with an infinite supply of edge cases. These things move. They go between our jurisdictions and other jurisdictions. 
And then finally, this is a map of Los Angeles with all the different 88 cities in the county and some of the extended counties. All of these are gonna pass different rules about how these operate. And it's like, all right, how do we communicate it? How can we make sure that folks are operating in a safe and sane manner? So we being the enterprising programmers that we are turned to GitHub. And we said, well, what if instead of saying you have to report, when you're a taxi company in the city of LA, you have to report your metrics on a floppy disk? I'm not joking. We got our metrics from these taxi companies every month on a floppy disk. We're gonna switch that. So we set up the mobility data specification on GitHub. We opened it up. We've received a number of contributions and other cities adopting it to say like, look, this is not your policies. This is not setting how many scooters you would like. This is not setting where they're allowed to go. It's not setting what your performance targets are, what your equity zones are, but rather a way of getting information from these companies and pushing information out to these companies that is standardized so everybody can implement their policies and actually see it be followed. Because we're all publishing in a slightly different format. These companies are just gonna ignore our rules. So we open sourced it a week ahead of actually announcing this to the city council and going before with our regulations. We just released our 0.3.0 version, so we're continuing to iterate pretty quickly on this. I think we've received over, there's over 40 unique contributors from a half dozen different cities and all the, you know, most of the major companies in the space, Bird, Lime, Spin, Uber, Lyft. And we split this into like two API definitions. So one is the provider API, which is what we're mostly gonna be talking about. And this is a simple way of representing what has happened on the public right of way. And then the agency API is sort of our way of expressing rules that the city owns. Let's think of it as like a digital street light. Like, are you allowed to park there or not? Just like we put up a street sign, we're now gonna put that up in an API format. The architecture allows us to really scale this horizontally, you get this like nice sharding pattern of, you know, there's a number of different, I don't really care about the relationship between you and your mobility provider and their specific devices. I really just care about what's been going on in the public right of way, because it's our responsibility to ensure that the public right of way is safe. And then so you can see how data flows from these companies, both to you, represented by this cell phone over here, and to the cities represented by Ferris wheel and the city hall thing. And this is sort of the architecture we've sort of proposed for how mobility as a service companies can more easily interface with city halls. So diving into the provider API a little bit, and let's talk about what this data actually looks like. So there's two primary components to it. There's status changes and trips. And if you think about a scooter, an e-bike, a device like that, it kind of has a life cycle of events throughout the day or throughout the week. 
You can imagine that in the morning, the company puts it out on the street, and then maybe a couple hours later, a user comes and picks it up, they rent the device, they take a trip, they go somewhere, at the end of their trip, they drop it off, they put it back on the street corner. Maybe later that night, the company comes and picks it up, they charge it, then the next morning, they put it back out. So there's a series of events that occurs to each of these devices throughout the day or throughout the time. And that's what status changes. It's each individual event that occurs on a device. And we'll get into why this is important and how we're using this data in a minute. It's just important to understand. So we've got basic things, the time, the location that the event occurred, the ID of the device, obviously the company that owns the device, some information about the type of event it was. There's different types of events, right? So this, you can think of just kind of an event stream. And then what's probably more familiar is the trip data. So this is where the trip started, the path it took, the time that was, the user was on the trip, maybe the cost of the trip. And what's important to understand with trips, and when we're looking at this data, is there's a distinction that we make between the vehicle and the device. So the vehicle is the physical scooter or bike that you see on the street. It's the thing that you actually pick up and ride. And as we probably all know, these things are constantly being turned in and out of service, right? They're buying them in cheap by the boat load and deploying them on the streets and the tires wear out or whatever. So they change the vehicles, but separately from the vehicles we have the devices. And these are the actual units that are transmitting GPS data. And so we wanna get an idea from the companies of how many vehicles they're putting on the street versus how many devices kind of get an idea of their operation. So as Hunter mentioned, we both been working on this primarily on GitHub. And Los Angeles and Santa Monica have not only been working on the specification itself, but a lot of the code that goes into how do we implement the specification from a city side. So Santa Monica released the MDS provider, Python library on GitHub. And we've used that, Los Angeles is using that. It's the framework that we use to ingest the data from the scooter companies and the bicycle companies stored in a standard format. And then each of us kind of has our own rules and regulations like Hunter mentioned, LA council has a lot of equity zones in their own kind of council districts. Santa Monica has other rules and regulations. So we implement that piece of the project separately, but we use a common data standard in the middle and a common Python library to collect the data in the middle. So what's really nice about having a data specification is you can validate the data that you're getting from the companies against the standard, against the specification. One of the first things that we did when we were about to launch the specification 0.1.0 is we took it from the nice, friendly, human friendly markdown file that we can all read and turn it into a JSON spec document that we can actually validate the data against. And so you're seeing some output from a data validation there. 
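Here's a minimal sketch of what that validation step can look like in Python with the jsonschema library. The schema path and the example record are placeholders; the real JSON Schema files live in the Mobility Data Specification repository, and the record below is hypothetical and deliberately incomplete, so it only illustrates the mechanism.

```python
# Minimal sketch: validate an incoming status_change record against a local
# copy of the MDS JSON Schema before it is allowed anywhere near the database.
import json
from jsonschema import validate, ValidationError

with open("provider/status_changes.json") as f:   # assumed local copy of the schema
    status_change_schema = json.load(f)

record = {                                         # hypothetical record for illustration
    "provider_name": "example_provider",
    "device_id": "0a1b2c3d-0000-0000-0000-000000000000",
    "vehicle_type": "scooter",
    "event_type": "available",
    "event_time": 1552000000000,
}

try:
    validate(instance=record, schema=status_change_schema)
except ValidationError as err:
    # Reject the batch, or flag the provider, instead of inserting bad data.
    print("rejected:", err.message)
```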
So every time we collect data, which is on the hour, we're collecting data from the companies, we're validating exactly what we get against the spec, making sure that it passes validation before we stick it in the database, so that we're getting valid data from them. Brief interlude: I realize you are not all necessarily people who have worked in government in the past, but data validation in government is such a huge pain point. Like, if you've ever been munging through spreadsheets for days of your life, and I have, because every company, or every permit application, interpreted "number of units" differently. We had this huge problem where, when we collected data about every rental unit, our rent registry data, some property owners put the number of rooms in the building and some property owners put the number of bedrooms in the rooms field. So eventually I had to write this rule to detect which one I thought they were putting in and filter it into the right category, because there are 100,000-plus rental units to look through. But by being able to do data validation in real time, by defining the spec in advance, that problem has essentially been solved for us, and it has allowed us to get to analysis at a faster rate than in any of my previous government data projects, which has been really cool to see. Yeah, and it's worth reiterating that both LA and Santa Monica were interested in kind of the marketplace of ideas and the innovation in companies that can come out of this. So it's important for us to be able to collect data from a wide range of companies. They all have different operation styles. You know, some of them operate 24 hours a day. Some of them are just kind of during the daytime. So the specification allowed us to say, here's what we're expecting to get from you; how you operate on the back end, how that data is produced, that's up to you, and that's kind of your secret sauce, and we hope that the best company wins in that regard. But like Hunter mentioned, when we're getting data as a government entity for the purposes of regulation, for upholding the law, it's important that we can validate that. So typically this takes a few forms for us. We have dashboards and reporting mechanisms. So this is a sample dashboard that allows us to show our council districts what's been going on. And then we also use this to power integrations with our 311 system for when scooters have been out on the street for too long; like, after a certain period of time we might notify a company that they have to remove a device. So we've been working on making sure that this is more of a real-time framework for making sure that the public right of way is safe, rather than necessarily saying, oh, we looked at this data for a year and released a report. We're trying to really make sure we're getting into an agile cycle of continuous improvement. We're also, now that we're launching our own agency APIs, really starting to do developer support and writing documentation for city APIs, which has been really exciting. I think the big question here is, how are these companies hitting these performance targets? I think one of the big challenges we've faced is these things move, they move across jurisdictions. So most jurisdictions have some sort of cap in place.
Like, you are only allowed to operate, say, a hundred devices. But does that mean they can deploy a hundred devices? What happens when they cross our very porous municipal boundaries? What happens when they're lost and stolen? Like, what's going on? So one of the things that you'd think is simple in the government code, they wrote, each company has a cap, it's like one line in each of our codes, I mean, in the government code, is actually hundreds of lines of computer code to figure out that number. So I'm gonna let Kagan talk a little about how we calculate those cap numbers. Oh, this is, I kind of explained this, but one of my favorite blog posts, if you've never read this, why counting is the hardest thing in data science, highly recommend it. We'll post the slides via Twitter and to SCALE so you can find this link, but I highly recommend you check this post out. But yeah, so they move in and out of boundaries, and I'll let you delve into the weeds here. Sure. Yeah, this is where we're definitely gonna get into a little bit of the math and the technical side of this talk. So again, each of our programs has set a limit for each company, the number of devices that they're allowed to put out on the street. And that's kind of the primary metric, at least for right now, as we're working through how this is gonna look into the future. In Santa Monica and LA, the primary metric that we're looking at is how many devices are out on the street. It's the primary concern that we hear from our residents: they're littering the sidewalk, we can't get down the ADA ramp, they're blocking my stroller path, and that kind of thing. So that's one of the things we care about the most, and it's one of the things that we need to use this data for. So we've been collecting data from the companies, status changes and trips. Now what do we do with that data? How do we actually analyze it to count? So going back to what we were talking about before, we've got this series of events that occurs throughout the day for a device, and then we also have trips thrown in the middle there. So we want to first reconstruct, from the series of events and those trips, what we call the activity windows. And there are two types of activity windows. There are the windows of time when a device was inactive. It was sitting on the corner, waiting for a user to come by and pick it up, rent it. And then there are the active windows. That's the time it was on a trip, right? So we need to look at both of these because, as Hunter mentioned, it's not that you get to deploy a thousand devices every single day and you just get to put out a thousand more every single day. It's that you are allowed to have, as the company, for example, a thousand in operation in the city. So that's between what you're deploying and what's being brought in and taken out. So we need to account for all of this movement throughout time, across whatever day or whatever time period we're looking at, whether it's a day or a week or a month or something like that. So we'll talk about each of these types of windows and how we calculate that. So we'll start with the inactive windows. And again, this is just the time that the device is sitting on the street. So you imagine in the morning it's deployed, and before the first trip starts there's some window of time where it's just sitting there on the street, okay?
And then, similarly, we reconstruct the active windows, the time the device was on a trip, right? And this seems pretty intuitive: the trip started, the trip ended, there's your window of time. But it gets tricky when we start talking about these municipal boundaries, which are non-obvious and porous to an ordinary user. They don't really care if they're starting a trip in Venice and cruising up through Santa Monica out to West LA; they're not thinking about the municipal boundaries they're crossing, right? But in Santa Monica, I can't count scooters that are in Venice against a company, because they're in Venice; they're not in my jurisdictional boundary. So we need to consider the portion of the trip that was within the boundary. Whether it started in Santa Monica and eventually left, or started outside the city and eventually came in, we need to look at each individual point in the trip and figure out when the first or last point was inside the boundary I care about. And that boundary could be one of, how many council districts, 15 council districts? And just to make the geography question even more fun, I actually have three different types of geographies on top of the 15 districts. So we have a combined total of 45 different sub-jurisdictions that these companies are counted against, rather than what you'd assume, that it's one blob. No, it's actually 45 separate blobs: three types of equity zones times the number of council districts. So building robust software to do these calculations has been really crucial, and it would not have been possible without open source software like PostGIS and Python. And again, in Santa Monica it's a little smaller; we just have "are you in the city or not," which is fairly easy. But it's the same type of problem at a smaller scale, and of course we're doing this open source so it can scale up to cities like Los Angeles and other larger areas. So we take the portion of the trip that occurred within that boundary; maybe it was the entire trip, the whole thing happened in Santa Monica, or maybe it was just a portion of it.
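Here's a toy version of that clipping step, keeping only the slice of a trip between the first and last GPS points that fall inside a boundary. The square "city" polygon and the trip coordinates are made up, and the real pipeline does this kind of operation in PostGIS rather than in Python.

```python
# Toy sketch: clip a trip to a jurisdiction by keeping the points between the first
# and last fixes that fall inside the boundary. The square "city" below is made up.
from shapely.geometry import Point, Polygon

city = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])   # stand-in for a city boundary

def clip_trip(points, boundary):
    """points: list of (timestamp, lon, lat). Returns the in-boundary slice, or None."""
    inside = [i for i, (_, lon, lat) in enumerate(points)
              if boundary.contains(Point(lon, lat))]
    if not inside:
        return None                                     # trip never entered this jurisdiction
    return points[inside[0]: inside[-1] + 1]

trip = [(0, -2, 5), (60, 1, 5), (120, 5, 5), (180, 12, 5)]   # starts and ends outside
print(clip_trip(trip, city))   # only the middle two fixes count against the cap here
```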
So now we've got all these inactive windows and all the active windows, and presumably that describes all the activity for every device. We take the union of those and we've got a series of windows, right? A series of availability windows, we call them: windows of time when a device was in the public right of way, whether it was being ridden or just sitting there. Either way, it was in the public right of way. Okay, so now we've got the windows, and we have to count them, and we need to do that in a way that accounts for the period of time we're looking at. Both of our regulations specify a certain number of devices per day; it's 750 scooters per day in Santa Monica. So the period could be a day, or we might want the weekly average or the monthly average or something like that. Whatever it is, we take that time period and partition it into smaller segments. Some of these could be three hours, 30 seconds, one second, whatever; it's basically the data that tells us how to partition the time period. And then we take all those windows. And what determines the partition? Right, so we're not just randomly picking this partition; the data itself describes the partition, it's kind of implied by the data. So we take all the windows we calculated, the inactive windows and the active windows, and we overlay them on our partition. And we essentially count: this segment had five windows overlaid on it, so it gets a count of five; this segment had two, so it gets a count of two. Here's where the math comes in. We can calculate the area of one of these segments, where the width is the number of seconds in the segment, 30 seconds, 60 seconds, 3,600 seconds, whatever it is, and the height is the count we calculated over that segment, the number of availability windows that crossed it. You can imagine, if the segment is an hour long and 60 windows crossed it, you can calculate that area. And now we do a little bit of calculus. If you remember the Riemann sum from your calculus class in high school, anybody remember that one? We just add all these areas together and divide by the total length of time in our period. If it's a day, that's 86,400 seconds; if it's a year, whatever it is. And that gives us an average availability over the time period we're looking at. Again, this is a generic algorithm, so the period could be an hour, a day, a week, a month, three months, whatever. And this, again, is the primary metric we're watching right now with these companies: ensuring that they're not flooding our streets with thousands and thousands of devices.
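Here's a compact sketch of that calculation as described: cut the period at every window boundary, count the windows overlapping each segment, sum segment length times count, and divide by the period length. Timestamps are epoch seconds and the example numbers are invented.

```python
# Sketch of the average-availability metric: a Riemann sum of (segment length x count),
# divided by the total length of the period. Times are in epoch seconds.
def average_availability(windows, period_start, period_end):
    """windows: list of (start, end) availability windows overlapping the period."""
    cuts = {period_start, period_end}
    for s, e in windows:
        cuts.add(max(s, period_start))
        cuts.add(min(e, period_end))
    cuts = sorted(cuts)

    area = 0.0
    for left, right in zip(cuts, cuts[1:]):
        count = sum(1 for s, e in windows if s <= left and e >= right)
        area += (right - left) * count        # one rectangle of the Riemann sum

    return area / (period_end - period_start)

# One day (86,400 s): two devices out all day, one out for half the day -> 2.5 on average.
print(average_availability([(0, 86_400), (0, 86_400), (0, 43_200)], 0, 86_400))
```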
Yeah, so hopefully this unpacks how the complexity behind one line of government code turns into a lot of computer code, and how important it is to think hard about robustness here. And one of the things is, none of us could have done this in isolation; we would have all been stuck cleaning and munging this data. Aside from Kagan and myself, the city of Austin and the city of San Diego have both released some pretty cool visualization tools. There's this Dockless Data Explorer that Austin released, and it's on our roadmap to port it for Angelenos. So we're really excited; we're gonna get that out, and we're not gonna have built it. We'll just be able to fork it, adjust the city boundary, and get it up and running in weeks instead of months, and they really do a kick-ass job visualizing the data. We've been working really hard on availability because that's a key component for our council, but this lets us collaborate on an open source stack of geo-processing tools across cities, even though we're only one, maybe two people per city, each of us working part-time across many projects. It really feels like a cohesive project ecosystem has spun up around handling dockless mobility. And I think what's really next is growing this ecosystem to support more and more mobility: car share, TNCs, taxis, and especially AVs as those arrive. Because it's one thing when you're operating a personal vehicle, that's your personal choice, but when companies are releasing huge fleets, they're asking for permission to do a huge disruption on these streets, and it needs to follow the democratic process and rules. And as your local governments change these regulations to adapt to these modes, you'll have an opportunity to go out and give public comment. So I highly encourage you to get involved, both on the technical side and on the policy side. It's a really fun, wonky issue that's just the future of transportation, apparently. I think we've all been a little surprised by how important this has become, but it's been really cool to see how open sourcing government can deliver better results to constituents pretty quickly. If you're interested in learning more about how code works at the city of Los Angeles: we have approximately 2,500 people who code or do computer-related stuff for their day jobs, so there's a lot of us. We have an active presence on GitHub at github.com/CityOfLosAngeles. We have a geodata portal at geohub.lacity.org and a regular data portal at data.lacity.org, each with over 500 data sets, so you can learn a ton about what data we've collected about the city. My team runs a partnership program called the Data Science Federation, where we do projects and partnerships with local universities. If you're at a university, look for the professors who are involved; I think we've got pretty much every local university covered at this point, and there's more on our website. Just to give a sense of the diversity of the work: there is a huge retirement wave coming, and it's really satisfying to go in every day and work on problems that matter for people. You can learn a bit about our stack; we use everything from Python, Jupyter notebooks, R, JavaScript, and Drupal to mainframes. My department is responsible for helicopter radio maintenance, so it is entirely possible, if you wanted, to spend your time making sure the computers in helicopters don't crash so that the helicopters don't crash. It's a really interesting field, and I especially encourage you to come apply; you can just search for government jobs at LA city and find out more. And we also have a robust offering of open source and open data online: github.com/CityofSantaMonica and data.smgov.net. We also have a geo hub like LA; we don't have a nice clean URL for it, so I'll point you to open.smgov.net, which has links to pretty much everything we do in open source and open data. I mentioned we run the Big Blue Bus system, which you've probably heard of; we make all their data available, GTFS and real time. So if you're using Google Maps and see the real-time arrival of a Big Blue Bus, it's because that team publishes the data. So yeah, check us out online. We don't have any job openings right now, I don't think, but we're definitely always willing to collaborate on code and data online. That's about it for us. Thank you. So we'll take questions, and we'll pass around the mic so we can get it on the recording, please. Yeah, one question.
You mentioned that part of the reason you validate the info is so that the laws are upheld; that was roughly what you said. Is there any way you're checking where those scooters are actually being ridden and where they're parked? Because theoretically they're all supposed to be ridden on the street, but barely anybody does that, and a lot of them are going way too fast without any consideration for pedestrians. So is there some kind of penalty for the companies if they don't keep their riders in check, some way to hold them accountable for that? You guys have the headsets, so I don't have to repeat that. So, really great question, totally relevant. Yeah, the sidewalk riding you're referencing is a huge concern for our communities. In terms of where devices are being parked, yes, that is available in the data, because we get an event that says a device was left here by the company or by a user, and that is something we're checking. Hunter mentioned equity zones; in Santa Monica, we don't want all of the devices concentrated downtown just because that's where everybody goes and where we expect a lot of the rides to happen; we want this to be equitable for the entire city. So we do have targets we're looking at for each of the companies. And, Hunter? Yeah, so one of the just blunt things is, there is a limit to how accurate GPS is. It's impossible to tell from a GPS trace whether somebody was on the sidewalk or on the street with any degree of certainty. And one of the things is, we're also not really interested in penalizing individual riders; the choice to ride on the sidewalk is generally made because they feel unsafe riding in the street. So one of the things we're looking into is really expanding our... well, there was a count survey they did in Santa Monica that showed that on streets with better infrastructure, there were far, far lower incidences of sidewalk riding. So we're hoping we can use this data to figure out where we need to change our streetscapes to reduce sidewalk riding as much as possible. Hi, this is great stuff, thanks for doing it. I live in Orange County, actually, and we're on the tail end of this compared to Santa Monica and LA and LA County. And, full disclosure, I've been a contractor for a couple of scooter companies in the past, and it's interesting stuff. So, two questions. One is: when will you have an API endpoint with all this data that developers can connect to and create other cool stuff for taxpayers and consumers? For example, it looks like there's a dozen scooter operators in LA and Santa Monica. Consumers and taxpayers could have a mobile app where they can see all the scooters available from all the companies within a quarter-mile walk of where they are. That would be really interesting and useful, and I look forward to it eagerly, once you have an API endpoint for the data you're collecting. And the second question is: what is the limit on the number of cars in LA or Santa Monica? I get that there's a limit on the number of scooters, and I understand it, and it's partly a political reaction. But when do we have a discussion about how many cars are owned by individuals or leased by companies in LA County, the problems they cause, and how they compete with scooters, say, for parking spaces in the public right of way that could be used by scooters? It's a great question.
To the second part: I think I just saw a headline, was it in the LA Times yesterday or two days ago, about how LA regionally really needs to start thinking about how to charge for driving more. So I think it's a relevant point, and we're not quite there yet with MDS. The data spec so far has been focused on micro-mobility, on shared mobility. However, there are conversations ongoing right now, on GitHub, about how to expand it, and it was always kind of the idea to expand MDS to cover Ubers, Lyfts, taxis, that kind of thing, so we can start to get an idea of how many vehicles are on the streets. To the first part of your question, though. Well, one quick addition first. I can hear the policy person I work with giving this answer; they did not want to spend a Saturday at a Linux conference, shockingly, so you got us instead. But we're really interested in governance, and historically the city has had a strong interest in the governance of fleets. Governance of individual vehicles, like if you want to buy a scooter off of Alibaba or Amazon, is not really the city's prerogative. But if you wanna leave things in the public right of way, like when ambulances came in, when taxis came in, when TNCs came in, the government has always had a much stronger role in regulating and capping that number. You can Google the somewhat sordid history of TNCs and attempts by cities to regulate them, and how we've been somewhat preempted by state law on that one. But on making sure this data is available to consumers: as part of complying with MDS, every one of these companies is making available their GBFS endpoints. That's a public-facing information feed about where these devices are at any given time. Some of the information contained in the MDS endpoints that the city consumes is somewhat sensitive; it's not super personally identifiable, but we don't necessarily want to publish that. In terms of being able to build app integrations, though, we do wanna make sure there's an open marketplace for people to come in and work with GBFS-style data about what's out there in real time, versus the more historic data. A lot of the MDS stuff we talked about was about reporting and analysis, but there's a lot of really great stuff you can do in real time with GBFS, and all of these companies have it. Every company operating in LA, I can tell you, has shown me a working GBFS endpoint as a condition of their permit. We're gonna be issuing permits next week, and that list will be published online. Yeah, so I guess what I mean, rather, is the dozen companies you're pulling in data from, right? I get they all have publicly accessible endpoints, in theory, that you're pulling from, and I could too. But you're both pulling in all the data from all dozen companies; why not, or rather when, will you create one endpoint for each of your cities where developers can connect and pull all that data in one big feed, rather than having to connect to a dozen or more endpoints from each provider? So, this is one where I'm gonna say the thing you should never say as a government employee, which is that our peer cities are ahead of us on this one. Both the cities of Austin and Louisville have used this to release that kind of feed. We are working on it; we are just resource constrained. They permitted three companies; I have 11 to go through and write integrations with. So it just takes time.
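Since every permitted operator already exposes a public GBFS feed, a consumer-side aggregator is possible today even before the cities publish a combined endpoint. Here's a small sketch that merges the free_bike_status feeds from several operators; the provider URLs below are placeholders, not real endpoints.

```python
# Consumer-side sketch of "one feed": poll each operator's public GBFS free_bike_status
# feed and merge the results. The provider URLs are placeholders.
import requests

PROVIDERS = {
    "operator_a": "https://example.com/operator_a/gbfs/en/free_bike_status.json",
    "operator_b": "https://example.com/operator_b/gbfs/en/free_bike_status.json",
}

def all_available_devices():
    devices = []
    for name, url in PROVIDERS.items():
        feed = requests.get(url, timeout=10).json()
        for bike in feed["data"]["bikes"]:            # bike_id, lat, lon per the GBFS spec
            devices.append({"provider": name, "id": bike["bike_id"],
                            "lat": bike["lat"], "lon": bike["lon"]})
    return devices

print(len(all_available_devices()), "devices available across all operators")
```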
But you can actually see all the trips; as I showed with that little dashboard Austin put together, you can go download all of the Austin trips, as well as Louisville's. I think the Louisville data actually ended up in a New York Times story recently. So here's their sort of data, and it's just available. We're gonna make some amount of data available too; we're still figuring out the process, and it will take time, but we're gonna make it available at data.lacity.org at some point. I have a quick question, which I'm gonna use my role as moderator to hop in with right here and totally jump the line. Yeah, I'm cheating, straight-up cheating. My question is: do you track crashes, and do you cross-link those to the hospital reports? Because these are self-reported crashes, right? So you know you're going to have at least a certain amount of injury based on what the hospitals are telling you. Yeah, that's a great question, and one that comes up a lot: the public safety aspect of all of this. I think that's what a lot of our jobs are about, ensuring that these devices are used safely and our streets are traversed safely. So yes, if our police department responds to a call, whether that's a crash or something else that happened involving a scooter, they note that there was a scooter involved. So we're collecting data on injuries from scooters and on the citations we're giving, both to users and to companies. When it comes to merging that with hospital records, it gets a little dicey, and this is kind of pulling up the blankets of government: the police department, believe it or not, does not always share data with the fire department, and the fire department is often the one responsible for transport to the hospital, or even for providing medical care right there at the scene. So linking that data together is a little more difficult than just, oh, we know a crash occurred at this intersection at this time, we should be able to pull up those records. Not quite as easy as that, but it's definitely, yeah. Can't you track rapid, sudden deceleration in the GPS signal? That seems like a better, I don't know. Yeah, I was just curious, are you sharing anything with OpenStreetMap so people can look up where the scooter stands are? Because you said you can find the data feed, you can see where the devices are in the city and who the operators are, but, you know, you know where a taxi stand is at the airport, and I don't know if you know where the rental stations are, so people could look them up as a layer on OpenStreetMap. Part of the issue with that is that the nature of these operations is dockless, right? So we don't have a list of "here are the stations." Now, some cities, I was just talking with Paris, France the other day, are looking at creating specific parking zones, and that's the model they wanna operate under: you're not allowed to park wherever you want, there's a set of zones, and in that model I think they would probably be looking at how much parking is occurring in each. In Santa Monica, we've put out a number of drop zones, where the city has painted squares on the cement. We're asking the companies to figure out ways to incentivize their users to park there, but it's not a requirement, you know. We want this to be as free-floating and as useful as possible. Yeah, and we're about to pilot our drop zones as well; I think we're gonna go stencil a number of them next month. One of the things we're gonna have to figure out, especially on the agency side of the APIs, is how we can express all of these drop zone rules. Whether you're Paris, France, where you can drop here and nowhere else; whether you're us, where downtown you need to use a drop zone but outside of downtown you don't, unless you're more than 0.2 miles from one; or whether you're Santa Monica, where there's an incentive to use the drop zone. These are all different policy responses, and neither of us are policy people and should not be asked to decide these things, but our policy people have come up with different opinions, and all these rules are getting increasingly complex. Publishing them is part of what I hinted at with the piece that's more in development, the agency API. So we're trying to figure out how to make sure that the technical work, whether it's integration with OSM, integration with the companies' apps themselves, or integration with future third-party stuff like Transit or Google Maps, means a constituent doesn't really have to know the laws of whatever city they're in and can just use their mobility device. And that's the beauty of this technical spec.
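Purely as a thought experiment on what publishing those rules might look like, here's a hypothetical encoding of per-jurisdiction drop rules that an app could query. None of this reflects MDS or either city's actual policy; the rule shapes, names, and distances are invented for illustration.

```python
# Hypothetical sketch: express drop-zone rules per jurisdiction so an app can answer
# "can I end my ride here?" The rules and distances below are invented, not real policy.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DropRule:
    jurisdiction: str
    drop_zone_required: bool                       # e.g. downtown: must use a painted zone
    max_distance_to_zone_mi: Optional[float] = None  # e.g. a "within 0.2 mi" style rule

RULES = [
    DropRule("city_a_downtown", drop_zone_required=True),
    DropRule("city_a_elsewhere", drop_zone_required=False, max_distance_to_zone_mi=0.2),
    DropRule("city_b", drop_zone_required=False),  # incentive-only, no hard requirement
]

def can_drop(jurisdiction, distance_to_nearest_zone_mi):
    rule = next(r for r in RULES if r.jurisdiction == jurisdiction)
    if rule.drop_zone_required:
        return distance_to_nearest_zone_mi == 0.0
    if rule.max_distance_to_zone_mi is not None:
        return distance_to_nearest_zone_mi <= rule.max_distance_to_zone_mi
    return True

print(can_drop("city_a_elsewhere", 0.15))   # True under these made-up rules
```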
So, you mentioned, in response to the question about making data accessible to general developers, that some of the data is potentially sensitive, and I'm curious what your privacy policies are. For example, for some of the trip data, do you only store it for a certain amount of time, or do you have policies on how to both protect and limit the storage of data that is potentially identifying? As I'm sure you're aware, there's been a lot of very interesting research about taking this kind of data, trip data in particular, that seems like it might be anonymous at first, and tying it to individuals. So the city of Los Angeles has an information handling guidelines policy that it has promulgated; we have an information security policy, we have a chief cybersecurity officer. It's a fairly robust team of people who work on this, and they tell me what to do, and I trust them to do their jobs, and they do a great job. With regards to re-identification: the administration, the government, has a retention schedule, subject typically to your open records law, which in this state is the California Public Records Act. I can see my government IT peers nodding along, because this dictates how long you have to store emails for, how long you have to store records for. So we will set a retention schedule for this data. I don't know what it's going to be yet, because we only started getting data in September. Typically that retention schedule is about two years, but there are exemptions in certain laws. The other thing to be aware of is that we don't necessarily share this data freely across systems; we've done a lot of work to try and limit our risk while preserving the public safety aspect. Now, obviously, it would be interesting from a planning perspective to get people's names and see all the trips they took, or their zip codes.
We're not getting that, because we immediately said, that's a bad idea, let's avoid it. In terms of preventing re-identification longer term, one piece is being careful when we do an open data release to make sure that re-identification isn't easy; you saw what happened when NYC Taxi did their release. The other piece is that re-identification always requires some other piece of information, so it's making sure there aren't obvious other sources to join against, especially given the shared nature of these devices. A lot of that research is based on devices that are not shared; typically the papers I've seen have reconstructed trips from phone traces or personal automobiles. In the papers I've seen, re-identification is a probability, and I can talk people's ears off about this, but at the end of the day we think the risk is substantially lower because these are shared devices rather than personally operated ones. But we're gonna be very careful about this issue, because, again, we operate in the public's trust. And I'll point again to the Austin work. They did a lot of work; obviously they're releasing trip data, which is the sensitive piece, but they're anonymizing the trips, the location paths, and they've published a lot of the code for how they're doing it. I think we're all trying to figure out, if and when we do release this data publicly, generally publicly, how we do it in the safest way so that we're not risking re-identification or other things like that. Yeah, and we have these conversations a lot, particularly with the companies, and we like to remind them that this isn't a city's first foray into sensitive data, right? We have license plate readers. LA operates the LAX airport. There's all kinds of sensitive data that cities deal with every single day. I have a colleague who operates a system for our police department that holds the list of domestic violence survivors, where they experienced abuse, and where their abuser lives, so that police can respond in an emergency as quickly as possible. That is so much more sensitive that when I went to our cybersecurity officer with this re-identification question, he almost laughed at me, because he's used to holding stuff where, truly, if it got out, people's lives would be at risk. And you want government in the business of solving problems where people's lives are at risk, because we're owned by you; it's a democracy, and we're here to act in the public's trust. So, again, we're treating this seriously, but this is not the first time any of us has seen very sensitive data in our organizations. I was actually curious what your data flow is; how much data are each of you getting per day, and have you looked at formats other than JSON? So the city of LA is about to permit 40,000 or so devices, and we're going to be completing this permitting cycle by March 15th. Each of those typically generates on the order of five-ish events a day, depending on use and on the company and their operating model. We don't keep everything as raw JSON sitting around; we really just pull every hour and grab 12 hours' worth of JSON at a time. Postgres is our stack, and we use Postgres and PostGIS, so we do some ETL.
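For a sense of what that hourly pull-and-load step could look like, here's a hedged sketch: request a window of data from a provider endpoint, archive the raw JSON, and load rows into Postgres. The endpoint URL, credential, and table layout are placeholders rather than either city's actual integration.

```python
# Hedged sketch of an hourly MDS-style pull: fetch a window of trips, archive the raw
# JSON, and ETL rows into Postgres. URL, token, and table layout are placeholders.
import json
import requests
import psycopg2

def pull_and_load(start_time, end_time):
    resp = requests.get(
        "https://provider.example.com/trips",             # placeholder provider endpoint
        params={"start_time": start_time, "end_time": end_time},
        headers={"Authorization": "Bearer <token>"},       # placeholder credential
        timeout=60,
    )
    resp.raise_for_status()
    payload = resp.json()

    # Keep a time-stamped raw archive (assumes a local raw/ directory exists).
    with open("raw/trips_{}.json".format(start_time), "w") as f:
        json.dump(payload, f)

    conn = psycopg2.connect("dbname=mobility")
    with conn, conn.cursor() as cur:
        for trip in payload["data"]["trips"]:
            cur.execute(
                "INSERT INTO trips (provider_id, trip_id, raw) VALUES (%s, %s, %s)",
                (trip["provider_id"], trip["trip_id"], json.dumps(trip)),
            )
    conn.close()
```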
JSON is very good for the interchange aspect, but for long-term storage we're not just keeping infinite blobs of JSON around. Actually, I think we do keep the raw data for historical purposes, because just throwing it on S3 is cheap, and certain companies have changed their data after the fact, so it's very useful to keep time-stamped records of what they sent you. Always saving things is cheap. But yeah, our stack is pretty much: hit the API every hour, get 12 hours' worth of data, ETL it into Postgres, and then everything runs on top of the ETL'd version of the data. And our stack is very similar. I was just looking at our complete data set, so I have some numbers in my head. We've been collecting data since, let's say, the beginning of November, so from November 1st through the end of February. Santa Monica has four companies, each permitted around 750 devices. For that time period, for that number of companies and devices, we have approximately 700,000 trip records and approximately 4 million or so status change records. Four months, four million records, so about a million records per month across all four companies. So it's not a terrible amount of data. One thing that's challenging, and this is where I've run up against my knowledge of Postgres and PostGIS and had to do a lot of research, is the trip paths: each point is its own data point with a time and a location and all that. So when you're unrolling a trip path and trying to do something like classify all the points, which ones are downtown and which ones aren't, that becomes a real processing task. If I remember the numbers correctly, once we unroll all those trip points, it's about 40 gigs of just trip point data, aside from everything else. So it's a considerable amount, but not unwieldy yet. There's also, and I kind of referenced this in the architecture, the scaling concern: we got concerns from companies when we started this that it would not be able to scale. And we explained to them, well, we have this nice sharding strategy: there are 88 jurisdictions in LA County and 58 counties in the state, so there's a very simple way to shard this into pieces where the data size is manageable, which is that nobody has the global picture except the companies themselves. And you know what, they can hire engineers who make more money than me to solve that problem. Yeah, since you don't have any fixed drop zones, the question I have is: have you come across any lawsuit that's been filed on account of a disabled person having an accident with one of these lying around? And if that happens, who is liable? Yeah, I'm not a lawyer. Somebody will file one, we'll get precedent; it will happen, I think. Any other questions? All right, let's give another round of applause for them. And the next presentation is in 34 minutes. Test, test, test, okay. Test, test. Hello, hello, can you all hear me? Looks like it; I see some heads popping up. We'll give it a few more minutes in case people are running a little late. It's 4:30 right now, but rather than start right on the dot, I see some people trickling in, so we'll give it till 4:32. I feel like two minutes is a good, you know, buffer. Say it again? They'll get the rerun on YouTube. They could, they could, you know. But some people like to be there for the live event. I work at Ticketmaster, so I think about these things.
What, are there Ticketmaster people at SCALE? Never heard of them. What do they do? They're here? What? So, yeah, I'll just get started. My name is Mark Rodin. I'm the moderator for the open data and open government tracks, and I've been organizing the talks you've been seeing Thursday, today, and tomorrow. This talk came out of a very lengthy conversation that Dean and Kenneth and I, and Aaron in the back as well, had about open voting in Los Angeles. What happened was, about six months ago, the LA Times published a very quick snippet of an article, about three paragraphs long, that said Los Angeles County is going to be making their voting system open source. And that was about it; that was all it said. And I said, well, that's really interesting. One of the kids my son goes to school with, his mom works for the LA Times, so I reached out to her and asked, who do I talk to? She said, hold on a second, and like two weeks later she said, I finally tracked it down, you're gonna talk to this guy, Dean Logan, here's his email address, go to town. So I called up Dean and said, hey Dean, I hear you guys are gonna be open sourcing your voting software. And he said, yeah, have you heard we have an election in November? I'll talk to you after that. Which was fair; I had heard there was an election too, so I was pretty good with waiting. After the election happened, I reached out again, and they were like, what just happened? We just got blown up. But then they met with me, along with Chris Smith, who is actually chairing the developer track here as well. The two of us sat down, and our first question was, okay, so you wanna open source the software, why don't you just throw a tarball up on GitHub? And they looked at us like we did not understand anything about the world. And they said, well, what happens if there's a bug found? What happens if somebody tries to steal the software? What happens if somebody expects us to support it in the state of Oklahoma, and we're not Oklahoma, we're a county, and there are all sorts of laws and we don't have any support structure for that? And I was like, oh, this is all actually really complicated and really hard, and I had no idea just how complicated and hard the situation was actually going to be. But one interesting confluence was the way they described how they wanted their system to work. Chris and I sat there and said, so what you're trying to do is make sure that one person goes to an event, gets in once, experiences what happens at that event, and then leaves, and nobody else gets to do it a second time in their place. And I said, that sounds a lot like ticketing. And they said, oh yeah, I guess that is a lot like ticketing. And that sort of formed the foundation for us to be able to talk about a lot of things, because then we could start to understand the relationship that people would hopefully have with this new system that these guys have helped devise and develop over the course of the last little while.
So we should actually back up a little. We wanted to do this as a Q&A type of session, where I ask these two guys questions and hopefully shed some light on this whole process, and on bringing this into the open source community to see where it can go. That doesn't preclude you from asking questions; the only thing is that we only have two microphones, the one I'm wearing and the one they're gonna be swapping between the two of them. So while we're recording, if you have a question and it's not totally interrupting the flow of everything, just raise your hand and we can take the mic out to you, and then I gotta bring it back. It's a bit of a back and forth, but that's what we've got. So I think we should really start with asking, and I don't know which of you wants to handle this, why did you feel that Los Angeles County needed a new voting system? I'll take that one. So thanks, Mark, for having us here, and I did not recall that full story, so that was a great introduction. I'm Dean Logan. I'm the Registrar-Recorder/County Clerk for Los Angeles County, and I can set the context for the project we're talking about. Kenneth is our project director and a senior member of our IT staff in the department, so when it comes to the part of this I'm guessing most of you are interested in, the programming, Kenneth can take most of those questions. I also want to quickly point out Steve Vo, who is here from Digital Foundry; Digital Foundry is our partner for the development of the tally system component of our new system. And Aaron Navarra, in the back, is our government and legislative affairs manager in the office and a key part of the project as well. So, a little bit of context. The nexus, or the genesis, of this project really goes all the way back to the 2000 presidential election, the infamous Bush v. Gore election, where we as a nation saw the results of a razor-thin outcome in the presidential contest, and we began to see very vividly the warts on the process of how we conduct elections, in particular how ballots are counted and votes are reported across the country. At that time, for those of you who were around and watching, the focus was really on Florida, because that's where the ultimate outcome was determined, again not just razor thin nationwide but razor thin in the state of Florida. Much of the blame, or the diagnosis, after that election was aimed at the use of punch card technology as the means of voting in Florida and the flaws of that system: the infamous dimpled or pregnant chads, and whether, operationally, those cards had been checked, whether votes were dislodged or didn't get counted. So what followed, and I'll try to make this quick to keep it in the context of what we're doing here in Los Angeles County, was a historic change in policy in this country, where Congress for the first time in the nation's history decided that the federal government would make an investment in the infrastructure for elections. The focus, again, was on replacing punch card voting systems, because that was the primary method of voting used throughout the country, not the only one, but the primary one, and it was seen as the problematic piece in the 2000 election.
So Congress appropriated a whole lot of money that was distributed to the states in 2004 under a piece of legislation called the Help America Vote Act. There were constraints on that money, including things like making sure that every polling place had at least one piece of voting equipment that was accessible to voters with disabilities, so they could vote independently and without the assistance of another individual if they desired. There were a number of requirements on the policy side for fail-safe methods, so that if you show up at a polling place and you're not listed in the roster, you aren't turned away and there's an opportunity for you to cast a ballot, a number of those types of things. Voter registration was also moved from the local level to the statewide level; states were required to adopt a statewide voter registration database. But staying focused on the voting technology, the actual voting equipment: what followed is that most of the country quickly moved to hastily engineered direct recording electronic voting equipment, paperless touch-screen machines. Then fast forward to 2004 and the next presidential election. All this money had been spent, all this movement to these touch-screen paperless voting systems had happened, and lo and behold, there were still issues on election day in 2004. It wasn't as close as Florida in 2000, but there was certainly doubt about the viability and reliability of the touch-screen voting systems and the lack of a paper trail. To bring that back locally, what happened here, and some of this predates me, is that Los Angeles is a very unique jurisdiction from an election standpoint. We have 5.2 million registered voters, which means that as a local government we have more voters in this county than 42 of the states have statewide. So we operate in a marketplace for voting systems that really doesn't build anything for the capacity of a county like Los Angeles, because the customer base is much smaller counties. So when we did the assessment after that new federal law passed, and after California passed a voter initiative to replace punch card voting systems, there was no system available on the market that could meet the needs of Los Angeles County. So we retrofitted our punch card system into what, if you're a voter here in Los Angeles County, you may know as the InkaVote system. We basically switched the IBM punch card system to an optical scan system where you fill in the bubbles with ink, and we run the cards through retrofitted card readers that use infrared to read the mark on the ballot instead of the old chad. That was always intended to be a temporary solution while we waited for the market and the regulatory environment to expand and provide something that would be the new, modern voting system for Los Angeles County. But as I said, nationwide, those new systems kind of flopped when they hit prime time. And here in California, the Secretary of State actually decertified the new systems that that federal money had been spent on, and counties reverted to their old systems. So we sort of dodged a bullet here in Los Angeles County. On one hand, it was unfortunate there wasn't a market solution for us; on the other hand, we didn't spend all that money, because there wasn't anything that would work for us.
We waited a few years for a solution to present itself and for the regulatory environment to open up to more modern approaches to this problem. That never happened. So shortly after the 2008 election, we decided we couldn't wait any longer: if we waited much longer, we risked ending up like other local elections jurisdictions in the country that found themselves unable to conduct an election, because we couldn't get the parts for our voting equipment and we couldn't support the software changes that would be needed for new legislation. So we launched the Voting Systems Assessment Project, which has morphed into the system we're talking about today. We went out on a leap of faith and said, if we're gonna do this on our own, as a grassroots effort, then we want to free ourselves of the regulatory constraints, the fiscal constraints, and even, largely, the technical constraints, and start from the perspective of the voter. If we were gonna build a voting system and a voting experience that met the needs of the 5.2 million voters in Los Angeles County, what would that look like? So we spent a lot of time talking to voters and prospective voters about what they liked about the current voting experience and what they didn't like about it. We talked to people who are registered to vote but don't vote, and asked: what is it about the process, not the politics, but the process, that you see as a barrier to participation? And that led to our introduction to a company called IDEO in San Francisco. They're an international human-centered design company, and we partnered with them to go on this journey of designing a voting experience, not just a voting system. We purposely didn't treat this as solely a technology or systems project; we looked at it as a voting experience project. There were a few things, pertinent to today's discussion, that were really important to us about that. One, we wanted to do this in a way that whatever work we did here could be leveraged by other jurisdictions, so other states and counties could move in this direction as well. We wanted to expand the market, because the market was too small for our needs, and it really was, and still is, limited to a very small number of proprietary private-sector firms that provide the voting equipment used throughout the country. We wanted LA County to have a voting system that was publicly owned and publicly operated. We're large enough that we have our own IT resources to support a system in-house, and we believed, and still believe, that voting is fundamental to our participative form of government, and that if there was any system that ought to belong to the public and be owned by the public, it ought to be the voting system. And in the course of doing that design and coming up with this new voting experience, we also wanted to do it in a way that it could be shared, and shared in a manner that didn't require significant amounts of money from other jurisdictions. We've made a big investment in this project, an investment we believe wasn't made when the country moved from punch cards to touchscreen technology, and we want to be sure our investment can be leveraged beyond our borders. So where we are today is that we are on the eve of the launch of this new voting experience.
We will go live in the March 2020 presidential primary election with our new voting experience. We launched the new vote-by-mail portion and the new tally portion of the system last November, and that went extremely well. And beyond the technology, the whole experience will be different: we're moving to vote centers, we're moving to multiple days of voting, and we're separating the tally portion from the equipment that marks the ballot. You'll see some of that as we go forward in the presentation. The user interface on our voting equipment and our tally system have been built on open source platforms, open source stacks. One of the challenges, and Mark alluded to this, and I'm sure Kenneth will talk more about it, is how we now move a regulatory environment that has always looked at voting systems as end-to-end systems with proprietary software. How do we create a governance model that allows for that kind of sharing and for contributions from communities like this one, to enhance it, improve it, and develop it as we have new legislation and new needs, but do that in a way that also meets the security requirements of conducting elections? As I think everybody here knows, every day since the 2016 election we've been talking about the security of voting systems and the nation-state actors that want to hack into them, and that type of thing. So there are a number of layers involved. We need to get it right for LA County first; we're committed to doing that. We don't want to become a voting system vendor ourselves. We want to put it out to the community and to the industry and have it grow and evolve, but we have to meet the needs of our community first. So there are certainly people in the open source community who have been frustrated with the project, because they support the open source development model but they've been frustrated with the inability to interact with it and have it shared. We want to get there, but we need a governance structure to do it. That's what led to the conversation with Mark and others; we've established an open technology working group that we hope will help us do that, and we're in conversations with the regulators, the Secretary of State's office, and other people. So that's probably more context than... You gave me a list of questions to ask you, and you just went through like ten of them, so it's cool. I will forward a question from a gentleman in the audience, which I think you're going to touch on, but I want to make this point, because I think it's top of mind for a lot of us here: trust. With InkaVote I have tactile feedback; I can see that my vote is there. Now we're moving to a touch screen, so how do I know that my touch is actually going to be validated and registered, and that every 20th vote isn't getting flipped by a nefarious hacker instead of every third? It's really hard to tell. So I know we'll go through the list, but I think trust is the thing we should definitely make sure to talk about. Do you want to talk about that now? Because the next question asks why VSAP has to be an open technology, and maybe trust can be part of that discussion. Yeah, I'll try to hit both of them.
I wanted to go back to the two points Dean talked about: wanting to share, and the security. When we launched the program, we established an advisory committee and a technical advisory committee, and we established a set of principles that would guide the development of the solution and guide the program more generally to a successful completion. Of those 14 principles, one was security and another was transparency. Others were accessibility, cost, and so on; general principles we were trying to meet. What we found, though, as we went forward, is that some of these things are in tension: sometimes trying to make a system more accessible, for example, runs up against the desire to make it more secure, and throughout this project we've been trying to balance them. But the transparency principle of the program was one of the key drivers for selecting an open source stack for the entire solution. All of the components are on an open source stack, even open source programming languages, everything, so that we knew that, in addition to being publicly owned and something we actually had the right to share, the software it was built on was something that could be shared and could be seen. That was one of the driving reasons we decided to go with an open source technology stack. Now, what I think we're realizing is that it's a lot more complicated than just making it open source. When our tally solution was certified this last November, right away we got requests: where is this software, can I find it on the internet? It surprised us a little that people would ask so quickly. But we have to take other factors into consideration: the solution has to be tested, it has to meet regulatory requirements, and some of those requirements are actually pretty outdated, but we still have to meet them. So the idea of just letting the solution out into the world to evolve, while we recognize that's a good thing, there needs to be a way to ensure that however it evolves, it stays in compliance with the regulatory environment. And that's a problem we haven't really solved yet; it's one we're trying to solve through this open technology working group. But I wanted to get to the question about security and trust. One of the other main principles of the solution was that there would be a paper ballot: that the voter's selections in an election would be recorded on a piece of paper, in an indelible way that can't be changed. That speaks to the security piece of it. At the same time, we know that technology provides great opportunities for voters with a wide variety of disabilities to participate in the voting experience. So the goal was to find a balance between the technology and making sure we rely on a paper ballot that we know is secure and auditable. So the ballot marking device, which I think you see in some slides up here, is for the in-person voting experience. What it does is allow you to insert your ballot into the device, vote electronically with all the assistive technology features of an electronic voting system, and then it prints that ballot back out.
It's verifiable: you can read it, you can put it back into that machine, you can actually take it to another machine and it'll read it. That ballot is the official record, the legal record, of the voter's vote, and it's the printed text on that ballot that is official. There is printed text, and there is a QR code that makes the ballot machine-readable, but the printed text is the official record. There's a linkage between what's printed on the ballot and the QR code: there's a code next to each candidate that is selected, and that same code is inside the QR code. And as you all know, QR code is an open format that can be read by anybody's cell phone; you can read it yourself. The vote-by-mail ballot, of course, is also a paper ballot, where you still vote traditionally by filling in the bubble. So that's the key security aspect of the system: it's grounded in a paper ballot that is auditable. Let me just add one other key distinction from the paperless touchscreen voting systems I talked about before: this device is a ballot marking device. The device does not retain any of the voter's data, and the tally system is entirely independent of the ballot marking device. The purpose of, and the security around, the ballot marking device are about usability, security, and accessibility. It solves a lot of problems. It eliminates the ability to make ambiguous marks on your ballot; when people have paper and pencil, they do very interesting things with ballots, and that becomes an issue in recounts and in razor-thin outcomes, so this solves that issue. It also, in a jurisdiction like LA County, where we have to provide our voting materials in 12 or 14 languages in addition to English, gives us the ability to present that in a way that's not just compliant with the requirements for language accessibility but actually functionally usable for people who need those services. For instance, we learned in our user testing that there are voters who function just fine in English for the contests, the names of the candidates and the contest labels, but for the initiatives and ballot measures, which are very complex and very legalese, they prefer to hear those in their native language. This device allows you to toggle between the two. So again, the security and usability of that device are focused on the voting experience. The tally system is entirely independent; those paper ballots come back and are tallied centrally at our facility, and we're able to apply a different layer of security to the tally system because it functions differently. The tally system isn't about usability for the voter; it's about the accuracy of collecting that data and accurately reporting it. So there's a system-independence factor there that's critically important. When you look at it, it may look very much like the touchscreen voting systems that came out after 2000, but the functionality is entirely different. So again, we're now at like question 12. So. You don't have to follow up. Well, if we're not gonna stick to these questions, I'm gonna ask the question that, since we're all fairly technical people in this room, you have touched on some technical concerns, like separation of concerns.
You talked about the tally system, and I know you don't have it in your slides, but could you very briefly, maybe using air quotes or something like that, talk about the different components of the system and how they link together, so that as a technical audience we can get a sense of, well, the tally system does this and the ballot system does that? Or if you have a presentation like that, I can connect it up; your call. Well, first of all, lavote.net is the registrar's website, and there is a VSAP section of that website that has a lot of information about all the components. Up here we have some brochures; we have one that describes the overall solution and its different components, and we also have a one-pager on the system security. But there are basically five components to the VSAP solution. The first is a ballot layout solution. It takes information from our voter registration database and from our candidate filing system, compiles all of it, and lays out the contests on the ballot by their appropriate ballot style, in all the mandated languages we have to support. It also generates several files that are then distributed throughout the rest of the system to configure the different components for the purpose each of them serves. The ballot marking device, which we talked about a little, is managed through another device, appropriately called the BMD manager, and that allows us to configure the devices for an election and do it very rapidly, because to deploy the solution across approximately 1,000 vote centers in LA County in March of 2020, we're going to have to program, test, and secure about 31,000 devices in the space of about seven to ten days. So we have a pretty sophisticated network that we're building. It's an isolated network, it's air gapped, for those of you who are concerned about that; it's not on the internet, it sits there by itself, but it allows us to very quickly push election configurations and run diagnostic testing on all of these devices over a short period of time. We have an interactive sample ballot, which is essentially a website that allows voters to look up their ballot on the internet, see the contests on their ballot, and mark that sample ballot the same way you would mark the paper sample ballot you received in the mail. You can mark it, say these are my selections, and change it. And for certain voters, military and overseas voters and voters with disabilities, they can also use this solution to mark a facsimile of the ballot and send it back in to us, so they can leverage their assistive technology at home and send us back a facsimile that then gets remade in our offices. And then there's the tally solution, as we talked about earlier. That's a central system. We're using approximately 20 IBML scanners; I don't know if you're familiar with them, but they're really the fastest scanners there are on the market, used by the IRS and banks and places like that where they have to process high volumes of paper in a very short period of time.
Those scanners will scan the ballot to a digital image, and then the tally system does image processing on those images to either detect the mark on a vote-by-mail ballot or to decode the QR code on the ballot-marking-device ballot, and then does the tabulation of those and the reporting. One thing I wanted to add about the tally system, which I think was really important to us: the system that we have today is very old and we've been barely keeping it going. There's a lot of constraints that we've been doing workarounds for to keep this system going, but it's an old system written in C and COBOL and assembly, super fast, super lean, and we use IBM 312 Hollerith card readers, the old card readers that used to be your primary data input. Those are still today what we use for our vote-at-poll ballots. This is what we're gonna be getting rid of, right? That system unfortunately was definitely not programmed or designed for any kind of usability or transparency. It's a hard system to understand. People wanna see it all the time, they wanna know how it works, and we show them this blue screen. It does work, though. Oh, it works very well, works very well, and it's very fast, but it's really hard to explain to people what it's doing. So in the design of the new tally system, we really designed it so that you could actually see how a ballot goes through its life cycle in the tally system. We actually have an interface so that when we're doing auditing, we can trace the record in the system that we use for tabulation: we can trace it back to the marks that were detected, or the votes that were detected on the ballot, back to the digital image itself, and then actually back to the paper, because these IBML scanners allow us, as ballots are going through, to apply an ID to the ballot. So we can actually trace an individual ballot from the paper all the way to the end, and we can see how the system treated that ballot and how it applied business rules. In the case of a vote-by-mail ballot, you can have overvotes: voters can actually vote for two candidates in a vote-for-one contest. Well, we have to apply business rules to ensure that an overvoted contest can't be counted, so we have to turn that from a vote for two candidates into an overvote, and actually count those overvotes, and you can see that in the system. It's very usable. We actually did apply some usability principles to it, because we want auditors, and anybody else that we have to explain how the system works to, to be able to follow it very clearly and find it easy to understand, because knowing that the system is accurate is of paramount importance, and being able to demonstrate that is the most important thing. So, several people in the audience have asked me several pointed questions, and we'll get to those, and some of them should be asked by the people who asked them so that I don't play the game of telephone and screw things up. I think one question that we can go to, this is one of the last two questions they have on their list, before we talk about Jaeger and how they're implementing it now, right? OpenTracing joke? No, okay. What are the next steps on opening this technology, and how can this community help? I'll answer from the kind of organizational standpoint, and then, Kenneth, if you wanna talk about it technically. I think, as I said, we obviously have a timeline.
We have to have a voting system up and running for the March 2020 presidential primary election. And from the inception, for us this was about replacing the voting system and modernizing the voting experience for the voters in Los Angeles County. We've gotta get that done first. And it's gotta work and it's gotta be reliable, and we're confident that it will be. Then we need the governance structure: where's the appropriate place, where's the library to put this code in? How do we govern review and acceptance of modifications, and how do we govern the licensing of that to other jurisdictions? And keep in mind, for us there are multiple sets of code. The beautiful thing about this system is that it's not an end-to-end system. So ideally, somebody could get licensed to take the tally system and use that on totally different hardware and with a totally different ballot layout system than what we use in LA County. Same thing with the user interface that operates on the ballot marking device: that can be used on different hardware. It's not exclusive to the hardware that we've had designed for Los Angeles County. So we need to figure out how to put that in a governance structure that allows for it to remain certified and secure, but also allows for other jurisdictions to access it and adopt it, and not be limited by the limited commercial for-profit markets that we're in. And so, again, we've established this open technology work group. Marcus is gonna be a part of that, and I think about a dozen people that we're gonna pull together. We're gonna do that work now, in the background, while we're getting ready to run the March election. But we also think it's important that we do run the March election and have success with that, so that people will want to adopt it and use it. And it's gonna be, I believe, easier to do that once they see it function than just having it out there and people wondering if it's gonna work. Yeah, I don't have anything to add. There's a gentleman in the audience who actually has a very pointed question, and I will let him ask it, so. Well, hi. So I am a member of San Francisco's open source voting technical advisory committee, and everything that we do is public: osvtac.github.io, so you can see what we've been up to. The elections commission itself has now put out contracts to get the work done, and of course we discovered very quickly that the Secretary of State's press release about there being an open source voting system was inaccurate. I think the communication between you guys and the Secretary of State was a bit off. But we've put out a set of recommendations about what the system should be, and I think that probably you'd find a very high overlap, to Dean's point. We really believe strongly that you have to use an open source, OSI-approved license that complies with the open source definition, not one of these things with funny clauses added to them. And so we're watching what you're doing and we're encouraging it, because you guys are a bit ahead of San Francisco, but I think there's probably some room for sharing ideas between the two cities, in keeping with the usual love between San Francisco and LA. So thank you for that, but I mean, let me just say first that we have been watching what San Francisco has been doing for quite some time, and we welcome opportunities to collaborate. We have had contact with various members of the team in San Francisco.
I think San Francisco took a slightly different approach than we did, in that San Francisco started with wanting to build an open source voting system. We came to wanting to replace a voting system, and came to a policy and philosophical decision that open source was an appropriate way to do that. You're right, we learned quickly, and maybe not everybody here would agree, but I learned quickly that there are lots of different ways that people define the term open source, and for something that's open, the definition of it often feels proprietary, because everybody wants to constrain it by their definition of what open source is. So we're much more careful about that. But as we've said, we have used open source stacks for the development. It has always been our intent to share, and hopefully that will be beneficial to San Francisco. San Francisco has the ability to retain its existing voting system until they have the open source product. We don't have that ability; we're at the end of the life cycle of our current voting system, so we have to have something up and running for March of 2020. So I don't think the projects are at odds with each other at all. I think that what we've done can be beneficial to San Francisco, and once it's available and open for review, I think the expertise of the team in San Francisco will benefit our project as well. I will concede to you on that, because otherwise we will be here all day. I understand. And I would say, and I think I said this in the beginning, I think for LA County, on our project, the top line, what was most important to us, was public ownership. And so the tally system that we got certified by the Secretary of State is in fact the first publicly owned voting system, not just in California but I believe in the nation, and we believe that that is a model that can and should be replicated. And I think the other thing, I'll pass out the mic, but I think this is something that's worth mentioning, is that the people in this room are going to be technologically capable of assisting in this project. And that's one of the reasons we wanted to have this presentation here at scale. The reality is that LA County is LA County, right? And there can certainly be collaboration with San Francisco, but the goal would be that we would have an open election system that is owned by all people and that all people can participate in. But in order for that to work, we need to have a viable open source community, right? And one of the things that I am very concerned about is, I run the fraud team at Ticketmaster. My day-to-day job is dealing with people who are trying to steal your money; that is literally what I deal with all the time. And my concern would be that if we open source this, or they open source this, because all I am at this point is helping, if it goes open source, then companies who shall remain nameless, but who have a very clear financial benefit from polluting that well, will do so immediately, unless there is a very strong body that is maintaining it and making sure that the code is being kept in an appropriately sanctified way. And at the same time, there are also state actors who are very willing and able to pollute our elections, as we know, right? We see state actors at Ticketmaster trying to make passes at us. The number of times I see people in Vietnam and Russia trying to steal people's tickets is ridiculous. The same thing will happen here. The same thing will happen.
We will still have people who are trying to influence our political process. So we have to be very careful that we don't just sort of, and this is the point that was made to me, you can't just toss this stuff on GitHub and call it a day and hope that somebody's gonna come along. Like, if you guys heard Mitchell's talk this morning from HashiCorp, right? Dude, were people there? Yes? No? Right? The way that he did that was excellent and it was amazing, and it was a new product. And I gotta tell you, the first time I ran Vagrant up, it was a magic moment for me. But that's not the way that this can be done here. We already have a system, and we can't be sitting there saying, oh, 500 people downloaded it, and then, do we do VC, do we do this or that? It's a different process. It has to be owned by people. Let me get, enough of me pontificating, let me hand out some mic. Two quickies. On the tally program, you mentioned it reads the QR codes, which aren't human readable. So how do you make sure that that says the same thing? You're not gonna scan the characters. And the second question, before I turn it back, is, can you detect whether somebody puts some ballots through the scanner twice? Second question: yes, we can detect that. To the first one: the code that we count in the QR code is printed on the paper ballot. So there's an internal integrity to that ballot. That marks that as the official vote of record, that code, right? We're actually using a base-30 numbering system to keep the number small, but that code is what is official. But we don't use optical character recognition; our research into it says it's not accurate enough, it's just not accurate enough, and especially at speed. And as much as I think we'd all like to go there, the system could eventually evolve to just go straight OCR, when and if it ever gets to the point where you can get to six sigma levels of accuracy, then we will do that. But we can't do that today. The beauty of the QR code is that it is machine readable, it is an open format, and it's also signed, by the way, for those of you who want to know: it will be digitally signed. And there's redundancy in it. If you can't read the QR code, you just don't read it; if you read it, you read it successfully. And that is actually far and away better than what you get from a vote-by-mail ballot, right? Where you're trying to detect a vote mark that could take any number of shapes or forms, depending, as Dean mentioned, on what the creative voter wants to do on that ballot. And that's actually an extremely difficult problem to solve. The QR code is actually very efficient and it's very accurate. I just want to say there's a lot of people who have a lot of questions, so I'm going to do my best to get through this. We have 15 minutes left. You guys are welcome to stay for like the 30 minutes after, and then there's a talk about PyTorch, which is a very different thing, so. I have two questions. It seems like most of this talk is focused on the in-person voting system, but first, the registration system, and second, the security of the tally, the vote counting. In the 2016 primary, there were a number of people who were registered to vote who, when they showed up at the polling place, were told they were not registered to vote. And it happened in certain, mostly urban areas.
The second question, so, I'm curious about whether this system is related to the registration systems. The second question I'm asking is, for example, we had a recent California Assembly delegate election in which the vote totals that were posted by the party three weeks after the manual counting differed in some cases by 35% from the votes that were taken on the day of. How does your system address that? And also, is there gonna be chain of custody? This is three questions, sorry. Chain of custody, what's the chain of custody for, okay, the tallying, the tallying question. I'll try to run down those really quickly and then we can follow up with you later. The issue with the registration system in June actually applied throughout the county, and it was a compatibility issue between our election management system here locally and the new statewide voter registration database. So it wasn't that people weren't registered to vote, it was that their names just weren't printed on the roster. Their registration never dropped out of the system; it affected the printing. It was chaotic, and we had the fail-safe method of provisional ballots, all those ballots were counted, and it was an unfortunate situation. The new model, not the technology, but the new model of vote centers and multiple-day voting, actually will provide a lot better access and ability to navigate a situation like that, should it happen again. But this particular system is independent of the voter registration system; that's a statewide system that's owned and operated by the Secretary of State. The tally component, oh, this is the tally component too. So, just so it's clear from the question, the party delegate election that he's talking about was not run by the government, that was run by the party. This system will be subject to audits, required audits, before the election is certified, and those audits will be done actually looking at the human-readable ballots. It's also subject to recount, and again, the paper ballot and the human-readable printed output on the paper ballot is the official legal record. Part of the certification process, when we take the tally system through certification with the Secretary of State, is doing volume testing and checking the accuracy of it. So a 35% variance between the outcome of a manual count and the electronic count would not be acceptable and would not be certified in this case. The audits are required, they're legally required, we do them today, well, they were done in Los Angeles County, and there's a clear record of that. So there are several questions over here and several questions over there. I'm gonna go in the order in which I saw hands going up. Hi, thanks so much, this is great stuff. I live in Orange County. Last week, I believe, the county board of supervisors approved moving to vote centers. It's great, I think. I'm concerned about, well, I'm concerned about closing precinct voting for the general election, but we'll see how it works out. So as an Orange County taxpayer and resident, I am very eager, and I think many other taxpayers in other jurisdictions around the country will be very eager, for you to please release the code you have been developing for years. There's this thing called GitHub, it's really handy. Security through obscurity is not.
So if you want a way to actually audit your code and test it, and have folks in San Francisco and Orange County and elsewhere help you figure out the bugs in it and protect it from potential malicious state actors who may already be in your system anyway, as you know, unfortunately: put it out there and publish it. I'm not a lawyer, but I don't know of anything that would stop the county from publishing the code you have, that you spent taxpayer money developing. Thank you so much for doing that, it's great stuff. So please put it out there. I would love Orange County to be able to see it and use it; the machines you're talking about and the ones Neal Kelley is talking about ordering are probably totally different, and we would probably save a bunch of money using your recommended system. So as fast as you can get it out there, many of us would love to jump in and tell you any holes we find and stuff we can fix. The state, the Secretary of State, I'm sure there are great people and very competent technical folks there, but you can't beat the competency of a worldwide community of hackers who want to have a great, secure voting system, and not only in the United States. This would be very revolutionary. It would save taxpayers in the US alone potentially a whole bunch of money. So please publish it. Put it out on GitHub, like, Monday. Well, yeah, I'll let you guys speak to this, but I have some thoughts, I think. I think we've already spoken to that, so we can. Yeah, I think in the long run it's something that we aspire to. It's just that, first of all, our board of supervisors has to give us the authority and the direction to do that. Also, the California Secretary of State is the certifying authority. So I know in principle it seems very simple, and it is, but there's a lot more to it. There's concerns around just optics, right? If somebody does get in, they look at it and they call the LA Times or the New York Times or whatever and say, this code is terrible, right? Without any vetting, without any sort of ability to respond to that, you create an optics situation which is really, really hard to control. That's just another dimension of it, but we want to get there. We want to get there, we're just trying to figure out how to do it. I don't expect voters to understand all of this. My question has to do with the move to more centralized voting centers. Currently, vote reporting happens at a precinct level. What granularity can we expect to see in terms of vote reporting with the move to the more centralized centers? I think that was like six people's question, so great question. So the viability of precincts as geographic and reporting units will remain the same. Vote results will still be available by community, by precinct, and all that. The difference is voters won't be constrained by having to go to a single voting location. Voters can go to any vote center and vote, but their vote will still be recorded associated with their precinct. Just to keep things moving, I have this one here, here, here. Okay. Yes. Yeah, so my question is just a quick follow-up on that, because that was going to be my question, but what is a precinct then? Maybe I don't know, maybe it's just the same as it is now, but maybe we can just talk a little bit about that. Yeah, there's a legal definition for how precinct lines have to be drawn. They're constrained by the number of registered voters.
They're constrained by contiguous geography and that type of thing. And again, it's for the purpose of being able to show that maybe the outcome was that Ken got 60% of the vote countywide, but in my neighborhood I got 70% of the vote, and we need to be able to show that. I was involved with the elections in the Florida fiasco in 2000, and was basically working in partnership with lots of vendors until the 2008 or 2009 election. So my first question is regarding your certification by the SOS, the Secretary of State: is it not necessary to get it certified to the NIST standards now? So, right, the certification of voting equipment is handled a couple of different ways. Nationally, there is a set of voluntary voting system guidelines that are adopted by the US Election Assistance Commission with the assistance of NIST, and there are independent testing labs that test to those standards. However, there's no federal mandate to use those standards. Those standards have been outdated, because for a long time the US Election Assistance Commission was not functioning, because the president had not appointed members to it, so they never adopted the newer standards. That's finally changed, but California passed a law that bifurcated us from those federal standards, and California adopted the new federal standards before the federal government did. So we actually have a higher standard in California, and it keeps us independent, so that if we get into that political situation again where we're mired in partisanship at the federal level, California can still move forward. So our standards in California, the base standard, are the latest version of the federal standards that NIST was involved in, and Kenneth actually served on the working groups that created those standards. They're under review at the federal level now, but they're adopted in California, and the certification process requires that our system be tested by an independent testing lab to those standards before the Secretary of State certifies it. So in 2002, I think the 28th of October, when the Help America Vote Act was passed with $3.8 billion, Georgia was one of the first states which came up with a single statewide solution, for I think $156 million, and because it was a single statewide solution there was the possibility of it being hacked quite easily. So some of the states have mandated that they have different vendors, different solutions. What are your thoughts on having a unified solution versus different solutions across the state? Yeah, I think Georgia's actually a good example of that. Georgia is one of the states that actually, statewide, remained with paperless touchscreen voting equipment, and it was all the same equipment across the state, and they had some very close election outcomes in this last election, and people are still talking about the viability of those systems. So I think that while there are benefits to a shared system across the state, you also have to look at the risks of that: it's a single point of failure or attack. You also have to look at the differences. Even in California, this jurisdiction has 5.2 million registered voters; the next largest is one-third the size, which is Orange County. The majority of counties in the state of California are much smaller, so their ability to support a system like we're talking about, or to have the resources to take something like this on, is very limited. And I guess the last thing I would say is that I think the record of the statewide systems in California sort of speaks for itself.
As a software developer, I'm just curious, are you able to share any kind of details about your testing suite, for the software development life cycle that you went through? Where are our vendors here, I was gonna... Are you phoning a friend? I am phoning a friend. No, I have a list of some of the software, but I don't know if it includes the actual suites. So we have two vendors, two partners, that are developing the solution for us. They have systems implemented to support continuous integration and continuous testing. It's sort of beyond my familiarity to be able to explain it, but I can tell you that they have test packages that they create, they go through their test cases, they report bugs out to a bug log, and they're using JIRA to support backlog management and user stories and things like that. So they have a pretty sophisticated development environment that is organized around agile methodologies, running through a pretty disciplined series of two-week sprints over the development life cycle, but I can get you more information later if you really wanna know the specific software that they're using to support that. That's a great question. I don't know the answer to that question. It's never been asked, and I'm not sure, I guess we'd have to go back and talk about that. Yeah, yeah, that's a great question. And we do, as Dean just mentioned, we are putting together an open technology working group. We're gonna be having a kick-off here fairly soon, and these are the kinds of questions that we're gonna be asking this team of people to help us figure out: how to open this solution up. So, other questions? This goes back to opening up the source to this. There have been cities, and I believe DC has done this, that have put their actual legal code on GitHub. And that enables you to take a really good look at that code, I mean, these are their actual laws, and you could follow a similar pattern to that for releasing the code, maybe a little bit earlier than you otherwise would be able to. We can look at that model. Yeah, we can look at that. Again, I think, to the earlier question, that's something that we aspire to. I think we just need to understand how that works in relationship to the regulatory environment that we have to operate in. I think the worst part, the worst case scenario there, is that everything gets released and then nothing happens. So, I had this problem where I wanted to develop a system in-house, and the way that Ticketmaster works is you have to get funding for it, it's like a small company inside a company, it's a whole big deal. So I would go to different teams and I would say, this is gonna solve a lot of problems, we should build this. And every team looked at me and said, great, and I said, would you wanna build it? And they said, no, we don't have time for that. And I'd go to each team and they'd be like, wow, that's awesome, you should build that. I'm like, me and my zero coders, how am I supposed to do that? Right, and I think we're in a similar situation here. A lot of people in this room are very passionate about making sure that this gets out. So I think the really big question is, okay, but who's gonna actually do the work? Is this gonna be a situation where it's like, who's gonna help me get the flour? I won't. Who's gonna help me get the chocolate chips? I won't. Who's gonna help me milk the cow? I won't. Who's gonna help me eat the chocolate? Like, I will, you know. But hey. Please.
You can hear all the regulatory stuff that they've gotta deal with, all the regulatory stuff. And I think the most important thing to be sure of is that some of your constraints are coming from the policy and the laws, and whether you'll be able to work within that. All of the California Constitution, all of the bills that are in the legislature today, their text is out on the California website. Sometimes you have to know where to look, under a department that has their own particular set of codes that they put together after the law was written establishing their organization. But all of that stuff, if you just work at it, is out there in text today. It is searchable, it is findable. It has change control, with the Constitution and the primary statutes. They did it right, and they did it long before GitHub was there. Not gonna disagree. But before we get mired in discussion and back and forth, which we've already kind of done a little bit of, does anybody have any other questions? I see one over here, and then I'll come back over here. Okay, and no, I'm not gonna ask for a Braille paper ballot on the way out of the booth, but I am gonna ask about the speech aspects. I've been voting on an accessible machine since 2000, and I gotta say the interface and the layout of the machines are clunky, and the poll workers don't understand how to use them. I had one friend of mine who actually waited five hours at her polling place until the accessible machine got there. So what's gonna be the layout of those machines and the speech? I mean, the segments sound like they're spliced together and the levels are all different; it's not very consistent. Thank you. I'm really glad you asked those questions, because that really goes to the fundamental purpose of the way in which we've done this project. Again, and I mentioned this earlier, we have intentionally gone beyond just simply complying with the minimum requirement of the law, which says have one piece of equipment in each voting place that allows for accessibility, and which, by the way, assumes that there could be one piece of equipment that just suddenly addresses every possible disability. We know that that's not true, and we worked very hard on that element, that balance, because historically in this country the trade-off has always been accessibility for security. We, from day one on this project, said that we were not gonna make a trade-off on those, and we did user testing. We involved people in the disability community in the design of this equipment. And so the speech pattern that you're talking about has been developed using people who rely on it. The physical nature of the device has been designed the same way; we actually worked with the United Cerebral Palsy center of Los Angeles County on the design, with people who have severe physical limitations and who have never been able to vote independently, without the assistance of another person. We went to them with foam-board cut-out prototypes, and then developed the system, took it back to them in electronic format, and watched as they were all able to cast a ballot independently. This device allows you to print out a paper ballot and deposit it into a ballot box without ever having to physically touch it, so that you can vote truly independently and in secret if you so desire. So that is probably the most significant change of this particular design, and something that needs to be duplicated. You're absolutely right.
It's unacceptable that we ask people to go into the corner of a room and vote on a device that the poll workers don't know how to use, sometimes don't set up, and that takes three to five times longer to cast a ballot than any other voter. And a lot of effort has been put into that process with these devices to ensure that that doesn't happen. You referred to your ability to trace an individual ballot all the way back. What is the demarcation point for anonymity? Is it when they hand over the paper ballot to the ballot box? The ballots are anonymous. Once the voter casts the ballot, you can't trace anything. Yeah, basically precinct is the closest you could get. That's why there's actually a legal requirement to consolidate up to a certain percent, and restrictions on reporting out when you don't have that, yes, yeah. So yeah, it's a great question. The ballot marking device doesn't transfer any information that can be used to determine who that voter was. It doesn't even record, you know. Well, it doesn't know who they are. Yeah, it doesn't know who they are at all. Is there an ability to at least know which machine it came from, in case there's an issue? Yes, okay. Yeah, the machine ID is recorded. And you have a way to make sure somebody doesn't vote two or three times, now that you have anonymity being preserved, yeah? That's through our electronic poll book. This is a quick one: can a person change the ballot after the ballot has been printed? Yes, so the question was, can a person change the ballot after it's been printed? Yes, up until the point that they cast the ballot, which is the point that they hit the button and the ballot goes back into the ballot box, they can change their choices. Now, once it's printed, they would have to void that ballot and be reissued a new ballot from the election board. But as long as the ballot has not been placed into the ballot box, the voter controls the ability to change it. And there are at least three points of verification in the user interface for them to look at that before they cast the ballot. The QR code, is that readable by, say, a phone QR reader? So I can take my phone out, check that the QR code on my ballot matches, I can decode the whole thing and see what it says I'm voting for? Absolutely. You can do that on your phone, but you can also do it manually, because of the codes that are next to the selections: you can actually tie it back to a glossary that tells you what those codes are. Oh, so it's printing a QR code for each race or something? No. It's printing a reference code for each contest outcome, or selection, and that is embedded in the QR code. It's one QR code; one QR code is printed on the ballot. It contains, it's just a string, a series of the codes, and it's only recording the code for the candidates, because the tally system, since it's a unique ID, can trace it all the way back to the contest and do all the other counting that it needs to do. But it's just one QR code, and it inserts in the QR code the unique ID for the selection that you made. It also prints that selection. So theoretically, what you could do is use a QR code reader, open it up, see all the codes that were in there, and match them to the ballot.
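Since the exact payload layout isn't spelled out here, the following is only a rough, hypothetical sketch of the kind of check a voter could do with any QR reader: decode the payload, split it into per-selection codes, and compare them against the codes printed next to the selections. The separator, the glossary contents, and the base-30 note below are invented placeholders, not the county's actual format.

```python
# Hypothetical check of a ballot QR payload against the printed codes.
# The payload format, separator, and glossary below are invented for
# illustration; the real VSAP encoding is not documented in this talk.

# Codes a voter might read off the printed ballot, next to each selection.
printed_codes = ["K7", "3F2", "A1"]

# String obtained by scanning the QR code with any phone QR reader.
qr_payload = "K7|3F2|A1"          # assumed "|"-separated list of codes

# Hypothetical glossary mapping codes to contest/candidate names.
glossary = {
    "K7":  "Governor: Candidate Smith",
    "3F2": "Measure B: Yes",
    "A1":  "City Council District 4: Candidate Lee",
}

qr_codes = qr_payload.split("|")

# 1) Internal consistency: QR contents match the human-readable ballot.
assert qr_codes == printed_codes, "QR payload disagrees with printed ballot"

# 2) Translate each code back to plain language via the glossary.
for code in qr_codes:
    print(code, "->", glossary.get(code, "unknown code"))

# 3) The codes were described as base-30 numbers kept short; if the alphabet
#    happened to match Python's 0-9a-t digits, decoding would look like this
#    (purely illustrative; the real alphabet is unknown).
print(int("3f2", 30))
```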
That's what I mean by a sort of internal auditing of the ballot. You could do it right there, just to make sure that the QR code matches, without having to rely on any of the hardware that the county provided. Exactly. Thank you. So, we know the PyTorch group is actually starting to come in, so, you guys, we have 18 minutes. So real quick before you go, I'll just remind people that there are materials up here and there is a website. What is published is all of our research, all of our diagrams, and a vast library of information about the project. The website is VSAP, v-s-a-p dot l-a-vote dot net. That's printed on the materials up here too, but there's a ton of information out there that you can look at. Thanks for that. This is our logo for it. Yeah. Can I have one? They're in limited supply, but you know what, I'm gonna give you one. Well, I know the guy who wrote the challenge. Oh, okay. That's why I said it worked. What do you mean, in 1999, that they all brought it up and built that law? Yeah. In 1999, no. Yeah, we, okay. That's when I was doing it. Okay. Well, there were only two and a half million voters when we did that in 1976. Wow. That became our move. Yes, yes. No, I think it's a little, I've heard that. That's probably my thing to be sure of. Yeah. Hi. Thank you very much. Thank you. It won't be anything. Leonard Panish was the registrar when, yes, when I was doing that stuff. It was not fun, and after it passed, that was Prop 13, they just went, well, there's no raising revenue, so they kind of wrecked it. Thank you very much. Thank you. I think you know how to create a little bit. What's the worst one? I mean, I think you just did. Oh, it's hard to pick this one. I will tell you a funny story: when I was there, they had a recount for a governor's race. Yeah. The newspaper published a picture of a ballot where they had the broken-arrow format, not the oval that you actually fill in. And the voter had connected all of the right-hand arrow halves to one on the left-hand side, and of course there was no way to determine voter intent. And the voter actually called us after seeing that in the paper. How is that not, what's the word? No, it's the opposite. So you've got three candidates, and they all connect over to one of the arrows, so it's interpreted as an overvote. But we actually had a voter call, after seeing all of the paper, and say, no, I did that intentionally, I wanted to divide my vote one-third for each of the candidates. I'll see you in the room. That was good. You know, I didn't know how you'd do it. Are you on the position? No, I'm present. You're the only person who's trying to protect us. Thank you. I'll be right back. I hear you. I'll be right back. We're trying to figure out what you're going to do, and so I guess we're going to answer all that. But no, it's bad. Don't put it that way. No, no, I totally agree. It's a policy. I was going to say, it's fine. I'm going to serve as a candidate, right? Just give me one. Yeah, okay. Well, version one. Yeah. Okay, go. You can put that out. They can put the job out. They can put that out. They can nail it up. They can put it in the library system. Yeah, and, like, God bless them. But, you know, perhaps, I'm just glad you're here, right? Finally, I'm going to go. I am still up, although I am skeptical.
I'm not having to do the main thing. I could have both. I'm glad you got it. But next question. So, yeah, this is great. And I think, on the optics, people have the optics of, you know, quickly pushing your board, suggesting to your board. Right. Positive optics. Hard, hard. Hard. And you're already getting positive optics. Yes. And there's always people who take shots out there, whatever it is. Yeah, yeah. And it's just a fascinating environment. You're in an environment where, you know, the President of the United States comes out and makes a comment about the behavior of people and it becomes news for two weeks, and there's nothing factual it's based on. It's just part of the process now. Yeah. So, it's there. That's what we mean by, you know, it's there. Sure. What's the potential for the positive optics? I think it's far greater. I agree. I agree. Have you been to DEF CON yet, the Voting Village? Yes. Were you there last year? We sent the team the last two years. Okay, great. Neal was there last year? Yep. Please come this year and bring what you can, talk about what you can. Okay. And there's media there. Yeah, I don't know. Great stuff for them, and they would love this, I think. Yep. And it'd be like changing the story they've run for the last two years. Yeah. And we're doing third party, we're, even before we go to certification, we're doing third-party pen testing. Great. Yeah. Yeah. And, but. Of course. Thanks so much. Thank you. Thank you. Come here. I will say, my honor, it's just a really quick comment. I'm a local resident and I just want to say that you are fantastic on social media. You're really responsive. I think I had a polling place issue and I tweeted you and you were, like, immediate. Good to hear it. Thank you. Good job. Thank you. Hi, so I wanted to talk a little bit about the precinct reporting, because I'm coming from a project we presented on Thursday called Open Precincts. So we're trying to make a national database of voting precincts, with election results matched to them, by 2021, so that when that kind of redistricting happens, every citizen has, you know, equality. Yeah. And you can use it to draw a strong map for the citizens, so you can support commission stuff like California does; you can show the map directly. We're going to try to make tools. So basically, we're at the very early stages of development right now, and we're trying to think about how we approach the journey. I grew up in California, but LA is so different from where I grew up. So I just want to know. There's base precincts and then there's election precincts. Yeah. I've heard a little bit about LA precincts, and I don't really understand it, so I just want to make sure that the tool that we're going to produce is going to line up with that. Oh, okay. But do me a favor, send me an email. We can, if you are interested, we'd love to bring you in and we'll just talk about tracking precincts uniformly, like, across the nation. So, yeah. So we're developing a tool called Open Precincts. It's going to try to have a national database of voting precincts with data, and I just want to make sure that, I've heard that LA precincts are different in every single election. Well, that's what I say, there's differences: there's election precincts and then there's base precincts. Okay.
And they can show you that. And we've done a lot of work on that with Google's voter information project, and a lot of people get the data for polling places in the nation from that, or something like that. So I think we've got some good stuff we can do. Yeah. Great. Great project. Yeah. This is nothing. Thank you. Great. Well, thank you. Oh my God. Thank you. Yeah. We'd love to have you over. Yeah. Sure. Yeah. All right. What is all this? Yeah, I was scared. I was scared. But it was great. Yeah, I was scared. Thank you for being here. Which one are you? No, that's my mic. Did you get your last one? Yeah. Put that in my back pocket, right? Yeah. Put it wherever you want this stuff to go. Okay. That's reasonable. And then you want me to repeat the question? Yeah. Okay. I'll be able to hear it on that thing there. How much time do you have for questions? Whatever's left after I finish. Have you ever talked about this? We don't have a laser pointer, do we? I'll intro myself there. Wait another two minutes. All right. I'm going to start now. Hi, everyone. Thank you very much for coming. My name is Peter Goldsborough. I'm a software engineer at Facebook in Seattle. I've already gotten more vitamin D in the last 24 hours here in Pasadena than in the last six months, so it's been a good trip so far. I've been at Facebook for about one and a half years now, and for the first year or so I worked on a project in a team called PyTorch. PyTorch is a team under Facebook AI Research, which is Facebook's sort of academic, out-in-the-open, blue-skies artificial intelligence research division. And I'm going to give you a bit of a whirlwind tour of what PyTorch is. I don't expect any prerequisite knowledge about machine learning or about PyTorch. If you can read Python, that would be useful, but that's probably just reading English. So I just want to give you an idea of what you can use PyTorch for and what other people use PyTorch for. So what is PyTorch, in a couple of words? PyTorch is an open source machine learning framework for both flexible research as well as, now, high performance production environments. It's exposed both in Python and C++, and now, more recently, other languages too. So, just to focus on each of these points a little bit: PyTorch is open source. It's on GitHub, you can go on github.com/pytorch. It's always been open source, and it's actually based on a very long history of open source frameworks. And it's used for machine learning, so that can be any kind of scientific computing, neural networks, but really anything that involves matrices and predicts something. It was originally focused on being very flexible for research, so enabling researchers to try out the wildest ideas. But now, in more recent releases, we've also focused on supporting researchers or engineers taking their research and deploying it to Facebook-scale, or start-up-scale, or any-scale production environments. I'm going to talk about that too.
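As a tiny illustration of that "matrices that predict something" core (this is not from the talk's slides, just a minimal sketch), here is what a couple of lines of PyTorch look like:

```python
# A tiny illustration of tensors, a matrix multiply, and gradients.
import torch

x = torch.randn(4, 3)                      # a batch of 4 inputs with 3 features
w = torch.randn(3, 2, requires_grad=True)  # weights we want to learn

y = x @ w                 # plain matrix multiply, evaluated immediately
loss = y.pow(2).mean()    # a stand-in "how wrong are we" number
loss.backward()           # autograd fills in w.grad for a learning update
print(w.grad.shape)       # torch.Size([3, 2])
```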
And as the name implies, PyTorch was originally written in Python and exposed in Python, but we also have a very big C++ front-end, which I actually wrote, so you can ask me more about that. And now we also have a couple of other language front-ends built on top of the C++ front-end. A little bit of history about PyTorch. Today we have PyTorch 1.0; that was our biggest release, a couple of months ago, and it focused really on the production environment. But PyTorch is actually based on another, much more venerable framework called just Torch, which started out in 2002, which was actually quite a while ago. It started out as a scientific computing framework. Back then deep learning was barely a thing; it was intended for any kind of scientific computing, and at the time support vector machines and other machine learning techniques were more interesting. And it was originally written in C++, and then the author, Ronan Collobert, who also now works at Facebook AI Research, was always looking for a better and better front-end than just C++. So at one point he released Torch 4, which was exposed in Objective-C. That sounds like a bad idea anyway, so he switched to Lua with Torch 7. And Torch 7, also called Lua Torch, was the last really popular release of the Torch framework, popular up until a couple of years ago. And then in 2016, an intern at Facebook had an intern project of taking the underlying C++ back-end of Torch and building a Python front-end on top of it, and that's how PyTorch was born. But what's even more interesting is that Torch has an even longer history: the first torch ever was actually released about 300,000 years ago by the first Neanderthals. And it's funny, but it's also not funny, because Neanderthals are still more advanced than any AI we can build, even with PyTorch, today. So I want to give you a bit of an idea of what you can build with PyTorch, because you might not be that familiar with machine learning. One example use case is machine translation. Here's Google Translate, which probably doesn't use PyTorch, but, for example, Facebook's machine translation, which runs about 2 billion translations on the Facebook platform every day, uses PyTorch, and the source code for the models used for that is on github.com/pytorch/translate. Another, more researchy topic is any kind of game AI. You might have heard about AlphaGo a couple of years ago, which was released by Google DeepMind, and Facebook AI Research actually released an open source version of AlphaGo implemented in PyTorch. The original Google version was not open source, and researchers at Facebook AI Research re-implemented and open sourced it using PyTorch. And another example is really any kind of computer vision application. This here is neural style transfer, which takes some existing image, usually a painting, and applies the stylistic features of that to another picture. If you go to the Facebook booth in the other building, you can actually see a lot of that. And here's a very brief idea of what using PyTorch looks like; a rough sketch of the same kind of program follows below. So this is Python here. If you were to write your own model in Python, you would define your neural network, which uses other, smaller neural networks built into the core library. You would create optimizers; an optimizer is something that uses the feedback you get from the training process and updates the parameters of the neural network. And you would have a data loader, which you give some data set, some image set or text data set or speech data set, and it loads that in parallel for you.
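Since the actual slide code isn't in the transcript, here is only a minimal sketch of the kind of roughly 30-line program being described; the network shape, the random stand-in data, and the hyperparameters are placeholders. The training loop at the end is what gets walked through next.

```python
# A minimal sketch of a typical PyTorch training program (placeholder model/data).
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# A small neural network built out of the smaller networks in the core library.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

    def forward(self, x):
        return self.layers(x)

model = Net()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # updates parameters from feedback
loss_fn = nn.CrossEntropyLoss()

# A data loader that batches (and can parallelize) an arbitrary dataset;
# random tensors stand in for a real image/text/speech dataset here.
data = TensorDataset(torch.randn(1000, 784), torch.randint(0, 10, (1000,)))
loader = DataLoader(data, batch_size=32, shuffle=True)

# The training loop: loop over the data, get predictions, compute a loss
# against the labels, and update the parameters based on that feedback.
for epoch in range(10):
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
```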
And then you would have a training loop, which loops over the data set, in this case 10 times. It gets predictions from the model for your data, computes a loss between the predictions and the labels, and then updates the parameters of the network based on the feedback from that. So that's what a typical PyTorch program would look like, in about 30 lines of code. And I first want to talk about how PyTorch is useful for research, and later on about how it's useful for production, since research probably takes up about 95% of the time of any machine learning pipeline. The first two reasons are eager execution and dynamic neural networks, and I'm going to talk about what that means exactly. And to get an idea of why eager execution, which is really PyTorch's core feature, is so useful, I'm going to talk about the opposite first, which is static execution, more popular in other frameworks. So in static execution, if you write a program, for example here, c equals a plus b, then the variables a and b would not actually represent numbers; they would represent some kind of variable object that doesn't have any concrete value at that time. And when you write a plus b, you're not actually getting the value of a plus b; instead, you're constructing a computation graph, which represents the operation a plus b. And then later on, you would feed values into the graph, the graph gets evaluated, and then you get the results. So as you can see, it's a bit complicated, because there's this two-step process: you're first creating a graph, kind of a program inside the program, and then you're evaluating it. That's static execution. Now, with eager execution it's actually a lot simpler, and the reason why I put static first is because eager doesn't really need any explaining, it's just a program. So in PyTorch, when you write a equals two and b equals three and you run a plus b, you get an actual value; there's no graph involved. So this is a lot easier and a lot more flexible. In a couple of words, what this means is that in both cases you have a program, and in the eager mode you go from the program directly to the results, whereas in the static mode you go from the program to a graph, and from the graph, once you execute it, to a result. And the reason why you would want to do that in the first place is that, allegedly, you can take the graph and optimize it, fuse some operations or reduce the amount of memory that certain operations take; you can even take the graph, split it in two, and then make one part run on one machine and the other part on another machine, and hopefully get the results faster. But it's just a lot more complicated conceptually. So, in a couple of words, in eager mode the benefits are that you have no boundaries on your flexibility. If you want to download something from the internet in the middle of your neural network model, you can do that; you can't do that in a graph. You can just import pdb and call pdb.set_trace() within your model and see all the values live, which you couldn't do in a graph framework unless the framework itself provides some kind of debugging utility, which they often don't. And there's no expensive compilation process; the code just runs immediately. Whereas in the static mode, the benefits are that you can optimize it, which is often interesting for deployment, which is why I'm going to talk about PyTorch's static mode later on. And it's also easier to deploy.
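To make that eager-versus-static contrast concrete, here is a rough sketch. The eager half is plain PyTorch; the static half uses a toy stand-in class just to show the build-then-run two-step process, and it is not any real framework's API.

```python
# Eager mode (what PyTorch does): values exist immediately.
import torch
a = torch.tensor(2.0)
b = torch.tensor(3.0)
c = a + b          # c is an actual value right now
print(c.item())    # 5.0

# Static mode (the graph-building style of other frameworks), mimicked with a
# toy deferred-evaluation class purely for illustration.
class Placeholder:
    def __add__(self, other):
        return ("add", self, other)   # build a graph node, no value yet

def run(graph, feed):
    op, x, y = graph
    return feed[x] + feed[y]          # values only appear at run time

pa, pb = Placeholder(), Placeholder()
pc = pa + pb                          # a graph, not a number
print(run(pc, {pa: 2.0, pb: 3.0}))    # 5.0, only after feeding values
```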
And the reason why it's easier to deploy is that, if you need to update your model, in the eager case your model is your code, so you would have to redeploy your code, your binary. Whereas in the static mode, you can actually serialize your graph, and then updating your model just means sending the new model, in the new serialized format, over the wire. And then your server, or whatever it is, can just load the model and execute the new model without any new code being deployed. That's why it's a bit easier to deploy. Another reason why PyTorch is very useful for research is that it supports various different hardware platforms, and here's a selection of the different categories of hardware platforms that PyTorch can run on. On the left is the usual CPU, which is really too slow for most research today. The CPU gets about one to three teraflops, that's trillions of operations per second, running things like matrix multiplies; in this case here it's an Intel Xeon. Then in the middle is the GPU, which is really the device that sparked the most recent machine learning revolution, because people found that running machine learning on GPUs, which are made really for trillions of parallel point-wise operations, is actually a lot, lot faster. So as you can see, a GPU can get between 10 and 20 teraflops, which is about 10x what you can get with a CPU. And shown here is the NVIDIA Tesla V100, which is NVIDIA's most recent version. And even though the original revolution was kind of based on just everyday graphics cards used for gaming, more recent versions like this graphics card actually have explicit units within them optimized for the matrix multiplies found in machine learning research. These are called tensor cores and apparently get up to 100 teraflops, but that's usually limited by memory bandwidth and hard to achieve. But then the third category, where it gets very, very exciting and very specialized, are ASICs, which are application-specific integrated circuits. These are chips built specifically for particular machine learning use cases. So in this case here, this is Google's TPU, which it released about two years ago, which is really made for just getting as many flops out as possible, so this achieves between 40 and 100 teraflops. At the same time, it's very limited: it can only run, say, computer vision models that have a lot of matrix multiplies and that don't do a lot of control flow or anything fancy like that. So it's very specific, but if that's all you need, it can get you the most operations per second possible. And another example here is the Big Basin server. This is a typical server you would use for machine learning research. It's actually a Facebook server, but it's also part of the Open Compute Project, which means you can actually find the hardware plans for this box online and build it yourself in your garage. This one has eight of these NVIDIA V100 GPUs, and it gets about eight times those 40 teraflops, so it's a really beefy box, and we have many of those in our data centers. Another exciting thing about PyTorch is that it's optimized for distributed training. Most machine learning models today don't fit onto a single GPU after a very short amount of time, so PyTorch has great support for running distributed training, and distributed training can mean running a single model on many GPUs on a single box, as well as on many boxes with many GPUs on each box.
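As a rough sketch of the single-box, multi-GPU case that gets described next, assuming a machine with three GPUs (the model and device ids are placeholders):

```python
# Single-box data parallelism with torch.nn.DataParallel; needs 3 GPUs to run.
import torch
import torch.nn as nn

model = nn.Linear(784, 10)                        # any model instance
model = nn.DataParallel(model, device_ids=[0, 1, 2]).cuda()

inputs = torch.randn(96, 784).cuda()              # a batch of inputs
outputs = model(inputs)   # the batch is scattered across GPUs 0, 1 and 2,
                          # run in parallel, and the results gathered back
print(outputs.shape)      # torch.Size([96, 10])
```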
And to give you an idea of just how simple that would be: if I have some model instantiated, I can just pass that to the Torch DataParallel class, which will take my model and automatically distribute any input I give it onto, in this case, GPUs 0, 1, and 2. And if you have a multi-machine setup, where you have many different devices communicating over TCP or RPC, PyTorch also has support for that. It's a little bit more complicated, but it still only requires a couple of lines of code to set up the server on each machine. And then when you evaluate the model, you just pass it your inputs, and PyTorch will automatically scatter them to each machine, run the inputs, gather everything back, and also update all the parameters appropriately. And to give you an idea of just how much of a speedup you can get, here's the training performance on a ResNet-101, which is a very popular computer vision model, running in distributed mode with PyTorch. The green box is the ideal speedup you would get from scaling from one GPU up to, in this case, 8 GPUs. And you can see the results for running PyTorch in distributed mode using 100-gigabit TCP connections in the blue box and 100-gigabit InfiniBand connections in the orange box. As you can see, PyTorch gets really quite close to ideal scale-up. And if you distribute that to even more machines, you get even better performance. For example, one popular training set in computer vision is called ImageNet, and it used to take between 20 hours and even two weeks to train a single model on ImageNet. About one and a half years ago, Facebook released a research paper on training ImageNet in one hour, where they basically scaled up the training onto 256 machines with PyTorch, running many GPUs on each machine, and reduced the training time from originally a couple of weeks, then a couple of days, to just one hour. Of course, after that, more people published papers on training ImageNet in 19 minutes, but Facebook did some pioneering work there. And I think another important thing about PyTorch is that it has always valued simplicity over complexity. This sort of follows the Zen of Python: PyTorch was originally exposed in Python and still is exposed in Python, and the Zen of Python sort of tells you that when you have multiple options for doing something, you should always pick the simplest one. And PyTorch really does that, and I think its API is incredibly friendly to users. Even people who just started programming Python can oftentimes pick up PyTorch, because it's very native; it feels like Python, it feels like NumPy, and it's very easy to pick up. And another way PyTorch keeps this simplicity as a high priority is that it keeps this API consistent across different languages. So PyTorch is exposed in a number of different languages, and they're all built on top of a C++ back-end. This C++ back-end provides basic things like tensors, or matrices, hardware support for GPUs or ASICs, and automatic differentiation, which is the technique used to update parameters based on feedback from the training. And based on the C++ back-end, there is the Python front-end, which has always been available, and then, since a couple of months ago, also a C++ front-end, which is really quite equivalent in its API. A couple of community users have actually contributed more APIs on top of the C++ front-end, so we have an OCaml API now, a Rust API, a Haskell API, and a couple of other languages too.
And while the Python front-end and the C++ front-end come from the core team itself, these other languages have been added on top by community users and contributed to the PyTorch project. To give you an idea of how this simplicity is kept across APIs, on the left you see PyTorch in C++ and on the right you see PyTorch in Python, and as you can see, there's really not that much difference. Most people would expect the C++ API to be like three times the amount of code, but I think we've kept the APIs quite similar. And the same is true for the actual training loop: you still instantiate your network, you have a data loader, you have optimizers, you run your training loop. The for loop looks a bit different, but otherwise the lines of code are almost the same. I think that makes people's lives easier, especially if they have to use C++ because of performance constraints or reasons like that.

Now to the reasons why I think PyTorch is really useful for production; there are two. The first is that we have recently gotten great support for bare-metal deployment, which means deploying PyTorch models in resource-constrained environments, or basically environments without Python. The second is that there are also a couple of cloud solutions, which I think could be interesting if you want to run PyTorch in production without actually getting your hands dirty writing the code or deploying the model yourself.

First, bare-metal deployment. If we go back to the original slide I had on the pros and cons of eager versus static mode, I mentioned that all the eager benefits are great for research: you want no boundaries on your flexibility, you want to reduce the latency between the idea in your head and the code in your editor, you want debugging, you want to see values live, and you don't want to wait very long for your program to run. At the same time, for production we care a lot more about the other column: we want the model to be optimized so it runs as fast as possible, and we also want it to be easy to deploy. So for this reason, on top of all the benefits PyTorch already provided, we've recently also added the benefits of the static mode. Before this release, PyTorch models were just Python programs: Pythonic, intuitive, debuggable, hackable, and you could use any other library, like NumPy or anything you wanted. But they needed Python to run, and they were difficult to optimize. So what we did is label PyTorch as it was "PyTorch eager mode", and we've now added, alongside eager mode, "PyTorch script mode". The idea of script mode is that we let you annotate parts of your Python program in such a way that we can take that program, analyze it, compile it, optimize it, and turn it into a graph, without you having to rewrite your code. Once you have that graph, you can serialize it to disk, you can deploy it, you can run it without Python, as I'll show, and you can also optimize it further to run on other, more specialized hardware platforms. And the way this looks is actually very simple. Here is an existing PyTorch model, and if you now want to turn it into a graph, all you do is, instead of inheriting from nn.Module, you inherit from torch.jit.ScriptModule, and you annotate the method you want compiled with @torch.jit.script_method.
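As a rough illustration of what that annotation looks like, here is a minimal, made-up module written against the torch.jit API as it was around the PyTorch 1.0 release; later releases also let you script a plain nn.Module with torch.jit.script, so treat this as a sketch rather than the one true way.

```python
import torch

class MyModule(torch.jit.ScriptModule):   # previously this would have inherited from torch.nn.Module
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(1024, 10)

    @torch.jit.script_method              # this method gets compiled into a graph
    def forward(self, x):
        if x.dim() == 1:                  # ordinary control flow is still allowed
            x = x.unsqueeze(0)
        return self.fc(x)

module = MyModule()
module.save("net.pt")                     # serialize the compiled graph to disk
loaded = torch.jit.load("net.pt")         # reload it later; the C++ equivalent is torch::jit::load
```

This also ties back to the deployment point from earlier: updating the model in production just means shipping a new net.pt, with no new code.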
Of course, the one catch is that we can't compile everything; we can't compile every possible Python program, and the important phrase here is really "an optimizable subset of Python". What this means is that we still allow you to use control flow, you can use if clauses and while loops, you just can't do crazy things like downloading an image from the internet inside your program. Right now we also don't have support for calling out to arbitrary other libraries, since we don't have any control over that code, but really 99% of PyTorch models can already just be annotated, turned into a graph, and used in production environments. And then once you have your graph, you can serialize it to disk, here as net.pt, and then in your C++ server we have a very small C++ library which you can link in. You load the model, take some input, for example an image coming into your web server, turn it into an array of numbers, and execute your model without any Python dependency at all, and then do whatever you want with the result. So you can keep all of your research in Python, where your data scientists move fast and have all the flexibility of the Python ecosystem, but then run it without any Python dependency at whatever scale you need.

Now, a couple of cloud solutions I want to mention. We've gotten support from all the major cloud providers for some kind of PyTorch integration. Amazon SageMaker, which is AWS's kind of high-level API for doing machine learning, has PyTorch integration. Google Cloud also has PyTorch support; for example, one offering Google Cloud has is Google Colab, a service where you get a Jupyter notebook for free, running on, I believe, two GPUs, with both TensorFlow and PyTorch pre-installed. So at absolutely no cost you can get a Jupyter notebook with GPUs running on Google Cloud, import PyTorch, and run your models on it. And Microsoft Azure similarly has support for PyTorch. And here's an example of what using Amazon SageMaker with PyTorch looks like: it's basically a very simple API where you give it a Python file, tell it what kind of Amazon instance you want, and pass it an S3 bucket with your data, and it deploys instances for you, runs PyTorch, trains the model, and gives you predictions from it. So that's actually a very nice API.

And as I mentioned, PyTorch is quite popular. This is maybe 0.01% of PyTorch users, but a couple I'll throw up here: NVIDIA is a big user, Uber AI Labs, Twitter, and many other industry partners. We have just as many academic partners, for example Stanford, CMU, Oxford; these are just a couple off the top of my head. But really, all across the world people are using PyTorch, both in academia and industry, and really just people at home trying out machine learning as well.

And if that sounded interesting, there are a couple of resources I would recommend. One is a free Udacity course called Intro to Deep Learning with PyTorch, developed with Facebook AI, which teaches you how to train machine learning models with PyTorch without any prior experience. Another good resource is fast.ai, a course meant for existing coders who want to learn how to do machine learning with a little bit of Python background, and it's also a very friendly course.
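Going back to the SageMaker workflow mentioned a moment ago, a rough sketch of what that looks like with the SageMaker Python SDK follows. The entry-point script, the role, the S3 path, and the instance types are all placeholders, and the exact parameter names have shifted between SDK versions, so this illustrates the shape of the API rather than something definitive.

```python
from sagemaker.pytorch import PyTorch

# Hypothetical setup: train.py, the role, the S3 path and the instance types are placeholders.
estimator = PyTorch(
    entry_point="train.py",             # an ordinary PyTorch training script
    role="MySageMakerExecutionRole",    # IAM role that SageMaker assumes on your behalf
    framework_version="1.0.0",
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
)

# SageMaker provisions the instance, runs the script against the data in S3, then tears it down.
estimator.fit({"training": "s3://my-bucket/my-dataset/"})

# Optionally put the trained model behind an endpoint to get predictions back.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```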
Fast.ai used to be based on another framework, but they switched to PyTorch as well, I think within the last 6 to 12 months, and it has great resources for learning both PyTorch and machine learning in general.

And that's actually all I've got. I went a bit fast, but I guess that gives us about 36 minutes for questions. I would also recommend pytorch.org; it's a great resource with tons of tutorials and tons of examples. But yeah, I think we'll use this time for questions: if you have any questions about machine learning frameworks, or PyTorch, or Facebook AI, or Facebook in general, feel free to ask anything.

So let me pass the mic around, but before we get to that, let's thank our speaker for giving us a very well-optimized, speedy talk. He seems to be quite good at that sort of thing. Does anybody have any questions?

...like just blue-skies research. But I think the requirements for joining PyTorch were mostly: have you worked with C++ before, and do you have some knowledge about compilers, which was kind of useful. And generally, yeah, I knew what machine learning was as well. So I think those three things.

I have an oil-on-the-fire question. How do you compare PyTorch versus, say, TensorFlow? Do you see them fulfilling different use cases? Do you see them as competing technologies?

Yeah, for a long time the main difference was that PyTorch was really meant for research, because it was flexible and it has this eager mode, whereas TensorFlow was always less usable and people didn't really like using it, but it had the benefit that you could deploy it very easily. TensorFlow even has a thing where it makes a web server for you out of your model immediately. So there used to be a bit of a distinction, but now with our latest release we've added this whole graph mode on top, and we're basically giving our existing users the opportunity to also use PyTorch in production, which closes the gap with TensorFlow a lot. At the same time, it's interesting because TensorFlow has for about a year now been working on TensorFlow Eager, which is basically an eager version of TensorFlow. So really we're seeing both frameworks converge to the same thing, with some different flavors on top. I think now there's pretty good parity between PyTorch and TensorFlow, and both can take you from research to production.

I think about a year ago it would have made sense to use MXNet. Amazon supported it and is still supporting it, and it also has a PyTorch-like API, but honestly I wouldn't know why I would recommend it today over PyTorch and TensorFlow. One nice thing about it was that it was a genuinely open, community-driven project; it was developed by some random guys out in the open. But I think today Amazon is also ramping down its use of it. Similarly, another framework was CNTK, which is used by Microsoft, but Microsoft actually decided about three months ago to move away from it.

What we usually say for Java is that we have a JNI interface for model loading and execution. We don't have an interface for training, not like DL4J or something, but you can run your models on Android and things like that. So it's a bit more limited if you compare the two.

Is there something like TensorBoard for PyTorch these days? We've made that work, and Google has been helping us with that.
So TensorBoard has turned into a standalone tool, and we've made PyTorch work with it as well. So yeah. Let me give you this so that we can record it. Sure.

So you talked a little bit about the static mode and making it more friendly for deployment. Again I'm going to make a reference to TensorFlow: TensorFlow has TensorFlow Serving, and NVIDIA has TensorRT. Is there a deployment story that's native, or, I know there's some ONNX support, so what does that look like?

Yeah. So we don't have anything similar to TensorFlow Serving. Basically, our deployment story right now is this script mode: you take your PyTorch model, you convert it into a graph, and in between you can apply optimizations. Once you have that graph, you can always do more things with it. That graph can be converted to ONNX, for example, and then there are hardware platforms; for example, the TPU is powered by XLA, which is the compiler for the TPU, and XLA is getting support for ONNX. So basically, very soon you'll have this whole pipeline where you can take your PyTorch model, turn it into a graph using script mode, and then optimize it further to run on a TPU, and I imagine that NVIDIA is also working on similar support for TensorFlow and PyTorch.

So machine learning is a fairly abstract thing for me; I'm a programmer, so I can read Python scripts, but at what point does the math become a factor, where you need a lot of knowledge of math to work with machine learning?

I think it becomes a factor at the point where you want to do your own research. If you're exclusively interested in using machine learning, trying out different models, or doing something with machine learning like building an application, you don't really have to touch the math at all. I think the only point where you would be touching the math is if you wanted to implement something state-of-the-art that doesn't have an existing pre-canned API, or if you wanted to develop something on your own. For example, I recently built an app that had to use some computer vision, and I just used Google Cloud Vision for that; I didn't need to do any math. At the same time, if you wanted to implement a paper that just came out at a conference two months ago, there's probably no nice API for it and probably no pre-trained model either, so you would have to read the paper and understand it to implement it, and that's probably where you would start doing math, especially if you want to build your own research on top of it.

It also helps with debugging: if your thing starts to go sideways and you don't understand why, being able to debug into it and say, oh, this equation requires that I do this, that helps. Yeah. I guess that's also the thing: you would need advanced calculus knowledge to read papers and implement them, but even to do some basic machine learning models it's good to know what a tensor is and to understand some operations between them. Having a basic, friendly approach towards math I think is helpful, but for just high-level use cases I think it's not required.

So if you're implementing a new model and let's say the data type is video, you'd have to spend a lot of time making this thing that can take video file data frame by frame, let's say, or if you're doing
audio, you'd have to create something that takes in audio and converts it into individual data frames. Is there a large ecosystem of user contributions for different things like that, so that you don't have to reinvent the wheel each time?

Yeah, I think there are about a hundred thousand repositories on GitHub that have PyTorch in the name, and a lot of those are data loaders for specific data sets. In the actual PyTorch organization on GitHub we have torchaudio, which deals with taking audio data and turning it into a nice data set for you, and we have torchvision, which is for everything vision-related, and there are speech recognition toolkits as well. So I think today you wouldn't have to implement anything like that yourself; whether it's speech, video, or text, for basically all of those data forms people have done the research and built nice APIs for PyTorch. So I don't think you would have to reinvent the wheel; it's really about picking the right project.

Another question: Hi, so I went through the Kubeflow session they had here on Thursday, and I was looking at your slide of the different CPU, GPU and other options, for training especially. How do you know in advance how many resources, or what the right processor is, before you go ahead and do it? Either you may not have enough resources, if you're doing it on-premise, or if you're in the cloud you may get a surprising bill.

If you're training a particular model, I would probably look at the paper for that model, which will say something like "in our experiments we trained our model on so-and-so many devices for so much time", and that would probably give you a good idea. Otherwise, I think eventually you get a feeling for how much compute you need for a particular kind of model; for a big computer vision model you would probably need maybe 8 GPUs or something. I think 8 GPUs would probably get you started with almost any machine learning model today, and then if you wanted to make it even faster you would scale it up, or scale it down if it's too expensive. But I think basing it on what other people have previously tried in terms of compute is probably the best idea.

Do you have any other GPU providers that are your partners besides NVIDIA?

Yeah, so we've been working with AMD for a long time too. AMD has an API similar to CUDA, where you can basically take an existing CUDA program and make it run on AMD, and we've been working together with AMD to also make PyTorch run on AMD. I think it's sort of alpha support, so it works, but it's not as well supported as CUDA; we've been working hard with AMD to get that going, and I think that's the only major GPU provider other than NVIDIA today. So yeah, AMD.

Any other questions? Yeah? You got 25 minutes of your life back. Thank you.