Thanks for having me. I got to see this space for the first time, I don't know, about four or five months ago. I was here for a conference and came to visit BIDS for an hour or two, and I was literally taking pictures of the place to take back to Illinois, saying, "We need this." So it's really great to be able to give a talk here now, having spent a lot of time telling people at Illinois what a great thing this is to have.

Like you said, my name is Claire Sullivan, and I am in the Department of Nuclear, Plasma, and Radiological Engineering at the University of Illinois. Today I'm going to talk a little bit about some sensor network work we have going on through a few different sponsors, who I'd like to give a shout-out to. A lot of what you're going to see today is sponsored by DARPA through a Young Faculty Award, but we are also funded by the National Nuclear Security Administration through a couple of different consortia, predominantly the Consortium for Nonproliferation Enabling Capabilities, or CNEC, along with a couple of the other NNSA consortia, and we have some funding from DTRA. So I have to give my thanks there.

And I have to start with the obligatory outline slide: what are we going to talk about today, and more specifically, when are we going to be done talking about it? I'm going to talk a little bit about the problem, but here's the preview. Nuclear security: the illicit movement of nuclear material that could be in something like a nuclear weapon or a weapons component, or maybe a dirty bomb, some sort of radiological dispersal device. The idea is that we are detecting ionizing radiation. Now, I know that in this particular audience we'll have people ranging from nuclear engineers and radiation detection experts to people who don't know anything about radiation detection but are brilliant data scientists, with geospatial people mixed in. We're going to talk about all of it, hopefully at a level where everybody can be on the same page, but if anything I'm talking about just isn't making sense, let me know.

So the problem is that detection problem, and we'll talk some about a data set that we've been collecting at Illinois, which we call UIUC-net. Then, how can we be in a data science talk without talking about some sort of machine learning? I'm going to talk about several different machine learning concepts, and then I'm going to get totally out of my element and start talking geospatial. I am not a geospatial person; I have a postdoc through the CyberGIS Center at NCSA, the National Center for Supercomputing Applications on our campus, and he's the one who's done all of these GIS techniques. So I'm not a GIS person, but I stayed at a Holiday Inn Express last night. Then I'll talk about how we're going to start fusing sensors into a giant sensor network, and of course we have to end with future work, because this work is never done.

So, the problem and the data. I'm going to use a term coined by one of my sponsors, specifically what he likes to call the dumb physicist approach. This is his term, but I think it's kind of funny. The dumb physicist approach is this: let's say I am some sort of first responder, maybe a police officer or a customs inspector, and I have a radiation detector. Now, around us all the time we have natural background radiation.
It comes from the soil, from building materials, from people, from all kinds of things. It changes with position and it changes with time, but we understand it to be a Poisson process, so we can know its mean and we can know its standard deviation. The question is this: as I wander around, my detector is measuring radiation, and in this case we're going to be talking a lot about gamma ray counts and gamma dose. I can figure out what my mean is, and the dumb physicist approach says we need to set some sort of alarming threshold that says, aha, I have found something that shouldn't be here, a higher amount of radiation than I would expect. The dumb physicist approach uses a K-sigma method, where K might be three: if my gamma ray count rate exceeds three sigma above my mean background rate, I'm going to call that an alarm. To my particular sponsor who coined the term, that is the dumb physicist approach, but those of us in this room who understand statistics know that if I set three sigma as an alarm threshold, we expect there to be false alarms, and we know how often those should occur for a Poisson process.

False alarms in this world are bad. False alarms lead to bad days when all of a sudden you're informing the federal government, "Oh geez, we think we might have found somebody smuggling a nuclear weapon." You don't want to get that wrong, right? That's going to lead to a pretty big response if you think it's really going on. So minimizing false alarms while maximizing the probability of detection is the name of the game here, and of course I have to start with the XKCD comic about p-values. But really it boils down to this: for the nuclear engineers in the room, we have radiation sensors, and we could be measuring gamma count rates or gamma ray spectra. These are two different forms of data; they tell us different things that eventually get at the same thing. We're going to be applying machine learning to a large network of these detectors, and should we be applying it to gamma counts, gamma spectra, or both? We can bring in geospatial information. How can we bring all of these tools into play to maximize the probability of detection while minimizing the probability of false alarm?

So I'm going to talk about a bunch of different things today. I'm going to talk about some of our learning methods, specifically as they apply to background; we'll get into the GIS methods; and then I've got some sensor fusion stuff to talk about.
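As a rough illustration of that K-sigma alarm logic, here is a minimal sketch in Python. The count rates and the detector readings are made up for illustration; the talk does not specify any implementation:

```python
import numpy as np
from scipy import stats

def k_sigma_alarm(counts, background_mean, k=3.0):
    """Flag gross-count measurements that exceed mean + k * sigma.

    For a Poisson background the standard deviation is sqrt(mean),
    so the threshold is background_mean + k * sqrt(background_mean).
    """
    threshold = background_mean + k * np.sqrt(background_mean)
    return counts > threshold

# Expected per-measurement false-alarm probability for a one-sided
# k-sigma test, using a Gaussian approximation to the Poisson background:
k = 3.0
p_false = 1.0 - stats.norm.cdf(k)   # ~0.0013; roughly the 0.3% quoted later
print(f"P(false alarm) per measurement at {k}-sigma: {p_false:.4%}")

# Hypothetical one-second gross counts against a 300 cps background:
print(k_sigma_alarm(np.array([295, 310, 370]), background_mean=300.0))
```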
Let's talk a little bit, though, about the data itself. One of my sponsors, I said, was DARPA. DARPA has a program called SIGMA, and SIGMA's goal is to get a whole bunch of really inexpensive radiation detectors, doing mobile, time-tagged, geotagged collection, out into the field. By a whole bunch, I mean on the order of 1,000 to 10,000 detectors. With the type of detectors we're using, that could be somewhere around two terabytes of data per day, just to give you an idea of how much data we're talking about.

Before I even met my friends at DARPA, I started seeing a lot of other sensor networks coming online. For instance, this here is Safecast, a little Geiger counter. You can buy it as a kit and make your own, and it's got GPS on board. This sprung up as a citizen science effort after Fukushima. You had an obviously pretty bad nuclear event happen at Fukushima, and the citizenry wanted to know: am I at a place where there is elevated radiation? Am I buying food that's contaminated? Is the water I'm drinking contaminated with radiation? So they all just started building their own little detectors and uploading their data, and through a nice API you can go download this data today. Last time I checked it was over 35 million unique measurements, most of them around Japan, but now other people have started saying, hey, this is pretty cool. So what happens when every single person on the planet is carrying one of these things? That's some interesting data.

As an example of that interesting data, here's the output of Safecast looking at Washington, DC. You see here we've got the National Mall; here is my gamma dose, and most of what you see in this nice little heat map is dark blue, meaning low dose, except for this thing right here. Anybody know what that is? Nope. Close; that's the Washington Monument right there, and Vietnam is up here. This is the World War II Memorial. Now, the World War II Memorial, for those of you who haven't been, is, first off, new, but it's made out of marble, and it's particularly thoriated marble. That's important because certain types of marble concentrate natural radiation emitters more than others, and it just so happens that the marble they chose for the World War II Memorial is one of those. So we see elevated radiation counts there.

Now, if I'm using the quote-unquote dumb physicist approach, I say, aha, I have exceeded my threshold here; set off an alarm, call in the Marines, whatever you're going to do. Obviously we don't want to trigger off of the World War II Memorial, but here's the great news about it: it's not going anywhere. I've got some data that says this spot known as the World War II Memorial is radioactive, and I don't care about that. I care if somebody introduces a new source into it. So now I can start looking at data collected over a period of time and start informing future measurements based on past measurements.

While this might seem like an obvious concept, people really started publishing about radiation sensor networks circa 2005. Those were stationary detectors, maybe 10 or 20 detectors in a system, and that was it. They weren't processing things simultaneously, and they weren't looking at the response over the entire network. This is all stuff we can do today, but in 2005 they didn't have the computing capabilities; the term data science hadn't even been coined. So looking at things like this, we started asking ourselves: how can we take advantage of data that's already out there, aggregated in ways that inform future alarms? Can I start improving that probability of detection and minimizing the probability of false alarm just by knowing what my historically measured distribution is?

Another thing I saw as I was really starting to get interested in this stuff is an app called GammaPix. There are several apps like this out there, and they work off your smartphone. You can go download this today from the iTunes Store or Google Play, and it takes the camera in your cell phone and turns it into a radiation detector, with no hardware modification required.
Now, to those of us who do detector physics, that makes perfect sense: it's a silicon chip, your CCD camera, and silicon is a great radiation detector. So this was obvious to radiation detection people. But take a second and think: what if every single smartphone in San Francisco were a radiation detector? What would you do with that? That's a lot of data. And moreover, it's not just every smartphone; we're talking every digital camera. So now I'm talking traffic cameras, surveillance cameras, ATM cameras, the body-worn cameras on police officers. They all could be radiation detectors. How do you make sense of that?

You are going to have hits on that network. You will have false alarms, and you will have what we call nuisance alarms. Nuisance alarms are true alarms: it's a truly radioactive thing that's setting the detector off, but maybe it's not a thing we care about. For instance, a medical patient who's gone in and had some sort of scan, maybe a PET scan for cancer diagnosis, or maybe some sort of internal imaging study. Those people are radioactive. They are truly radioactive, and they will set off a network like this, but we do not care about that. That's not the nuclear weapon I'm looking for. So when you start talking about every single digital camera in a city, you have to think about how to adjudicate alarms on this thing, and that's a really complicated process.

So DARPA comes out and says, okay, we're going to have this thing called SIGMA, and we're going to take these detectors right here. This is a commercially available detector; it will detect gamma rays and neutrons. We're going to put a thousand of these detectors in the hands of, you know, whomever, to make up one network. What they do is communicate via Bluetooth to an Android smartphone: the radiation measurements are done on the detector, the smartphone provides the time tag and the geotag, and then you upload that someplace. In our case, I'm going to show you that we upload it into the cloud. You could upload to a server if you want, though maybe that's not the best idea if you're talking two terabytes a day. But we also have to think about how to adjudicate alarms on this type of system.

When I got involved in SIGMA, which was at the very beginning, people were saying, you know, two hundred gigabytes a day, two terabytes a day, and those of us sitting in the back of the room, our heads started spinning: oh my gosh, what are we going to do with all of this data? If you just use that dumb physicist approach of three sigma, you have a 0.3 percent probability of a false alarm purely from Poisson statistics. Take 0.3 percent times ten thousand detectors whose data is being collected once per second, and that's a lot of things that have to be adjudicated. So anything that can be done to make that 0.3 percent number smaller is a good thing.
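To make the scale of that adjudication problem concrete, here is the back-of-the-envelope arithmetic behind it, a sketch using the round numbers quoted above:

```python
# Back-of-the-envelope adjudication load for the numbers quoted above:
# 0.3% false-alarm probability, 10,000 detectors, one reading per second.
p_false = 0.003          # per one-second measurement at ~3-sigma
n_detectors = 10_000
seconds_per_day = 86_400

alarms_per_day = p_false * n_detectors * seconds_per_day
print(f"{alarms_per_day:,.0f} false alarms to adjudicate per day")
# -> roughly 2.6 million per day, which is why shaving that 0.3% matters
```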
Okay, so let me tell you a little bit about what we're doing at Illinois. We got involved with DARPA, and DARPA was gracious enough to donate a handful, about twenty-three, of these detectors to us, and we got under contract. I'm pleased to announce that my research group opened up the entire state of Illinois to Amazon Web Services. We could do this on Google; we can be cloud agnostic here. But we are not computer scientists, so we went with Amazon because of all of their pre-existing tools that help the non-computer scientists among us make cool things, and Amazon has been very helpful to us, which I do want to acknowledge.

What they have helped us do is this: we have our detector, which, like I said, is paired with an Android smartphone, and using their Kinesis Firehose and Lambda functions, every second we upload the data into an S3 bucket. That's just our raw data; we're exporting CSV files. There's nothing efficient about this, and there are all kinds of ways we can make it efficient, but right now this is version 1.0 of our system. Every detector we have dumps its data into this S3 bucket, and then we have another Lambda function that says, okay, when something gets dumped in here, we're going to process it into some form we can use further downstream. Right now we're outputting to a Postgres database hosted on an EC2 instance, which is fine for a small network but is not any sort of distributed option. For our next generation, we are also outputting to an Amazon Redshift database, which is also Postgres, but now we're talking a more distributed solution where we can do some proper data science on that data set.

Last semester, by which I mean the spring semester of 2016, we took these 23 detectors, and I had an army of undergraduate volunteers whose job was to carry these detectors everywhere they went, 24-7, and keep them charged. That data set began in January and the semester ended in May; we're talking greater than 37 million points, over 150 gigabytes of data, hosted right now on S3. That's the data set that the results for the rest of the talk are coming from.

Okay, so machine learning. Let's start getting into some of the real meat here. The question I asked earlier was: do we want to take the gamma ray spectra and try to make sense of them, or do we want to take the gamma count rates and make sense of those? We're not even going down the neutron path yet; we're not ready. This data is collected once per second, so every second I have a row in my database that gives time, GPS, the detector ID, gamma counts, gamma dose, and then a 4096-channel spectrum.
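For a sense of what one of those per-second rows might look like, here is a sketch of a parser. The field names and ordering are my guess at the record layout just described; the actual SIGMA/UIUC-net CSV format is not spelled out in the talk:

```python
import csv
import io

# Hypothetical record layout matching the fields described above:
# timestamp, latitude, longitude, detector ID, gross gamma counts,
# gamma dose, then 4096 spectrum channels. Treat this as a sketch,
# not the real wire format.
N_CHANNELS = 4096

def parse_record(line: str) -> dict:
    row = next(csv.reader(io.StringIO(line)))
    fixed, spectrum = row[:6], row[6:]
    assert len(spectrum) == N_CHANNELS
    return {
        "timestamp": fixed[0],
        "lat": float(fixed[1]),
        "lon": float(fixed[2]),
        "detector_id": fixed[3],
        "gamma_counts": int(fixed[4]),
        "gamma_dose": float(fixed[5]),
        "spectrum": [int(c) for c in spectrum],
    }
```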
So let's start by briefly talking about background. If you don't know background, you don't know diddly; that's what this comes down to. Background is the thing that is always present, that we somehow have to alarm in the presence of. When we talk signal to noise, background is pretty much the biggest source of noise we're going to be dealing with here.

When we in the detector physics community talk about something like three sigma, that quote-unquote dumb physicist approach, I have already made an assumption, and that is that my data is somehow distributed in a Poissonian or Gaussian fashion. When first responders get these detectors, they're not going out and doing some sort of calibrated setup to say, ah yes, I understand the statistics of my detector. No, they're putting it in their pocket and going on with their day. So for the data that gets collected in the field, we don't know whether they were truly in the presence of a radioactive source or not. Are they walking their path through the World War II Memorial, or someplace with a much lower background? We don't know. But by assuming that we have a Gaussian distribution, we've already made a mistake.

This plot is four days' worth of measurements, where we're just trying to say what the average background measurement was for those four days. What you can see is that, yes, we've got a mean and some sort of assumed standard deviation, but we've also got a whole ton of outliers. We can't control where these students go. There are some spots on campus, and you'll see them when we do the geospatial stuff, that are particularly radioactive. We don't tell the students not to go there; they have to go live their lives, just like first responders have to go do their jobs.
If you actually plot some subsets of that data, a few tests within the whole data set, as a histogram, I can measure the skew and kurtosis of that histogram. For a perfect Gaussian we expect a skew of zero and a kurtosis of three; I have nothing of the sort. So automatically, just by stating that we want to use the three sigma approach, we've made our first mistake. We can't do it just off of this data set. In order to adequately say what background is, we have to play some fancy math games on this distribution, and what we do in my group is use a trimmed and winsorized mean, which helps us ignore some of these outliers. These outliers are only on one side of the distribution, by the way: the regions of higher counts. You're never going to have negative counts. So we trim and winsorize this distribution to get a more accurate mean to compare all of our measurements to.

Radiation, I said, comes from materials and comes from the dirt. It's not constant in position, nor is it constant in time, and here's what I mean by not constant in time. We can have a weather event such as precipitation. What happens is it rains, the water gets into the soil, and that water forces out radon gas. The radon gas gets into the air, and my background rate goes up. Never, ever have the radon measurement in your house done right after a rainfall. This is why we have to somehow take weather into account.

We made an attempt where we took our entire data set and said, okay, let's just look at whether it was raining that day or not, and what did our data look like? In fact, we weren't able to separate rain versus not rain; we got these two distributions that look reasonably similar. Here's the problem: there are multiple things going on. We don't know when the detectors were inside. When you're inside, you're not exposed to the dirt that's leaching the radon gas out, and you have some shielding blocking cosmic radiation. There are all kinds of effects at work here that we have to consider, so just looking at the entire data set, believe it or not, is not sufficient to tell the impact of rain.

So we had to do a more controlled experiment, whereby my postdoc decided that every day at lunchtime he would walk the same path around campus at roughly the same speed. Now we knew he was outside, we knew what the weather was, we knew he was moving at roughly the same speed, and now you actually can see a correlation between the gamma background and rainfall. We've been getting our weather data from Weather Underground, which is a wonderful site for data scraping and a horrible site for actually trying to learn anything about the weather, so we've now gotten our own weather station up and running, so that we can have a more discretized measure of rainfall, wind direction, humidity, things like this. It's more than just rainfall: you have to look at how easily the radon gets into the air, which is a function of air pressure, and how quickly it gets moved away, which is wind. So it's more than just rain, but you can clearly see that there is a correlation between rain and our measured gamma background.
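Pulling together the robust-background pieces just described, here is a minimal sketch of the skew/kurtosis check plus a one-sided trimmed and winsorized mean using SciPy. The 5% trim fraction is an illustrative assumption, not the value the group actually uses:

```python
import numpy as np
from scipy import stats

def robust_background_mean(counts, trim=0.05):
    """Robust estimate of the mean background count rate.

    The count-rate histogram is skewed by one-sided high-count
    outliers (hot spots, medical patients, the World War II Memorial),
    so a plain sample mean over-estimates the quiescent background.
    Winsorizing only the upper tail pulls the estimate back down.
    """
    counts = np.asarray(counts, dtype=float)
    print("skew:", stats.skew(counts))                       # 0 for a Gaussian
    print("kurtosis:", stats.kurtosis(counts, fisher=False)) # 3 for a Gaussian
    winsorized = stats.mstats.winsorize(counts, limits=(0.0, trim))
    return float(winsorized.mean())
```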
Okay, so now let's actually do some machine learning. The first thing we did, because I am a spectroscopist at heart, was start with the spectrum, since the gamma spectrum gives us a unique fingerprint of which isotope or isotopes were present. I said, well, this is clearly the way to go, because this is what will tell us it's a nuclear weapon versus a medical patient. Let's start there. Wrong answer. I'm not going to pull the wool over your eyes; I'm going to tell you up front this is the wrong answer.

Principal component analysis: we're going to try to take our data set and represent it in a lower-dimensional space. For example, if I have some sort of two-dimensional data set like this, you can see that with a proper transform of axes it could be represented with one large component this way and a little component this way, and I can rotate about those two components and start representing my data in a somewhat cleaner way. We said, aha, let's try doing that with gamma ray spectra. Sounds great, and in fact there are people who do isotope identification algorithms based on principal component analysis. Fortunately for them, their problem is a little easier.

Here's an example of thirty-second spectra. We had one detector, and we collected spectra for two different sources: cesium-137, and cobalt-60 down here. These are some common lab check sources, for the non-nuclear-engineers in the room. These peaks, this one right here for cesium and these two right here for cobalt, are the unique fingerprint I'm talking about. The rest of this stuff is noise: it's background, it's Compton scatter, the physics of radiation interactions in material. We don't care about those parts; we care whether something deviates from that shape, such as the presence of a peak here. So we used principal component analysis to come up with a very quick way of capturing this shape and asking, is there anything outside of it? When you do this for thirty-second spectra, yes, you can do it; it works. Unfortunately, we do not have thirty-second spectra; we have many one-second spectra.

We have a test and a training set here, as any statistician should, and here's what we found in terms of the area under the curve, which is another way of looking at a ROC curve. The ROC curve plots the probability of detection versus the probability of false alarm, and the area under the curve is basically what percent of the curve is in the region it should be: if I had zero probability of false alarm and a hundred percent probability of detection, that would be an area under the curve of one. These are normalized quantities.
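As a sketch of that spectral-shape idea, here is how one might use PCA to learn the background shape and score new spectra by reconstruction error. The data are synthetic stand-ins and the component count is an illustrative assumption, not the group's actual pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA

# Learn the background spectral "shape" with PCA, then score new
# spectra by how badly they reconstruct from the first few components.
# A peak the background model can't explain (e.g. the 662 keV
# cesium-137 line) shows up as a large residual.
rng = np.random.default_rng(0)
train = rng.poisson(5.0, size=(1000, 4096)).astype(float)  # stand-in background spectra

pca = PCA(n_components=10).fit(train)

def reconstruction_error(spectrum):
    z = pca.transform(spectrum.reshape(1, -1))
    recon = pca.inverse_transform(z)
    return float(np.linalg.norm(spectrum - recon))

# Alarm when the residual exceeds a threshold set on held-out
# background spectra (the test/training split mentioned above).
```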
What we see is that as we increase the number of components of our data that we keep, the first few components can represent most of our data. But if you actually plot this as a ROC curve, K-sigma is still performing better, because we're talking about one-second spectra, and one-second spectra turn out to look pretty bad. I'm going to skip through some of this; we tried a few different algorithms on one-second spectra, and ultimately what it boiled down to was that, yes, PCA can work when you have thirty-second spectra, when you have a chance of an actual peak appearing in your spectrum. But in one second, the efficiencies of these detectors are not sufficient to distinguish a peak from that Compton shape. So we said, okay, everybody out of the pool. Let's stop on the spectral route and go back to just looking at count rates, which is probably where we should have started all along.

I have another graduate student who likes medical imaging, and he said, you know, this really is a medical imaging problem; we could use maximum likelihood estimation methods to get at when we have deviated from a Poissonian in some fashion. So here's what we do. We know that we have background; we know that my background is a function of position and a function of time; and we know those are Poisson distributed when we don't have the presence of a source. So we assume a Poisson rate parameter, lambda, that is a linear combination of two components, spatial and temporal. Then the probability of my measurement d, as a function of x, y, and t, is what we maximize with respect to that Poisson parameter, separated into its temporal and spatial components. We run the likelihood maximization and get some nice results; let me show you what they are. This is taking our entire data set, the greater than 37 million points, and trying to eliminate the temporal variation, bringing it back to just the spatial variation. So we've taken the 37 million points and created a map of our spatial distribution.
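Here is a minimal sketch of that decomposition under the stated model: counts in spatial cell i at time bin j are Poisson with rate lambda_ij = S_i + T_j, and we maximize the Poisson log-likelihood jointly over S and T. The dimensions, data, and optimizer choice are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

n_cells, n_times = 20, 50
rng = np.random.default_rng(1)
S_true = rng.uniform(2.0, 8.0, n_cells)       # spatial component (per cell)
T_true = rng.uniform(0.0, 3.0, n_times)       # temporal component (per bin)
d = rng.poisson(S_true[:, None] + T_true[None, :])   # synthetic counts

def neg_log_like(params):
    S, T = params[:n_cells], params[n_cells:]
    lam = np.clip(S[:, None] + T[None, :], 1e-9, None)
    # Poisson log-likelihood up to the constant log(d!) term
    return float(np.sum(lam - d * np.log(lam)))

x0 = np.full(n_cells + n_times, d.mean() / 2)
res = minimize(neg_log_like, x0, method="L-BFGS-B",
               bounds=[(0.0, None)] * (n_cells + n_times))

# The split between S and T is only identifiable up to a constant
# offset, which is fine for a sketch; S_hat is the "temporal variation
# removed" spatial map described above.
S_hat = res.x[:n_cells]
```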
There are a few effects going on here, and we're still working on this data. We have some regions of elevated counts; you see there's a nice red region over here. But anybody can make a heat map, and anybody can make a heat map look like anything, so let me show you where we can fool ourselves with a heat map. This is campus. When we have data points here, I assume that our campus is very similar to your campus, in that parking is a pain, and so my students who are carrying these detectors are literally walking with them, in a backpack or something. Now, if I get off campus, where parking is a little easier, I'm in a car. What happens when I'm in a car? First, I have the material of the car around me, and it's shielding my background measurement. Second, when I'm in a car, I'm moving faster. So if I happen to come across a source, say there was a source right at this intersection, I'm not sampling it for anywhere near as long as I would on foot. Even within one second, I've driven past it very quickly; I've not had the same number of photons hit my detector as I would have had I walked past in that one second. So we have to be very careful when we start making maps like these, because we can get into trouble very quickly.

Ultimately, though, this boils down to the ROC curve: did we do better than the three-sigma approach by using this maximum likelihood method? The answer is yes, we did. You'll notice that this is on a logarithmic scale down here. The K-sigma, quote-unquote dumb physicist approach is the solid line down here, and you can see that our curve went up. That's what we want to see: at the small false positive rates, we want a higher true positive rate. So good, we've achieved some goodness with this. Now, we have a lot of work to do. We want to go back and regenerate those maps as a function of velocity; obviously we can't have my guy on foot get a completely different answer than my guy driving a car. That's bad. So that's some of our future work there.

Okay, let's start talking some geospatial stuff. Obviously we have geotagged data, and I have said in many talks that GEOINT, meaning geospatial intelligence, is the one truly unifying "int." It is the one thing where, for any sensor network, whether it's radiation, chem, bio, explosives, anything like this, we care about where the thing is and when it is there. The same goes for any other form of intelligence: where is this bad guy, we want to find this bad guy. So geospatial intelligence is the one thing that truly can bring together a lot of scientists in this field. I encourage you, particularly the students: you may think of geo as, I've got a GPS coordinate, great, I plot that as a pin on the map. I'm here to tell you that there is so much statistics involved in this. We're going to talk today about things like data depth, robust statistics, and all these different coefficients we can use to quantify how spatially related two data points are. There's a lot to this. Do yourself a favor and take a course in geostatistics; it's truly fascinating stuff and I can't do it justice. If you ask me really hard questions, I'm going to give you the email of my postdoc at the CyberGIS Center, who actually did this work.

Spatial clustering methods: the idea here is that I have two data points, and I know their physical separation, the Euclidean distance, but the question is whether these two measurements are spatially correlated.
For instance, a lot of work has been done on spatial correlation in epidemiology. Take Zika virus: we see that all of a sudden there are particular regions in Miami where we start seeing elevated reported incidence of Zika. Is that truly statistically significant, and is it spatially significant? Can we say this is truly just Miami, or is this all of Florida? So we do all kinds of interesting math to try to say whether we have spatial correlation in our data. When we talk about these correlations, we can talk about a global value and a local value. The global value, think of it kind of like an average over my entire data set: useful, but the thing I really care about is the local value. A common global metric is called Moran's I; you can go read about these in any number of geospatial books. We have another method we've been looking at called Getis-Ord statistics, and another method, DBSCAN. These are all common algorithms within the geospatial community.

But we need to talk about robust statistics first. I showed you the plot before with the skew and kurtosis of what we thought was a Gaussian but actually isn't, and it turns out that's going to really burn us here too. The reason it burns us is that I'm going to try to figure out a spatial correlation between my gamma ray measurements and what we call the spatially lagged gamma ray counts, which is just a measure of that spatial correlation. I want to know whether there is any sort of linear relationship between these two things, and here's my data. I have a whole bunch of data down here, and I'm going to try to fit it. If I just do a least squares fit, I have all of this data out here; some of these are outliers, and some are true measurements, and we don't know which is which. I do a least squares fit, I come up with this slope, and by the way, that slope is my Moran's I. But we do know that at least some of these points are outliers. Applying that trimmed and winsorized mean allows us to flag the ones that are truly outliers and get a much better fit to our spatial correlation. This is important because everything we're going to do next is based on whether there is some sort of spatial correlation here: if we don't get this number right, we don't get the rest of it right. So we have to get this number right. Again: winsorize, trim your means, and do not assume a Gaussian. We have no control over where these detectors go.
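Here is a minimal sketch of that Moran-scatterplot logic: with a row-standardized spatial weights matrix, the least-squares slope of the spatially lagged values regressed on the values is the global Moran's I, and winsorizing the one-sided count outliers first is the robust twist described above. The weights matrix and trim fraction are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def morans_i(counts, W, trim=0.05):
    """Global Moran's I as the slope of lagged counts vs. counts.

    W must be a row-standardized (rows sum to 1) spatial weights
    matrix. Winsorizing the upper tail keeps high-count outliers from
    dragging the least-squares fit, per the robust-statistics point.
    """
    z = np.asarray(stats.mstats.winsorize(np.asarray(counts, float),
                                          limits=(0.0, trim)))
    z = z - z.mean()
    lag = W @ z                          # spatially lagged values
    slope, *_ = stats.linregress(z, lag)
    return slope
```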
That being said, here's LISA. LISA stands for local indicators of spatial association, and it's one mathematical way of asking, on a region-by-region basis: I have high counts here; do I also have high counts next door? Now, what would I expect? The sources I'm looking for are point sources. For a nuclear weapon, call it, I don't know, a maximum dimension of a meter; from a few meters away, that acts like a point source. Here's where physics is not kind to the nuclear engineers in the room: radiation follows a one-over-r-squared falloff, meaning if I double my distance from a source, my intensity decreases by a factor of four. Physics is not kind to us, but what that means is we know what the spatial autocorrelation should look like. We know there should be this one-over-r-squared relationship, and we can start applying that by weighting different geographic regions relative to each other. If my source is this tree over here, I know that as I get closer and closer and closer to it, my count rate should go up; my count rate is a function of position here. And I care about point sources, not distributed sources. Distributed sources are things like the World War II Memorial; that's a big thing. The stuff that somebody who's up to no good is going to bring in is not a big thing; it's a point source. We want to be able to get rid of the distributed things and find the point sources.

So this is what we've done with LISA, and here's a LISA heat map. You'll see I have a region up here that is particularly hot on that heat map, and that is real. It's probably hard to see in the back, but there's a building up here called the Nuclear Radiation Laboratory, NRL, and it has a 12,000-curie cobalt-60 source in it, and cobalt-60 is actually pretty radioactive. So that truly is a point source, and we would expect a LISA analysis to show a great deal of heat in the region of a point source, which we do. Over here you see one cell in particular that's particularly hot: there's a particular church on campus with a very nice veranda made of thoriated marble. I don't care about that; it's a distributed source. So when we run this through LISA, we set a cutoff on our p-value, the probability that we truly have a point source, and in doing so I go from this map up here to these cells: these are the only cells identified with a p-value of less than 0.05. It's clear that the NRL is here, but if I squint my eyes funny over here, I see something I don't recognize at first. For the graduates of the University of Illinois, that's the base of the Alma Mater: we have a famous statue on campus, and it's standing on a base of marble. When we're talking about scales, one of these cells is about fifty meters on a side, and the Alma is probably more like five meters on a side, so in this geospatial increment that is a point source. We actually are detecting the radioactivity of the Alma Mater, because in that paradigm it looks like a point source. So that's good.

Another one: robust G statistics, another mathematical way we can start looking at things. I'm not going to spend too much time on this data, because this isn't the part we care about. And robust DBSCAN, which is a method of looking at the clustering of the data.
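As a sketch of the DBSCAN idea applied to this problem, here is how one might cluster the geotagged locations of high-count measurements: a genuine point source should produce a dense spatial cluster of elevated readings, while lone fluctuations stay unclustered. The eps and min_samples values are illustrative assumptions, not the group's tuned parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000.0

def cluster_hot_points(lat_lon_deg, eps_meters=25.0, min_samples=5):
    """Cluster (lat, lon) points of elevated readings with DBSCAN.

    Uses the haversine metric so eps is a great-circle distance;
    label -1 marks noise points (isolated hits, likely fluctuations).
    """
    coords = np.radians(np.asarray(lat_lon_deg))  # haversine wants radians
    db = DBSCAN(eps=eps_meters / EARTH_RADIUS_M,
                min_samples=min_samples,
                metric="haversine").fit(coords)
    return db.labels_
```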
What we really care about is the ROC curve, so let's get there. Here's the ROC curve for all these different geospatial methods. The quote-unquote dumb physicist approach is this red line; we're back on a linear scale, and I apologize for the varying scales. Anything we can do that is above this red line is good. Here's where we truly see the value of those robust statistics, of using those trimmed and winsorized means. Here are the G statistics without the robust measure, just using a Gaussian assumption, and you can see we actually do significantly worse than K-sigma. However, the second I apply robust statistics, I go to the blue line: just by using a proper measure of the gamma ray background, we have improved our statistics that much. That is a win. Clustering, the black line here, is robust DBSCAN. DBSCAN is actually used a lot in sentiment analysis, Twitter analysis, text processing, but we use it in geospatial work because ones and zeros are ones and zeros, and it has actually outperformed all of them. So I hope that looking at this motivates the point: did we do better than the quote-unquote dumb physicist approach? Almost everything, not everything, but almost everything we have tried has been better than the three-sigma approach. Hopefully that's something that makes my sponsor happy.
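For readers who want to reproduce this kind of comparison, here is a minimal sketch of how a ROC curve is built from alarm scores; the labels and scores below are synthetic stand-ins, since in practice the ground truth would come from injected or known sources:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Each algorithm reduces a measurement (or cell) to an alarm score;
# the ROC curve traces true-positive rate against false-positive rate
# as the alarm threshold sweeps over those scores.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 5000)                 # 1 = source truly present
score = y_true * rng.normal(1.0, 1.0, 5000) + rng.normal(0.0, 1.0, 5000)

fpr, tpr, thresholds = roc_curve(y_true, score)
print("AUC:", auc(fpr, tpr))                      # 1.0 = perfect, 0.5 = coin flip
```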
Okay, last I want to talk about sensor fusion. We've got all these detectors, and we've been thinking about them as individual things, but they're not individual. The question is, if I have one detector up the hill over there and another detector over here, why should I treat them as separate? They're part of a whole network. So we have to think about how we start taking advantage of all of that data simultaneously, looking at it back in time: how do we bring the entire network's data into play?

There are a few things any fusion problem has to do. The first is that you need to align yourself spatially. Pretty straightforward, you'd like to think, since we're using geo coordinates. Unfortunately, I'm here to tell you that your smartphone does not have as good a GPS as you think it might; those of you who play Pokemon know what I'm talking about. We have measured our geoposition varying by up to a hundred and fifty meters from one second to the next. That's bad, because all of my math is based on knowing the geo coordinates, so a 150-meter error bar is bad. We also have to align our data temporally; that's easier. We have time stamps, we know what time it is, we're pretty good with time, so we're not even going to worry about that.

We also need to talk about aligning the sensor measurements from different sensors. For those of you who are not radiation detection folks: if I have two detectors of the exact same model, it's like having two Ford Focuses; they still perform differently. There are manufacturing variations; maybe one detector is slightly more dense than the other, maybe one is cut a little thinner. These all matter, so every detector will have a slightly different efficiency, and that can create problems for us. So we need to think about how we align the detector-to-detector measurements as well.

How could we do this? Well, kriging is a very common way this is done in other networks. In kriging, our whole goal is this: we know something about our mean, and our mean is again a function of position and time (I apologize, I represented it as mu here and lambda previously), and I've got an individual measurement that differs from the mean somehow. What we try to do in kriging is fit this such that we get weights, and those weights are somehow going to maximize the probability of a particular measurement having happened. There are a few different ways this can be done. First off, there's simple kriging, where we know what that mean value is and we know it's constant. Not our case: we don't know the mean. We know what it has been in the past, and we have a rough idea through machine learning what it might be today, but right now we're going to assume we do not know the mean background value. Further, we could go to ordinary kriging, where we say we don't know the mean but assume it to be constant. Also a bad assumption in our case, because we've seen how things like rain can impact our mean background measurement. So instead of simple or ordinary kriging, we are forced into universal kriging, which is quite a bit more mathematically intensive, and we're going to start talking about expectation values. In universal kriging we assume there is a linear function that represents my measured value at every point, with an expectation function. Lots of math goes into this, but let me give you the punchline: if I compare the expectation values of two different measurements that may have two different background values, when you work the math out, what you wind up with is just the covariance between the two measurement points. What do I mean by that? Covariance asks: is my error in my measurement here at all influencing my error in my measurement over there? Gut response should tell you no, there is no covariance, but we actually measured it.
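To make the weights concrete, here is a minimal ordinary-kriging sketch. The talk ultimately uses universal kriging, which replaces the constant unknown mean with a trend model, but ordinary kriging is enough to show how the weights in the linear predictor are obtained; the exponential covariance model and its parameters are assumptions:

```python
import numpy as np

def covariance(h, sill=1.0, corr_len=100.0):
    # Assumed distance-decay covariance model (exponential).
    return sill * np.exp(-h / corr_len)

def krige(xy_obs, z_obs, xy_new):
    """Ordinary-kriging prediction at xy_new from observations z_obs."""
    n = len(xy_obs)
    d_obs = np.linalg.norm(xy_obs[:, None] - xy_obs[None, :], axis=-1)
    d_new = np.linalg.norm(xy_obs - xy_new, axis=-1)

    # Kriging system: observation covariances plus a Lagrange-multiplier
    # row/column enforcing that the weights sum to one (unknown,
    # constant mean).
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = covariance(d_obs)
    A[n, n] = 0.0
    b = np.append(covariance(d_new), 1.0)

    w = np.linalg.solve(A, b)[:n]   # the kriging weights
    return float(w @ z_obs)

# Predict the background at an unmeasured point from nearby tracks
# (coordinates in meters, counts per second; all values made up):
xy = np.array([[0.0, 0.0], [50.0, 10.0], [20.0, 80.0]])
z = np.array([300.0, 320.0, 290.0])
print(krige(xy, z, np.array([25.0, 30.0])))
```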
Using our entire data set, we found that our covariance was down on the order of ten to the minus five, so I'm good enough with calling that zero and saying that universal kriging works for us. Given that, what we can do is start making maps. This is just one quick map of about one track's worth of data, but we can start making maps where we krige around a mean value that is neither known nor constant, and what this does for us is start telling us things like: the World War II Memorial is always radioactive. We can create maps of whole regions this way. Now, it is computationally intensive. My group, like I said, we are not computer scientists, but what we have been doing is this math using Spark and Spark SQL, to start saying, here's what this map ought to look like. This is very much work in progress for us. Stay tuned; a year from now we'll have the entire planet completely kriged.

Okay, so let's talk about future work. Where are we going with this? Obviously we have a lot of work to do on the kriging, but it bears the probability of high payoff: having the ability to say to a first responder, yeah, you know that World War II Memorial you're standing in right now? That's got a higher probability of higher gamma background, so we're going to set our alarms differently based on geopositional information. We believe that will have a huge payoff on our ROC curve.

Fusion of other data types: I've really only talked to you about gamma counts. I steered us away from gamma ray spectra, I haven't said anything about neutrons, and lord knows I haven't brought in anything really hard like lidar or other imaging modalities. We haven't brought in the velocity of the detector. There are all kinds of other things that need to be brought into our overall data stream. Weather: we're collecting this weather data and we're in the midst of using it right now; it's its own little SQL merge fun.

Proxy signatures of proliferation: all this is well and good, we've got these measured data points, but I don't have ground truth. Anything I can do is semi-supervised at best. Fortunately, I don't have cases I can point to saying, this is what happens when somebody tries to smuggle a nuclear weapon into a city. Knock on wood, I never have that. But we do need to develop math around this, so we are always looking for different proxies for that same problem. I'll give you an example. Proliferation is obviously very strongly in the news today, and the question is, what data sets are out there that might look like that, that aren't truly examples of proliferation, but have known answers we can use supervised learning on? One example that was suggested was earthquake detection on Twitter, which I would think would be of interest in the Bay Area. Using only Twitter, no other source of information, when an earthquake happens, can I determine its magnitude and its epicenter? Okay, well, that's kind of fun. I start looking at tweets. Some people choose to give their geo-coordinates, some don't. One person says, uh-huh, I just felt what felt like a truck driving by; okay, that's going to be one magnitude. Another tweet, and we certainly saw this in the Haitian earthquake: my house is on top of me, somebody come help dig me out. That's a different magnitude, right? So we can look at things like sentiment analysis.
We can do clustering on the tweets themselves; that DBSCAN method was developed for doing just that. That type of proxy is what we're always looking for: other problems that might look like proliferation. And then, obviously, you probably don't have a single person come into this room and give a talk saying, oh, I have enough data, everything scales fine. Obviously we need to be looking at how our process scales up, how we ingest more data, and how we do it in real time, because it doesn't do any good for me to have my calculation take one year if there's already a big stinking crater in the ground.

What my group is doing, as nuclear engineers, is trying to do enough data science to improve that ROC curve, increase our probability of detection, decrease our false alarm rate, and bring in people from other disciplines, like geostatistics and GIS, to help in that effort. That is my last slide. I want to thank you for coming and putting up with a really boring talk right after lunch, and I'm happy to take any questions.