This is our comprehensive talk on adventures in the dark web of government data. Your speaker is going to be Marc DaCosta. He's the co-founder and chairman of Enigma, an open data infrastructure company. He's got a PhD in anthropology and writes about culture and technology. Without further ado, I'm going to pass over to Marc. Thank you very much.

Thank you. Well, thanks everyone for coming. I'm certainly excited to have the opportunity to share with you all some of my adventures and my real fascination and passion for public data. So yes, I'm Marc, from New York City originally. Sometimes I go around the city with my laptop and a large antenna and tune into some of the fun things that can be overheard on the radio spectrum around the city. As mentioned, I do a lot of work with public and government data, and at Enigma, the company that I started, we have a big open source search engine called Enigma Public of all of this stuff that we've aggregated and brought together. But to kick things off, I think it would be helpful to get some clarity on terms. What exactly is government data? And does it really have a dark web?
So it's interesting. I think one of the easiest ways to think about this more expanded idea of government data is that it's the thing that's produced every time you come up against regulation. We of course have these sprawling bureaucracies at federal, state and local levels, and every time you touch them they have a way of kicking off some data exhaust. From a reconnaissance and open source intelligence perspective, this can be really good. One of the really interesting maps, at least at the US federal level, to what's going on from a data collection perspective came out of this thing from 1980 called the Paperwork Reduction Act. Basically, what happened in the 70s is there was just a massive proliferation of forms and government information collection instruments, and it rose to the level where Congress passed a law. What that law said was basically that every time the federal government wants to make a new form, they have to themselves fill out a form and register it with the Office of Management and Budget, which is part of the executive branch. Just to show you here, this is an ordinary tax return, a 1040, and it has this OMB control number on it. And this is great: any time you have a US federal government form, it will definitely have an OMB number on it somewhere. When a government agency wants to make a new form, they've got to apply to the OMB, and one of the things they have to do is justify why they need the form and also estimate what are called the burden hours of the form. So right now there's maybe about 10,000 different unique forms that are registered with the federal government, and what I find extraordinarily remarkable is that, according to the government's own estimates, they require 11.3 billion hours each year of people's time
to fill out. So we can certainly extrapolate that there's a lot of information being produced here. Just to flag this, if this is something that's interesting to you guys and you want to explore further: it's kind of hard to Google for, but it's called the Current Inventory Report. I made just a little bit.ly link that'll drop you right into the proper government site, and it is kind of fun, because there is an XML file that has all of this stuff structured in it, and you can go and play with it. And so, just to flesh out what this real spectrum of information the government's producing is: if anyone's come into the U.S. from abroad, you've probably seen this form. It's one of the most filled out, with over 300 million of them a year. Things like W-2s, so the tax form for if you're on payroll somewhere: a quarter billion of those are produced a year. I was surprised to see that these friction ridge cards (fingerprint cards), there are actually about 90 million of them filled out every year. And I suppose it's not all strictly for people being arrested. This one that I just found on Google Images is for someone applying to be a pyrotechnic operator, so I suppose these things are produced in lots of ways. So those are some of the more common forms that are produced, but there also is a really long tail here: everything from the 20 or 30 companies that actually fish off the Alaska coast near Russia and have a specific form they need to fill out, to the importation of shelled peas from Kenya, and things like the Petroleum Supply Reporting System, which I'm not sure exactly what it is, but it does sound like it could be interesting and juicy in some ways. And so once you start to know that there is a form out there (they're of course not all public, but there's a really interesting tranche of them that are), you can start to go out and
collect information. So just as an example, this is what a Federal Election Commission form looks like. I don't know if you can see it, but this is a line item by line item disbursement schedule for all of the things that the Trump campaign spent money on, so we have a hundred and forty dollar Uber credit there. I'll talk a little bit more about this data set later, but the FCC licenses all commercial radios in some way in the country, and you can use that data set to actually find every McDonald's drive-through in the country and also what frequencies it's linked to the restaurant with. I'm sure you guys have been in elevators; you'll see these little inspection placards. That's run at the state level, but that's data that you can get, and learn all about what's going on inside of a building. Certainly aircraft registrations are really interesting; they have tail numbers and all sorts of interesting joins you can do with radios. This, as an example, is just the deed for the hotel that we're in. And when you take it a step further, it's kind of cool, because building permit applications get filed for changes of use in the space or renovation.
There's often architectural drawings and things like that, so this is also from the hotel. The Department of Labor collects a lot of information on things like this. This is, I think, for OSHA, so the occupational safety and hazard something: a little bit of a sad case of someone who fell down an elevator shaft here. But there's a lot of information produced. H-1B visas: I couldn't really show it very clearly here, but these are the 14 or 15, or whatever it is, H-1B visas that Caesars applied for in 2018, and you can see they're mostly tech, programming-looking people. This I just found and thought was kind of funny: it's the 401(k) plan that DEF CON Communications has for the four people that are enrolled in it. And this is actually one of my all-time favorite pieces. This is a customs declaration from the 1960s that the Apollo 11 mission filed upon coming back with moon rocks. It just is a lovely artifact of bureaucracy, I think, and does give us some sense of the kinds of things that do appear hidden away in the state. I think the takeaway that I want to leave you guys with, just from having blown through all of that stuff, is that government bureaucracy can really be your friend. There's certainly a key set of sources of government data that are in our toolkits, be they real estate records or corporate registrations or whatever. But this is a really deep and massive well of resources. Think about what the processes are and how that potentially reflects in data.
You can start to develop all sorts of new avenues for research and exploration that way. So, I have a big personal interest in software-defined radios and the EM spectrum, and I was really curious to see how the public data that's available around the usage of the electromagnetic spectrum could be used to ask different questions of the world. I'm sure this won't really be a surprise to anybody in this room, but of course radio waves are all around us. They're in that really cool spectrum that takes us from all the visible light that we see around us to the FM stations in our car and the Wi-Fi, and all of these things are just waves of different lengths. Of course, Marconi is often credited as being one of the inventors of radio, and it's kind of amazing: in its early days it was, not surprisingly, a terribly unregulated and quite chaotic technology. People were just broadcasting, creating all sorts of interference. And actually a lot of the regulatory regimes that we now have in the U.S. are said to come as a result of the sinking of the Titanic, in part because the Titanic, being a new ship, did have a radio operator on it and was sending out SOS messages, but the thought was that there was so much interference on the land-based stations that a lot of those messages weren't received. That led eventually, in 1912, to Congress passing what was called the Radio Act, which became the precursor to setting up the FCC regulatory regime that we have today. So we of course now live in a world where there is a lot more attention and regulation around the radio spectrum, and that's actually really cool and exciting when it comes to trying to understand how this spectrum is being used. So I'm just curious, show of hands: has anyone seen this map
before? So it looks like maybe 20% of people. I'll keep coming back to this throughout the remainder of the talk, because I think it's a really good touchstone for understanding how a lot of these things exist next to each other. I know it's a little difficult to get much detail on this screen, but it goes basically from maybe three kilohertz all the way at the top to 300 gigahertz all the way at the bottom, and each one of those little blocks is basically a reserved set of uses that that bit of the spectrum can be used for. So you can see here the FM radio band, of course, 88 megahertz to 108 megahertz roughly, and that's blocked off there. But what's interesting is you can start to see that these things exist next to and alongside other uses of the spectrum. So further down, in the 150s, 160 megahertz range, is where this thing called AIS, which is maritime ship positioning data, is transmitted. And then further down, in the next block, at 1090 megahertz or just about a gig, that's where all of the aircraft are broadcasting their positioning data. I just call that out to show how these different protocols and uses of the spectrum do have a kind of continuity to them. And of course there's a ton of politics and money at stake here. As we know, recently, as analog television has all been shut down, that spectrum is getting sold off, and just last year you had 20 billion dollars being spent, mostly by the big telco companies, to get access to some of that stuff that was freed up. So needless to say, this is a very high-stakes, if somewhat obscure and invisible, place that data is produced.
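The band examples from the allocation chart can be captured as a small lookup table. This is a minimal Python sketch, not anything shown in the talk: the table name `BANDS_MHZ`, the function `describe_frequency`, and the exact window edges are assumptions chosen for illustration, and only the three uses named above are included.

```python
from typing import Optional

# Approximate windows in MHz for the uses mentioned in the talk.
# The edges are illustrative assumptions, not official allocation boundaries.
BANDS_MHZ = [
    (88.0, 108.0, "FM broadcast radio"),
    (161.9, 162.1, "AIS ship position reports"),
    (1089.0, 1091.0, "ADS-B aircraft position reports"),
]


def describe_frequency(freq_mhz: float) -> Optional[str]:
    """Return the label of the first window containing freq_mhz, else None."""
    for low, high, label in BANDS_MHZ:
        if low <= freq_mhz <= high:
            return label
    return None


if __name__ == "__main__":
    for f in (98.1, 162.025, 1090.0, 500.0):
        print(f, "->", describe_frequency(f))
```

A real version would load the full allocation table rather than three hand-picked rows; the point is only that each block on the chart is a frequency range with a reserved use.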
So in the US, of course, the FCC is the main regulatory body here, and they basically collect a ton of different information and release it in two different ways. The first one, which has the most data, is this thing called the Universal Licensing System. There's maybe 15 or 16 different kinds of licenses that they wind up giving out, and each one has a lot of detailed information associated with it. As part of an open data initiative, the FCC has done some work to unify all of that into this database called the License View database. I think it's maybe a hundred columns that are harmonized across all of these things. What's nice is it collects all of these different licenses in one place, and it basically pops out in one CSV file. This is a bit.ly link to a GitHub repo I made, which makes it relatively easy: if you have a Postgres server running, you can run the script and it'll download the most recent version of this database, geocode it, geo-index it and make it searchable for you. And the cool thing is, once you do that, you can actually start to use this data to ask really targeted and specific questions about your local environment. So this is a search I did of a one-kilometer radius around the Caesars hotel here, and I said basically: for all the licenses that have been given out within a kilometer of here, who has the most of them, and what are the rank-ordered counts of how the spectrum is being used? Probably not super surprisingly, the top three are all Nextels; this is your cell phone stuff. But then, digging in, it was interesting for me to start to learn: where am I, and what's going on around here?
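The one-kilometer search described above can be sketched in plain Python. This is an in-memory stand-in for the Postgres setup the speaker mentions, not his script: the row fields (`licensee`, `lat`, `lon`), the function names, the sample rows, and the approximate hotel coordinates are all assumptions made for illustration.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def top_licensees(rows, center_lat, center_lon, radius_km=1.0):
    """Count licenses per licensee within radius_km of the center, most first."""
    counts = Counter(
        row["licensee"]
        for row in rows
        if haversine_km(center_lat, center_lon, row["lat"], row["lon"]) <= radius_km
    )
    return counts.most_common()


# Made-up rows standing in for geocoded license records (hypothetical data).
SAMPLE_ROWS = [
    {"licensee": "LICENSEE A", "lat": 36.1170, "lon": -115.1750},
    {"licensee": "LICENSEE A", "lat": 36.1165, "lon": -115.1740},
    {"licensee": "LICENSEE B", "lat": 36.1100, "lon": -115.1720},
    {"licensee": "LICENSEE C", "lat": 36.2000, "lon": -115.1745},  # about 9 km away
]

if __name__ == "__main__":
    # Approximate coordinates for the hotel, used only as an example center.
    print(top_licensees(SAMPLE_ROWS, 36.1162, -115.1745))
```

With a real export loaded into a database, the same question is a distance filter followed by a grouped count; a spatial index matters once the table holds millions of rows.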
So Perini Building Company is a legitimate construction firm that has no ties with the mafia, but they have done a lot of the casino construction in Las Vegas, and they're certainly one of the biggest holders around here. Then, drilling down, we of course see that a bunch of the casinos themselves are really big recipients of licenses. I was kind of surprised to see (because they'll come up later in the talk) that this firm ReconRobotics, which by their own tagline is the world leader in tactical micro-robot and personal sensor systems, has a good 32 licenses right in this part of Las Vegas. That in fact puts them on par with DEF CON, who I was very impressed to see is quite fastidious about making sure that the official FCC licenses are all filled out. One other thing that I'll call out here, which I think is really important to keep in mind when you're working with these government data sets, is that there can often be a lot of confusion and difficulty when it comes to doing entity recognition and resolution. Towards the bottom here we have PHWLV LLC, which I saw and said, what is that? In fact, it's the parent company, a Planet Hollywood holding, that owns this casino and many others. So now that you can start to identify what's going on around you geographically, how can you start to use and apply that?
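The entity resolution problem mentioned above, where a licensee string such as PHWLV LLC has to be mapped back to a recognizable owner, can be illustrated with a very small normalizer. This is a hedged sketch: the suffix list, the `ALIASES` table, and the function names are assumptions for illustration, and the single alias entry simply records the mapping the speaker describes.

```python
import re

# Corporate suffixes to ignore when comparing names (illustrative, not exhaustive).
SUFFIXES = {"LLC", "INC", "CORP", "CO", "LTD"}

# Hand-built alias table; the one entry reflects the example given in the talk.
ALIASES = {"PHWLV": "Planet Hollywood"}


def normalize(name: str) -> str:
    """Uppercase, strip punctuation, and drop common corporate suffixes."""
    tokens = re.sub(r"[^A-Za-z0-9 ]", " ", name).upper().split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def resolve(name: str) -> str:
    """Map a raw licensee string to a known entity, else its normalized form."""
    key = normalize(name)
    return ALIASES.get(key, key)


if __name__ == "__main__":
    print(resolve("PHWLV, LLC"))
    print(resolve("Example Radio Co."))
```

Real entity resolution usually needs more than string cleanup, for example corporate registry lookups, but normalizing names first makes the rank-ordered counts far less fragmented.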
Of course, it's been quite amazing to see in the last several years how cheap software-defined radios have gotten and how much that's really opened up. For those of you who don't know, for literally 20 bucks you can get a little USB dongle that will let you tune into a pretty broad spectrum. I think these will go from maybe 50 or 60 megahertz to just over a gigahertz, or something like this. They're really cool and very easy to get started with. This is a program called Gqrx, which is just a really simple tuner, so if you plug in one of these USBs and put in a frequency, you can listen to whatever might be coming across it. What's interesting is that we can start to not only look at the clustering of radio licenses around us, but actually dig into them a bit more specifically. What's really nice about these is that you do get some very high-resolution information about how organizations operate and function. So this one is for the Caesars hotel. It's one of many that they have, but what was interesting is the person who actually filled out this license. His name is Eric Dominguez, who is the VP of facilities and engineering here, and what's also included is his phone number and email address. It is his direct line; I called him, so I do know that to be true.
And so these things become interesting when you're trying to think about other ways of understanding a target or a place of interest, and finding things that let you have a lot of base knowledge about what's going on. If anyone's interested, these are a big tranche of the radio frequencies that Caesars Palace itself has licenses for. There are other ones under other entities that didn't come up in my first search, but they can be ferreted out. And just to remind us to keep all of this in context: we can see these Caesars Palace radios are in the 450 meg zone, but then just a little bit down the spectrum we've got the radio frequencies being used for the control infrastructure around the water system in Las Vegas. So it's a very rich and crowded space. But of course this isn't limited only to these sorts of things. NOAA-19 is a weather satellite that's flying around overhead. It all operates in the 137 megahertz range, and a friend of mine in New York actually built an antenna and a GCal reminder for whenever this weather satellite is over the Eastern Seaboard.
He can bring this thing outside and actually download the images, because of course, with these satellites, the images are coming down unencrypted and are there for gathering. That's the URL for it, if anyone's interested. But I was also very curious to see in what ways different kinds of public data could start to get joined with what we know is available on the radio spectrum, in order to do things like maybe look inside of a cargo ship. So of course today ships are really diverse radio stations in and of themselves. You can see here you of course have GPS antennas and maybe satellite TV and ham radio antennas, but importantly, up here on the top left is an AIS antenna. AIS stands for Automatic Identification System, and it's basically a radio protocol that is used for navigation and safety. Whenever a ship is underway, it broadcasts some information encoded on this channel, and it all basically lives around, I guess, 161, 162 megahertz; there are two different channels that it goes on. What's interesting is that if you have a line of sight, or have a decent antenna, you can actually, using one of these $20 dongles as an example, receive those AIS messages that the ship is sending off. Here, in this text box, you can see what the raw demodulated packets look like. There's a great Python library called libais, and there are many other ones, where people have taken the spec and done all the decoding. The data you're getting when you're listening to these ships basically breaks down to what you're seeing here, and this tells you things like the position and heading and rate of turn and things like that. But importantly, it also has this thing called an MMSI. MMSI stands for Maritime Mobile Service Identity; it's basically like the cell phone
number of the ship, and you can use that to then join with a second-order piece of government data here. I wrote an API, which is all linked in that repo that I showed earlier, to connect to the International Telecommunication Union, take that MMSI identifier of the ship, and turn it basically into the vessel name and some other information about the ship itself. Once you have those two pieces, you can then get to the place where you can actually look inside of a ship. The way that you do that, the conceit here, is by taking bills of lading data that often get filed before a ship hits the port and that explain, basically for the purposes of customs taxation, everything that's inside of the ship. That data is made available in a very crazy way: the only way that anyone can get access to it is by going to the Customs and Border Protection office in Washington, DC, giving them a hundred dollar certified check, and getting a CD in return. But through Enigma Public we actually gather all of that data, and it's free, with an API on it. And so you're able to stack all of these things together. I'm just on time? Okay. So that's one example.
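The chain described above (raw AIS message, then MMSI, then vessel name, then bills of lading) can be sketched end to end in Python. This is a hypothetical illustration rather than the speaker's code: it decodes only the message type and MMSI from the 6-bit armored payload of a sample AIVDM sentence, and the `VESSELS` and `BILLS_OF_LADING` tables are invented placeholders standing in for the ITU lookup and the customs data.

```python
SAMPLE_SENTENCE = "!AIVDM,1,1,,B,177KQJ5000G?tO`K>RA1wUbN0TKH,0*5C"


def payload_to_bits(payload: str) -> str:
    """Undo AIS 6-bit ASCII armoring: each character carries six bits."""
    bits = []
    for ch in payload:
        value = ord(ch) - 48
        if value > 40:
            value -= 8
        bits.append(format(value, "06b"))
    return "".join(bits)


def decode_header(sentence: str) -> dict:
    """Extract message type (bits 0-5) and MMSI (bits 8-37) from one sentence."""
    payload = sentence.split(",")[5]
    bits = payload_to_bits(payload)
    return {"msg_type": int(bits[0:6], 2), "mmsi": int(bits[8:38], 2)}


# Placeholder lookup tables (invented data, standing in for the real sources).
VESSELS = {477553000: "EXAMPLE VESSEL"}
BILLS_OF_LADING = [
    {"vessel_name": "EXAMPLE VESSEL", "description": "PLACEHOLDER CARGO 1"},
    {"vessel_name": "EXAMPLE VESSEL", "description": "PLACEHOLDER CARGO 2"},
    {"vessel_name": "OTHER VESSEL", "description": "PLACEHOLDER CARGO 3"},
]


def cargo_for_sentence(sentence, vessels, bills):
    """Join a received AIS sentence to vessel name and matching cargo rows."""
    mmsi = decode_header(sentence)["mmsi"]
    name = vessels.get(mmsi)
    return name, [b for b in bills if b["vessel_name"] == name]


if __name__ == "__main__":
    print(decode_header(SAMPLE_SENTENCE))
    print(cargo_for_sentence(SAMPLE_SENTENCE, VESSELS, BILLS_OF_LADING))
```

In practice a library such as libais would handle the full message decoding, and matching on vessel name alone is fragile, so a real pipeline would want a stronger key where the data provides one.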
I'll also just quickly talk about using ADS-B data, which is very similar to AIS, but it's for aircraft. There's a really interesting piece of work that was done by BuzzFeed, specifically around looking for the extent to which governments were using Stingray devices, which often are put in aircraft and flown in circles when they're going after a target. Stingrays, of course, are ways to track and intercept some very specific cell phones. Basically, what they did that was really smart is that they were able to take all of the ADS-B flight data (there are companies like FlightAware and others that aggregate it for the whole US) and apply some basic analytics to it, to look for all of the flight patterns over cities where planes were just flying in circles a lot. Based on that, they were able to identify both airplanes that were very clearly registered to Homeland Security or to a police department, and also, in addition, all of these new companies that were shell companies being used by the government, which they were able to back into once they knew that those companies were potentially of interest because of these unusual flight patterns. I think when we think about all of the different radio devices that surround us all of the time, there are a lot of different opportunities and examples of taking this contextual public data and applying it to those devices. In closing, since we're coming up on time, I want to tell you about another investigation that I did here, around trying to understand the surveillance infrastructure along the US-Mexico border. What you're looking at here is just a slightly interpolated map of all the radio licenses that are within 10 kilometers of the US-Mexico border. When I was looking at them, you see these normal dispersion
patterns around cities, of course; the radio towers and uses are all over the place. But what I was very interested in was seeing, out in some of these more remote desert frontier places, these very regularly spaced towers that were being put up along the border. This one in particular was put up by a company called IMSAR. So I started looking: what does IMSAR do? Well, they make the ground radar packages that go on Predator drones and things like that. So I thought this could be interesting, to try to dig in and get a sense of who and what else is happening along the border. This is just a count of who all of these entities are that are showing up doing experimental work specifically along the border. I just called out that company ReconRobotics, which was the one I had mentioned earlier that is also doing a lot of work around this hotel. But then I went through and actually wanted to look at all these companies, and basically found (I suppose it's not so surprising) that in fact the vast majority of them are defense contractors of different stripes. Starting to go through and look at who these companies are and what they're doing, you stumble upon all this really fascinating technology, I suppose. So TCOM makes these aerostatic blimps that are used as surveillance platforms. Leonardo DRS is an Italian defense contractor, and they purport to have the most widely used ground surveillance radar. You see a lot of these interesting packages. ELTA is an Israeli defense company that does a lot of border security work.
ELTA is also working there, as is Elbit Systems. And so what's really interesting is that you can, again, pivot from these very specific licenses, or these aggregates of licenses, to then go and look at where the sites are and where these sorts of things are happening. It's kind of incredible for me to then actually be able to go over to Google Maps, punch these things up, and start to see all of the sites where these bits of exploration and prototyping of this virtual fence are starting to happen. Just as a last piece of context there: a bunch of these were part of an older program that Boeing had, which wound up being a massive disaster. They were supposed to be able to cover the entire border for seven billion dollars, but wound up spending a billion dollars to only do 50 miles, and the thing didn't even work. But the thing that I'll leave you with, and that hopefully came across in the talk through these examples and the context of what's possible with data more generally, is to really think about not only where these deeper, perhaps unseen bits of data are, but how they can be put together to tell broader stories. So anyway, thank you very much.