All right, next up we have Tom Samstag talking about the InfoCon DB. Please welcome Tom.

Welcome. I'm happy to announce something that I hope will be a good resource for the community: InfoCon DB.

So first off, I have a quick disclaimer. I'm going to talk a little bit about presentation stats, such as how many presentations different people have. Everything is as of the current state of the database, which is not complete. If the database does not represent your years of presentation history, don't be offended. Nothing was omitted intentionally. Things just haven't been cataloged yet. Everything's far from complete.

So with that said, who am I? My name is Tom Samstag. I go by tecknicaltom, or teck on IRC. I am a principal security engineer at Security Innovation. We are a security firm specializing in application security, always looking for good engineers, especially in the Seattle area. So if you understand how software works and know how to break it, check us out. Also check out CanYouHack.Us. It's a cool challenge site that we have for recruiting, but it should be fun even if you're not interested in a job. Fun little CTF-type thing.

I also hang out with the Neg9 crew. I am part of the core Neg9 staff. We are a hacker group that is now primarily located in Seattle. We've been doing CTFs for a long time, and we apparently accidentally won Hacker Jeopardy last night. Still trying to figure out how that happened.

So, yeah, normally I'd end my "about me" there, but we need a little bit more background here. I like cons. I like hacker cons. I like attending hacker cons. This is my current badge collection.
It's not as extensive as I'm sure many of yours are, but I'm at the point where I'm afraid to try new conferences, because I know if I do, it'll be added to my must-attend-every-year list. I just really like attending cons. Where my friends are the ones that get back from DEF CON talking about how they're never going to go to Vegas again, I am the one who's already planning next year and how to make next year even better.

I also enjoy watching videos of conference presentations. If you also enjoy watching presentations from conferences, I hope you know of the site infocon.org. This is an amazing resource. It grew out of The Dark Tangent's project, the Data Duplication Village, which originally was hosted at DEF CON, and people came to him and said, "We want access to this outside of DEF CON." So this is the Internet-facing version of the Data Duplication Village, where he has terabytes of media from conferences going back through hacker history.

So it's a great resource, but once you start using it, you realize it's really just a directory structure of files. It's just like browsing an open directory.

So at the same time that I am getting really into watching stuff on InfoCon, I also watch a lot of TV. I'm at home watching TV with my Kodi media center, and I love how it talks to TheTVDB and displays all kinds of metadata about the shows that I'm going to watch. It shows synopses and actors and actor bios and all of that. Those two combined, InfoCon and the way that Kodi displays that metadata, are what inspired me to start infocondb.org.

So, infocondb.org. I've been selling it as IMDb for hacker cons. As of yesterday morning, it's live and publicly available: infocondb.org.
That's a snapshot of the front page as of when I made these slides yesterday. I'm probably losing some of you right now because you're just going to start browsing it, and I'm fine with that. But I want to talk a little bit about what went into making it thus far, because this is an ongoing project.

While making this site and this database, and curating this huge amount of data, I realized that my time was split among four different goals. These four goals are all vying for my time. They're all competing with each other, and only by balancing the four of them am I able to make this site really what I want.

The first goal is the platform. That's the technology behind the website itself: making it a good UI, making it usable, and also the admin stuff on the back end that lets me manage the site and manage all the other goals appropriately. The second is coverage: getting as many cons as possible, so getting the data from every con that's happened. Turns out there's been a few conferences in the past couple of years. The third goal is completeness: make sure that every conference that I have indexed, and every presentation at every conference, has as many of the data fields populated as possible. And the fourth goal is accuracy: make sure that everything that I am displaying is as accurate and true to what actually happened as possible.

So, going through those goals: the platform. This is actually the easiest of them. Right on the about page, I give a little bit of a description of what technology goes into it. It's a simple Django web app powered by Postgres on Docker, hosted by DigitalOcean. The technology behind it is relatively straightforward.

Moving on, though, to coverage.
So this is, again, trying to bring in the data from as many conferences as possible. All of the conferences, eventually. There are several options depending on how the conference has its data.

The best option is if the conference has its data in a structured format. That's the case if the conference website used frab, similar to how ToorCon did this year and ToorCamp before it. Another, third-party option that some conferences have used is called Sched. These options are great because I can just download the JSON files and ingest it all in a matter of seconds. Unfortunately, they're not the most common option.

The most common option is web scraping. I have to preface this with a little bit of a trigger warning. You may be offended by the next point here: I am doing web scraping with regular expressions. I know that may upset some of you, and you'll say, "Why not use something like Beautiful Soup?" In all honesty, I tried to go that route. I'm looking at web pages for conferences that were held in the 90s or earlier. All of these things are so horribly inconsistent, even within a single page, that trying to do any kind of scraping that takes the HTML markup into account is doomed to fail. So most of the data in InfoCon DB right now was scraped with regular expressions off of web pages. Sorry if that upsets you.

The next problem that we run into is that there are plenty of conferences that don't have archives. DEF CON is amazing because you can go and view the website for DEF CON 1. Well, I don't know if DEF CON 1 had a website, but you can go back to DEF CON 2 or 3 and see the web page as they published it then. Not all conferences have that. A lot of them are, say, a WordPress website where /schedule is the schedule for the year, and then after that conference they change the schedule and that content is lost. A lot of times I've been relying on the Internet Archive, but that's still hoping that the
Internet Archive spidered that site at the right time of the year to get that content, and in lots of cases that hasn't happened. Some of this content just isn't available on the web anymore.

That leads to the next option, and I actually have to credit my wife for this one. DEF CON, I'm sorry, ToorCon 13. The Internet Archive failed to get any descriptions of any of the talks, so my wife volunteered to manually transcribe the paper program. That's how all of the descriptions for ToorCon 13 are in here. That might be an ongoing thread, depending on what the conference organizers are able to scrape up from their archives.

And then in the future, especially in the coming months and year, hopefully I'll be reaching out to more conference organizers and seeing what digital archives they have of their past stuff, maybe working out a way that I can get any structured data that they have. I don't care what format it is. I can make it work. Anything is better than writing a regex to scrape web pages or typing up programs.

So that's the second goal. The third one is completeness. This is making sure that every data point that can have a value does. When you're scraping all this data en masse, a lot of things can fall through the cracks, so I've got back-end scripts that find missing data. One example of that is at the bottom of the front page. There's a box saying, "Hey, do you want to help? Does GrrCON 2017 have a conference YouTube playlist URL?" What that is: every time the main page is loaded, it randomly finds a data point for a conference that it doesn't have a value for and asks the user, "Hey, can you help me here?"
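To make the "can you help me here?" box concrete, here is a minimal Python sketch of the idea: collect every conference field that has no value, then pick one at random each time the page loads. The record layout, the field names, and the functions `missing_data_points` and `pick_help_prompt` are all assumptions made for illustration. The talk does not show the site's actual code, which would presumably query the Django models rather than a list of dictionaries.

```python
import random
from typing import Optional

# Hypothetical records and field names, for illustration only.
CONFERENCES = [
    {"name": "ExampleCon 2016", "website": "https://example.org/2016",
     "youtube_playlist_url": "https://example.org/playlist"},
    {"name": "ExampleCon 2017", "website": "https://example.org/2017",
     "youtube_playlist_url": None},
    {"name": "SampleCon 2015", "website": None, "youtube_playlist_url": None},
]

# Fields that may legitimately be empty and that a visitor could fill in.
OPTIONAL_FIELDS = ("website", "youtube_playlist_url")


def missing_data_points(conferences, fields=OPTIONAL_FIELDS):
    """Return every (conference name, field) pair that has no value yet."""
    return [
        (conf["name"], field)
        for conf in conferences
        for field in fields
        if not conf.get(field)
    ]


def pick_help_prompt(conferences, rng=random) -> Optional[str]:
    """Pick one missing data point at random and phrase it as a question."""
    gaps = missing_data_points(conferences)
    if not gaps:
        return None  # nothing left to ask about
    conf_name, field = rng.choice(gaps)
    label = field.replace("_", " ")
    return f"Do you want to help? Does {conf_name} have a {label}?"


# Each call may ask about a different gap, like each load of the front page.
print(pick_help_prompt(CONFERENCES))
```

Passing a seeded `random.Random` instance as `rng` makes the choice repeatable, which is handy when testing a prompt like this.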
So that's just the very first example of user interaction. Hopefully in the future that will expand to elicit more help from users.

In addition to that, for completeness, I have scripts in the back end that help find schedule holes. I know when every presentation was and what room it was in, so find the holes for me. Maybe a presentation just didn't get scraped properly.

I have automated scripts for matching YouTube playlist and infocon.org videos to presentations. I point it at a YouTube playlist for a conference, and it uses youtube-dl, which most of us use just for downloading things from YouTube but which is also handy for getting the metadata from a playlist, and an awesome Python library called fuzzywuzzy, from SeatGeek, for fuzzy string matching. It goes through and figures out which presentations match which videos and does all that automatically.

I'm also automatically extracting Twitter handles for people. One of the safeguards for that is going through and extracting anything that looks like a Twitter handle, but then making sure not only that it's unique to that presenter, but also pinging Twitter to make sure it's a valid Twitter handle. That kind of stuff.

So all of those speak to completeness. But just filling in a value for every piece of data doesn't help if we don't also stress accuracy. There are several ways to focus on accuracy.

One thing you'll notice as you browse the site is that for each presenter I store two fields, one for name and one for handle. It's an interesting side effect of our community. Name doesn't have to be a legal name, but it's a value that's used more like your name, and a handle is used more like your handle. My classic example of this is Cheshire Catalyst. That's his handle, and when asked what his name is, he says Richard Cheshire. Now, that's not his legal name, and he's upfront about that, but that's his name to the hacker community. So there are scripts right now that look at names and try to
extract which part is a name versus a handle, in all of the many ways that hacker cons have credited people, whether it be name and then handle in parentheses, or name slash handle, or a dash, or an AKA. Every way that you can imagine, it's been done by some conference.

There's also a lot of work in presenter deduping. Every way that a presenter's name can be mangled or presented has been done. My best example of this is Marcia Hofmann from the EFF, who does a lot of work in the hacker community. This is her list of credits on the page as of yesterday, and we see almost every variation in the number of F's and the number of N's in her name, even as recently as 2014. So it's finding these duplications of data and the slight errors that are introduced, and coalescing those people.

In the future, user interaction is going to be a big part of this. A planned feature for the hopefully near future is a way of allowing people who are browsing the site to flag inaccurate data. I don't think it's going to be a Wikipedia model anytime soon, where anybody can edit anything. But allowing users to say, "Hey, this doesn't look right," or "This is obviously a scraping error, this person's name shouldn't have HTML in it," is going to be the next step in letting users help make this a better site.

So those were the four goals that I was using to steer my development. The other interesting thing that I wanted to throw out here is why this is worth me being on stage and not just saying, "Go check out this website."
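Going back to the name-versus-handle splitting mentioned a moment ago, a rough Python sketch of that kind of script might look like the following. The three credit formats, the `Credit` type, and the `split_credit` function are hypothetical, chosen only to illustrate the approach with regular expressions. They also assume the name comes first and the handle second, which real conference pages do not guarantee, and the talk does not show the patterns the site actually uses.

```python
import re
from typing import NamedTuple, Optional


class Credit(NamedTuple):
    name: Optional[str]
    handle: Optional[str]


# Hypothetical credit formats; real conference pages vary far more than this.
_PATTERNS = [
    # Name (Handle)
    re.compile(r"^(?P<name>.+?)\s*\((?P<handle>[^()]+)\)$"),
    # Name aka Handle, or Name a.k.a. Handle
    re.compile(r"^(?P<name>.+?)\s+(?:aka|a\.k\.a\.)\s+(?P<handle>.+)$",
               re.IGNORECASE),
    # Name / Handle
    re.compile(r"^(?P<name>.+?)\s*/\s*(?P<handle>.+)$"),
]


def split_credit(raw: str) -> Credit:
    """Split a presenter credit into name and handle where a pattern fits."""
    text = " ".join(raw.split())  # collapse stray whitespace from scraping
    for pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            return Credit(match.group("name").strip(),
                          match.group("handle").strip())
    # No recognizable separator: treat the whole string as the name.
    return Credit(text, None)


for raw in ("Richard Cheshire (Cheshire Catalyst)",
            "Jeff Moss aka The Dark Tangent",
            "Matt Blaze"):
    print(split_credit(raw))
```

Anything the patterns do not recognize falls through with no handle, so odd credits can be left for manual review rather than guessed at.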
There's a couple of interesting findings. I announced on Twitter a few days ago that if anybody was able to guess the most prolific presenter, the presenter who has had the most credits to their name, I'd give a free bottle of booze to whoever could guess first. And, kind of surprisingly, nobody guessed. I put it out there because I thought it would be difficult for anybody to guess. Everybody thinks of the presenters that they remember, the ones that are active in the hacker community, but they forget some people.

So this is currently the list of the most prolific presenters, and you see it's actually Kurt Opsahl from the EFF, who does at least one "Ask the EFF" panel every DEF CON and sometimes two or three more presentations every year. Nobody guessed him. But then we see a lot of the common names that we'd expect. Now, again, this is as of the data that I have. This is going to change over time, and it's displayed dynamically on the front page.

Another interesting piece of trivia: the 20-plus-year club. I looked at everybody in the database who had more than one presentation and sorted them by the duration between their first presentation and their last presentation. We actually have people still presenting today who have been at this for well over 20 years, and I thought I'd share that list with you here. Of course, Jeff Moss, The Dark Tangent. He's been doing this for a little while. Emmanuel Goldstein. So we start to see a lot of the people who have been active in conference organization at the top. Matt Blaze, another name that you should all recognize. And then, let's see, Kingpin, Mudge, Weld Pond, Space Rogue. They just did a panel at DEF CON, so we see a lot of that. So these, what was it,
12 people have all been active in this community for over 20 years. This is information that you can't really get a good grasp of until you start amassing this amount of data and are actually able to pull interesting statistics out of it.

So what's next for InfoCon DB? I'm releasing it today. I made it publicly accessible as of yesterday. It's been in active development for about a year now. So what's next?

First, obviously, making sure that it can actually survive all of you hitting it. I've been the only one using the site for a while, so making sure that you guys don't bring it down will be a good first step. So far, it seems like it's working.

One early piece of feedback I was given was to not just reinforce the presenters that are very active, not just reinforce the rock star mentality, but also be more inviting. So one of the next new features is going to be a page highlighting new presenters, people who are presenting for the first time this year. Because it doesn't matter if you're Dan Kaminsky presenting while, you know, not fully cognizant, or doing your 20th presentation, or you're up here for the very first time: it takes a lot of guts to get up here, and we need to encourage people to share their wealth of knowledge and be active in this community. So, highlighting people who are up here for the first time.

Advanced random presentation. There's a link on the front page for a random presentation, and one of the upcoming features will be an advanced version of that. For instance, say you're watching a video on your lunch break, so you ask the site, "Give me a random presentation with a YouTube link, done in the past five years, under 30 minutes, and maybe about this subject matter."

Tagging is another feature that's going to be a lot of work. The beginnings of it are already there. You'll see it on some pages, but it's not exposed very well yet, and the data is not populated yet.
But being able to tag subject matter means you can actually do a deep dive into something that interests you.

User data error flagging. That's what I was talking about earlier: allowing users to help curate this data, make it accurate, and help this continue to be a good resource.

API access is something that's high on my list. It kind of speaks to why I did this in the first place: allowing people to access this for whatever cool projects they have ideas for. Since I made this public, I've already seen somebody scraping or crawling the pages that isn't a search engine I recognize. So somebody obviously wants to get all this data. Might as well make it easy for them.

And some other ideas about ranking popular presentations, some way to allow people to see the must-see presentations.

So that's all I've got for you in this presentation. But unlike most presentations up here, this isn't the end of the story for this project. This is just the beginning. I'm releasing infocondb.org to all of you, to the community. I want it to be a good resource. Feel free to drop me any comments that you have, any feedback, any bugs. I can be reached, well, if you go to infocondb.org and go to the about page, there's a feedback form with email. Feel free to check it out and get back to me. Thank you.

Any questions?

Yes, it is powered by Postgres. So yes, a SQL database.

Any questions? Yes.

Yes. This is a project of love that I've spent a year on. I don't foresee anybody wanting it. I don't foresee me giving it to anybody, much to the annoyance, probably, of my wife, for the amount of time I have spent on it and the time I will continue to spend on it. But yes, this is a project of the community's, I think.

So, I have been focusing mostly on data from cons that have already happened. I don't want to be the "these are the conferences this year" web page.
I have seen too many of those come and go. It's a hard task to try to be that timely with data. I'm looking more for the archive. That said, I did make sure to get this ToorCon in before this ToorCon started, because it'd be kind of silly to not be able to look up this con. But yeah, the focus is more on an archive of the past than on "make sure you go to these cons in the future." I'm not ruling it out, though.

Yes?

Yes, the question was about open sourcing it. It's an idea. It's something I've thought of. It's less of a focus because it's a service, but I haven't ruled it out. It was more about making it functional up to this point.

Yeah, that's a good question. He asked what qualifies or disqualifies a conference for making the list. I've largely been focused on the obvious hacker cons, and there are so many of them that it's not like I am sitting around waiting for the next conference to happen. I am, at least as of now, focusing on security-related conferences, with hopefully less of a focus on the more business-y side of things and more on the hacker side of things. That may change in the future, though. Right now it's largely been ad hoc. A lot of it is going to be where I can most easily get the most bang for my buck in scraping, just because there's 30 years' worth of hacker history to try to import here.

Yes?

Yeah, so I tried to find similar sites before starting this. Trust me, I did not want another project on my plate. I've got plenty of unfinished projects already. I wasn't able to find any until a good way through development, when I found another one. It was not easy to find, and in my personal opinion, I feel that maybe they spent more effort on getting every conference and less effort on the accuracy and the completeness side of things. So for me it didn't quite meet my priorities. That said, it was the only instance
I have found that was trying to do exactly this. There are other sites out there that tried to do the database of conferences, mostly looking forward, and they unfortunately don't seem to stick around very long. I don't know if it's just the amount of effort to keep them timely that kills them or not. But I would be more than happy to talk to anybody who has other examples of similar things.

Going forward, ideally, it will be less about scraping web pages and more about receiving structured data. We seem to be more aware of that nowadays than we were when, you know, DEF CON started. Then again, I have evidence that DEF CON is still hand-editing their website for the schedule: spaces and inconsistencies in the markup. So it's going to be a battle, I think, for some.

Ultimately, I want this to be a service that I provide on behalf of the conferences, so hopefully they're incentivized to work with me, without impacting all of the work that they put into actually throwing the conference. Yeah, I will definitely be doing more reaching out to the conference organizers going forward.

I see more people piling in, getting ready for closing. So feel free to grab me some time after closing if you want to talk further. Otherwise, check out infocondb.org and shoot me any feedback through the website. Thank you.