Okay, so I'm a bioinformatics scientist at the Children's Hospital of Philadelphia, and I work with different labs in sort of a data science capacity. It's a research hospital, so there's a lot of medical research and basic research going on there. This work was done jointly with Dianne Taylor, who's a professor at UPenn and the Children's Hospital. I'll be talking about work we did several years ago, looking at breaches of protected health information (PHI) on GitHub.

So we have to go back to September 2016, when people could still go outside and mingle without being afraid of catching any diseases. Microsoft had not yet acquired GitHub; that would happen two years later. At the time, and still today, scientists were being encouraged to use GitHub to make reproducible workflows and better document the whole scientific process. But they're not given a whole lot of training before they're just told to go out and use this tool, and that can result in problems.

Google has a tool called BigQuery. As part of promoting that, they had archived GitHub from 2011 to 2016 so that people could quickly investigate the whole GitHub code base. So there were some Medium posts that would look for, like, the most common swear words in different Git commits, or try to determine what the most popular language was across different domains.

In the news, there were a bunch of concerns about privacy leaks. Yahoo had recently released data about their 2014 breach, where they had leaked the information of more than 500 million users. And there were also sort of entertaining posts, more on the development side, where people were talking about how they had accidentally committed their Amazon keys to GitHub, and someone had found those and was using them to rack up giant compute bills. So these are private keys that are used to interact with Amazon, to prove who you are and access the EC2 instances that you've spun up.
But if you accidentally commit them to GitHub, then anyone can rack up a large bill on your Amazon account.

So we were thinking about these kinds of leaks in the context of our own jobs, where we deal with patient data: data that's out there in the public domain that has been de-identified and is free for anyone to use, but also private data being generated at the hospital, with medical records that need to be kept internal and should never be leaked out. So this is a Wikipedia page describing PHI. It covers pretty much anything you would need to steal someone's identity, plus all of their medical data. And we thought, given that BigQuery has now archived all of GitHub for us, it seems like it would be pretty easy to go through and look for different PHI keywords and do a quick investigation.

The BigQuery table for GitHub was about 1.7 terabytes, and at the time it cost $5 a terabyte to run queries, and we were kind of cheap. So we abandoned this route and went straight to GitHub itself to run a quick proof-of-concept test. With the advanced search, you can query by language, and here you can look for CSV files as a stand-in for text files. As keywords I used "last name", "DOB" for date of birth, and "patient" to try to pick up any PHI as a first pass, and I found about 30 pages that matched these keywords. The results looked similar to this, where for each search result you're linked to the exact file that has it, you get a highlight of where the keywords are, and then you're pointed back to the repository so you can browse it at will.

So I ran this one Saturday night and started going through the results one by one, by hand. Some of them obviously had mock or sample test data that I was not concerned with. But I did find one repository that had several patients in it.
So I took those names and cross-referenced them with Facebook. In addition to the patient names, the files had social security numbers, emails, and physical addresses, so we were able to check that a person's hometown on Facebook matched the address in the leaked GitHub document. That's what finally convinced us that these were actual people and that their social security numbers had been leaked through GitHub.

So I put in an issue on the repository on GitHub that night. A GitHub issue is where you can comment on someone else's code, and my comment was, you've leaked a bunch of patient information. A few hours later, the repository was removed from GitHub, but fortunately we had cloned it locally. By going through the documents, we were able to find the company it was related to, and we went to the company's website and found the guy who owned the repository. His username was the concatenation of his first and last name, and he was featured as one of the people in the company, so the link was pretty obvious. We contacted the company as well so that they could contact patients and tell them that their data had been leaked. All in all, it was about 30,000 patients whose data had been up on GitHub for about six months, but after our investigation it was promptly taken down.

So with this in hand, we went to GitHub and notified them of the problem, but they basically said they couldn't do anything about it. They weren't going to pull these repositories and comb through them. So we decided to comb through them for GitHub instead.

We started with BigQuery. We sent an email to Google pointing them to the repository and said, if you have an archive of GitHub, then you probably have this one repository in it, and you should remove it from your public GitHub archive. They emailed us back saying that they did not have that repository, and it seems that the Google archive of GitHub is not quite complete.
And I tested at the time that several of my own repositories were not in there, so it's not a complete view.

So we figured out we couldn't rely on BigQuery to do the search. Instead we went back to the GitHub API and did automated queries of different keywords like "patient", "date of birth", and "SSN" to pull down a bunch of search results. To make the search space smaller, we filtered out things that obviously looked like test data: if there was, like, "test name" in the file or "sample" in the file name, we discarded those. We ended up with about 270 repositories that might contain PHI, and those were cloned down locally. Then we used the Stanford NLP toolkit along with some simple regular expressions to look for PHI elements like organizations, phone numbers, street addresses, and social security numbers. Of those results, we did a manual review of each repository to see if these were indeed real people, and then tried to track down the repository owners and the companies they were associated with.

So, to show the results that we found. In one instance, there was a crisis center that had subcontracted out some work to a software consulting studio, and they had accidentally committed some of the data they were working with: about 2,500 patients with their medical conditions, medications, names, and social security numbers. That had been up there for about five years. The repository itself had the company's logo in it, and we were able to use that to contact the company so that they could contact patients and get the repository taken down.

The second leak was at a rehab center, and it was caused by an employee working there as a software engineer. It was thousands of patients and their associated doctors. It took a while to contact the repository owner and the company.
So after a couple of days, we reported the leak to the US Department of Health and Human Services, and eventually it was possible to track down the company itself and get the repository removed from the public space.

And then as a final leak, there's a popular insurance company that had hired a software subcontractor to promote a wellness program that was conducted across several clinics. The file paths look similar to what's shown at the top, so it was organized by clinic, and they leaked several social security numbers, dates of birth, addresses, and emails over a period of several months. The insurance company's website branding was all over the file names, so we were able to track them down and make contact with them. They were very thorough in making sure that everything was taken down and that we scrubbed all the data on our side, and they had many chats with our compliance office.

So to conclude, I just have some tips that you might use in case you're working with patient data. GitHub now supports free private repositories, which seems like the easiest route to take to not expose anything. As an alternative, if you're at a larger company, you could host your own GitHub Enterprise instance and maybe set up some internal practices for security reviews or code reviews to make sure you're not leaking any information when you commit. If you're a developer, try using mock data instead of real data, and also tell your friends about possible PHI leaks and try to get the word out so that this won't happen as much.

So, just some additional tips. I use a project cookiecutter that lays out my folder structure the same way across every project. If I have any patient data, I put it in a protected folder that's also in my .gitignore file, so there's no chance that it could be committed to GitHub. You could also use a Git pre-commit hook where you could hook it up to a PHI scanner.
I've linked one here on GitHub. It could flag and notify you, before you commit anything to GitHub, that you might have some patient information in your files. You could also use sites that mock up fake data for you. This one is a paid site; I'm sure there are free versions out there.

And as a final note, GitHub is not the only tool that we might be leaking PHI through. If you're using containers, be aware that you may be putting PHI into those containers and then sharing them through sites like Docker Hub or Quay. They already have tools built in that will do vulnerability scans, so it seems feasible that you could hook up some kind of automation that does a PHI scan as well and notifies you when that happens.

So that's all. Thanks to the organizers for the conference. Here's my Twitter handle and my email if you want to reach out.

Awesome. Thank you so much. So it looks like there's two questions. If you're willing to, we have a bit of time, if you're OK to answer them.

Yep.

So the first question comes in the form of, as my computer slowly freezes, let's see here. Were these in tracked CSVs, hard-coded strings, or both?

Great. Next question, inspired by Marianna's comment: I see so many Git tutorials and I've never seen one that covers privacy. Do you know any good tutorials for Git that talk about privacy concerns?

I am not aware of any that mention privacy concerns, no.

Well then, that seems like an inspiration for folks out there to start one.

Yes.

And we have a third question. Also wondering if these repositories were archived in the Software Heritage archive.

Ooh, I've not checked.

Or like the Wayback Machine.

Ooh. That's a good question.

That is an amazing question. It looks like we've answered them in record time. Thank you so much, Perry, for your presentation. If there's no more, oh, one more question. Next: have you talked to GitHub? Do you know if there's any work to reduce the likelihood that they are a vector for privacy leaks?

Let's see.
So initially we did reach out to GitHub, years ago, and they were not any help. But now that Microsoft owns GitHub, they're probably more open to being accountable for the content that's hosted there. We've not reached out a second time since the change of hands.

Awesome. Thank you so much, Perry.
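
The pre-commit idea from the talk can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the scanner linked in the talk and not the Stanford NLP pipeline used in the study: the regular expressions, the `scan_text` and `looks_like_test_data` helper names, and the file-name hints are all hypothetical choices made for the example, and simple patterns like these will miss plenty of real PHI.

```python
#!/usr/bin/env python3
"""Hypothetical PHI check for a Git pre-commit hook (illustrative sketch only)."""
import re
import sys
from pathlib import Path

# Assumed, deliberately simple patterns; real PHI detection needs much more.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"(?<!\d)\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}(?!\d)"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "keyword": re.compile(
        r"\b(?:patient|dob|date of birth|ssn|last ?name)\b", re.IGNORECASE
    ),
}

# Assumed file-name hints, loosely mirroring the test-data filter in the talk.
TEST_DATA_HINTS = ("test", "sample")


def looks_like_test_data(path):
    """Return True if the file name suggests mock or sample data."""
    name = Path(path).name.lower()
    return any(hint in name for hint in TEST_DATA_HINTS)


def scan_text(text):
    """Return (line_number, pattern_name) for every line matching a pattern."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PHI_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, kind))
    return hits


def main(paths):
    """Scan the given files; return 1 if anything was flagged, else 0."""
    flagged = False
    for path in paths:
        if looks_like_test_data(path):
            continue  # skipped by name only; real data can hide behind such names
        try:
            text = Path(path).read_text(errors="ignore")
        except OSError:
            continue  # unreadable or deleted file: nothing to scan
        for line_no, kind in scan_text(text):
            flagged = True
            print(f"{path}:{line_no}: possible PHI ({kind})")
    return 1 if flagged else 0


# Only act when file paths are passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

A hook script could call this with the staged file names (for example, the output of `git diff --cached --name-only`); a non-zero exit status from a pre-commit hook stops the commit. Skipping files by name keeps the noise down, as in the study, but it is a trade-off, since a file called `sample.csv` can still hold real patients.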