Well, hi everyone. My name is Jose Hernandez and I'm a data scientist at the University of Washington eScience Institute. Just a little bit about myself: my background is in applied statistics, social sciences, educational measurement actually, and before that I did some data science work in the nonprofit sector working with education data. Way before that, I'm actually from South Central Los Angeles and Santa Ana, California. Anyone here from those two places? Awesome. Which one? Oh nice. So you know Santa Ana, and now I'm here, and I never thought I would be talking about version control, stakeholder reporting, and building an end-to-end reporting infrastructure. So I'm happy to be here. This is my first time at CSV conference. It's great. I saw the llama, or is it an alpaca? No, it's a llama, not an alpaca. I took a picture of it; I didn't take a picture with the llama. Today I'm going to talk to you about an ongoing collaboration that a team at eScience, including myself, has been working on for the last couple of years. I should say I've only been at eScience for one year, so I came into a project that already existed, but it's evolved a lot in the year since I came in. I'll tell you a little bit about what the eScience Institute is, who we are, what we do, and why we exist. I'll try to do it justice; there are other people in this room with way better background on this. Yeah, no pressure, right? Then I'll cover the types of collaborations and projects that we foster and develop at eScience, and in particular I'm going to focus on a project we have with the housing authorities in Seattle and King County, and also with the Homeless Management Information System (HMIS) in King County.
So I'll talk about that project as an example of how we started thinking about collaborating with folks who are in the trenches doing work, who want to have a lot of data, and who want to inform their practice. I'll talk a little bit about the different types of collaboration. There are academic collaborations, where we're thinking about research and things that are more traditional in academia, but there are also collaborations with folks on the ground who want to inform their practice, and we find both of those equally valuable. So I'll talk about that project in particular, then some of the tools that we've developed and built for it, and then what the collaboration looks like. It's iterative, it's changing. This is all a work in progress, in terms of what we're building but also how we're collaborating. We're learning as we go. There's no exact science to it; there are best practices, and we're trying to figure out a way to make this work, and we're using this approach in our other collaborations as well. For example, I collaborate a lot with folks in the education sector, and I essentially borrow this blueprint and implement it there too. So what is the eScience Institute? Essentially, we are the data science hub at the University of Washington. We really do believe that open science is a foundation for scientific discovery, and there are three areas that we really focus on. One of them is research: both the application of data science and the methodology people use under this huge umbrella that is data science. We also provide data science training, which can take the form of workshops.
There are folks on staff who collaborate a lot with Data Carpentry and Software Carpentry, but there are also a lot of folks doing work on formalizing data science as a career track. For example, there's a data science master's, and there's a way for other programs in the university to have, say, a data science concentration in their curriculum, and I think that's a huge contribution to data science as a field. Ultimately, we're building a community of practice that's open, reproducible, rigorous, and ethical. We really do stick to those principles, and we try to practice them inside our academic collaborations but also outside, in collaborations with government organizations that have very sensitive data and are working with folks on the ground, trying to help people have better lives locally. So, like I mentioned, types of collaboration: in academia, as data becomes more available and fields become more data rich, folks need a way to work with these data, and we're situated so that we can offer those resources within the academic setting. And then government and nonprofits: specifically, as agencies build up administrative data systems, we see these as previously untapped data sources that we can leverage for research and inquiry. But like I mentioned before, the organizations or agencies that hold these data can also leverage them to inform their own practice, to iterate and get better at what they're doing on the ground. These collaborations happen either formally or informally. Formally, for example, we have some programs that folks within the institution and outside can apply to be part of.
For example, Data Science for Social Good is a 10-week summer program where folks either within the academic setting at UW or in local government organizations can apply; they turn in a proposal for a project with their data that might need data science support. These folks might not have the resources, for example, to hire a data scientist full time, so if they get selected they get paired up with a data scientist who's on staff, plus student support in the form of four or five students per project; students actually apply from all over the US to be part of these projects. There's also an incubator program in the winter, which traditionally has been more academic folks and is more personalized. People come in with projects and get paired up with a data scientist, and they work one on one on a specific data topic or project for the duration of the winter quarter. So they get one-on-one training or help on something they want to develop, and these things sometimes spin off into bigger projects. We also offer office hours, where people come in and just ask us whatever they want. I had a student once come in, I think they were participating in one of these boot camps, and they needed help on a project assignment they were doing, so, you know, I helped them out, and that's okay.
And so, like I mentioned, one of the projects that came out of the Data Science for Social Good program, about three years ago, was a project working specifically with the Homeless Management Information System (HMIS) in King County. It was later expanded into a bigger project incorporating the housing authorities, with the help of funding from the Bill and Melinda Gates Foundation. They did a grant proposal where we were able to get funding to investigate not just HMIS but also the housing authorities, and the primary question was understanding housing instability in King County. The population we're exploring is homeless individuals, or people at risk of becoming homeless, and the agencies really wanted to understand those people's interactions with human services across these several agencies. These agencies cannot, for example, share data with each other, and they can't link these data themselves, so they came to us and said: okay, there are these three data sources; we want to understand how people are navigating or transitioning, either from homelessness to better housing situations, or from very dire housing situations into homelessness. That's the essence of this project. Some of the questions these stakeholders have, for example: are there significant differences among the households successfully transitioning from HMIS to SHA or KCHA, the Seattle and King County Housing Authorities? So, from homelessness to the housing authorities. What does a successful transition look like? We spend a lot of time trying to answer these questions with the folks on the ground. And then, going deeper: can we model successful transitions from HMIS to, for example, subsidized housing?
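To make the transition question concrete, here's a minimal sketch of how you might count households that exit HMIS and later enter SHA or KCHA housing, assuming you already have a linked longitudinal event table. The field names and event encoding here are hypothetical illustrations, not the project's actual schema:

```python
# Hypothetical sketch: given linked longitudinal events, count persons whose
# HMIS exit is later followed by an entry into SHA or KCHA housing.
# Field names ("person_id", "system", "event") are invented for illustration.
from collections import defaultdict

events = [
    {"person_id": 1, "system": "HMIS", "date": "2016-01", "event": "exit"},
    {"person_id": 1, "system": "SHA",  "date": "2016-03", "event": "entry"},
    {"person_id": 2, "system": "HMIS", "date": "2016-02", "event": "exit"},
]

# Group each person's events in chronological order.
by_person = defaultdict(list)
for e in sorted(events, key=lambda e: e["date"]):
    by_person[e["person_id"]].append(e)

def transitioned(history):
    """True if an HMIS exit is later followed by an SHA/KCHA entry."""
    exited = False
    for e in history:
        if e["system"] == "HMIS" and e["event"] == "exit":
            exited = True
        elif exited and e["system"] in ("SHA", "KCHA") and e["event"] == "entry":
            return True
    return False

n = sum(transitioned(h) for h in by_person.values())
print(f"{n} of {len(by_person)} households transitioned")  # 1 of 2 households transitioned
```

The hard part, of course, is getting a table like this in the first place, which is what the record linkage work below is about.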
These are the kinds of questions they give us, and we iterate through the data that we have and try to answer the questions they have on the ground. These aren't intended to be public reports that they want to put out. They just want to understand: what's happening with our population, who's accessing our services, and how can we get better at that? From the get-go, like I mentioned, this is messy administrative data from three data sources, with no common key to link across them, and yet they want to know these trajectories. So they were like: how are you going to do that? We implemented a data linking method that's prominent in medicine, where it's used to link medical records. We do have a lot of personally identifiable information (PII) that we can leverage to find individuals across these different data sets. The bulk of the work of this grant was really figuring that out: going through different algorithms, implementing them, and seeing which one was better to use for our case. And like I said, it's still ongoing. Right now we're using a fuzzy string matching implementation, but we're moving towards a more probabilistic method to determine matches between individuals. So we're iterating on this piece as we go as well, and as you'll see later, there are a lot of pieces in constant iteration to make sure we're doing things right. At the foundation of everything we do at eScience in our collaborations is developing software that's open, reproducible, and that folks can use on the ground.
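As a rough illustration of the fuzzy string matching idea (a sketch, not the team's actual implementation), here's what a match rule can look like using Python's standard-library `difflib.SequenceMatcher`: declare a match when dates of birth agree exactly and the names are similar enough. The record fields and the 0.85 threshold are assumptions for the example:

```python
# Illustrative fuzzy matching on PII, using only the standard library.
# Record layout and threshold are hypothetical, not the project's actual rules.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity ratio between two name strings after normalization (0.0-1.0)."""
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio()

def fuzzy_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Match when DOB agrees exactly and full-name similarity clears the threshold."""
    name_a = f"{rec_a['first']} {rec_a['last']}"
    name_b = f"{rec_b['first']} {rec_b['last']}"
    return (rec_a["dob"] == rec_b["dob"]
            and name_similarity(name_a, name_b) >= threshold)

# A transposed-letter typo still clears the threshold when DOB agrees:
r1 = {"first": "Jonh", "last": "Smith", "dob": "1980-01-02"}
r2 = {"first": "John", "last": "Smith", "dob": "1980-01-02"}
print(fuzzy_match(r1, r2))  # True
```

A probabilistic approach, by contrast, would weight each field's agreement by how informative it is (a shared rare surname says more than a shared common one) and produce a match probability rather than a hard threshold.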
We knew we had this partnership with the housing authorities, and we wanted to make sure that whatever we used to process and clean the data could be adopted by them after the grant: our grant has an end date, and we hope that what we build can be implemented and used in-house for their own purposes afterward. We're trying really hard to do that. I'm going to go over some of the pieces of software we've developed that are open source, and some that are not, and I'll mention why they're not open. For example, Puget was based on code developed during that Data Science for Social Good collaboration, but it was re-engineered to fit our specific needs. In the DSSG program, they had this Homeless Management Information System data, and they did a lot of work to process and clean it and make it easy to use for analysis. We were able to leverage that work and build on top of it, making it more general for what we needed. They also added a lot of test coverage, so there are a lot of unit tests built into this software. And then, because of the grant, we were able to focus on developing the handling of PII, record linkage, and an improved clustering method that we're using for this. One thing that made adoption very easy is that the two people who created this code are on our team right now as well. So it was easy to say: okay, we're going to use this, we're going to iterate and build on top of it. That was "easy," in air quotes, but it really did make it straightforward to iterate and generalize the code to encompass the different data that we have. This other one is a housing package that is in R.
So Puget is built in Python and housing is in R. The housing code was initially developed in another collaboration at UW, between social science and the Seattle and King County Housing Authorities. What it did, essentially, was clean up the administrative data from these two agencies and create a matched database of the two. That was maybe five years ago, but then it was adopted by King County Public Health. Initially it was just one analyst who did further development, to match it against public health data; they expanded it to fit the needs they had at the agency. We had a decision to make for this project: we were building our data pipeline in Python, and we knew that, technically speaking, we could have built everything in Python to clean the data we had from these agencies. But we really wanted to practice what we preach. They had already run with this code and developed it to fit their specific needs, so we decided to use what they had done, develop a workflow of forks and remotes and pull requests on their repository, and work with them to make it more general so that we could use it as well. They were developing this tool for a very specific purpose at King County Public Health, and we continue to work with them. I'm the point person on this one, making sure that it's general enough for us to use but that whatever we add doesn't break what they need to do internally. This has been a pretty awesome experience. It's very hard and complicated, but we know the reward is going to be great afterwards, right?
Part of the challenge when you develop open source tools is: are people really adopting them and using them on the ground? This was a case where they had adopted something and were using it, and we just had to work backwards. It involved a really crazy rebase that I had to do. That was when I came onto the team, and they were like: oh, you know R and you know some Python? Well, shoot, we should rebase this onto what they have, because we were about two years behind what they had developed. That was a great learning experience, and I can say I did a whole rebase and lived to tell the story. And then, lastly, there is this AWS utility. Our work is essentially all in the cloud; we use, for example, S3 buckets to store our data. We needed a tool to help us take our raw data, process it through all the procedures that are in Puget and in housing, and essentially spit out the data table that is used for analysis. This data table has the linked records, and we use it to answer some of the questions the agencies have for us. So it's, like I said, a family of functions that processes the data from raw to a linked longitudinal file. It depends on both Puget and housing, and those are the data cleaning workhorses of this pipeline. This really means that we have to communicate and collaborate closely with the folks in public health who maintain the housing repository that's in R. I'll talk more about this near the end: it involved not only collaboration, but a lot of training and education, going to their office, sitting down with them, and saying, hey, this is how a fork-and-remote workflow works, how to do pull requests, how to file issues, for example.
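The shape of that raw-to-linked pipeline can be sketched like this: clean the PII fields from each agency's extract, then assign a shared person ID to records that agree, yielding one longitudinal table. For brevity this sketch links deterministically on normalized name and DOB, whereas the real pipeline uses fuzzy and probabilistic matching; all function and field names here are hypothetical, not the actual Puget or housing APIs:

```python
# Hypothetical sketch of the pipeline shape: raw extracts from each agency
# are cleaned, records are linked on PII, and the result is one longitudinal
# table keyed by a shared person_id. Deterministic linking is used here only
# to keep the example short; names and fields are invented for illustration.

def clean(records):
    """Normalize the PII fields used for linking."""
    return [
        {**r, "first": r["first"].strip().lower(),
              "last": r["last"].strip().lower()}
        for r in records
    ]

def link(sources):
    """Assign a shared person_id to records that agree on name and DOB."""
    ids, linked = {}, []
    for source_name, records in sources.items():
        for r in clean(records):
            key = (r["first"], r["last"], r["dob"])
            person_id = ids.setdefault(key, len(ids))
            linked.append({"person_id": person_id, "source": source_name, **r})
    return linked

hmis = [{"first": "Ana ", "last": "Lopez", "dob": "1975-03-01", "exit": "2016-05"}]
sha  = [{"first": "ana",  "last": "Lopez", "dob": "1975-03-01", "entry": "2016-06"}]
table = link({"HMIS": hmis, "SHA": sha})
print([row["person_id"] for row in table])  # [0, 0] -- same person across systems
```

In the actual setup, the cleaning steps live in Puget and housing, the orchestration lives in the private AWS utility, and the output table is what the analysis reports are built on.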
So this is also a work in progress, and it's part of our collaborations: it's not just us providing data support and information, it's also providing training and education in how to collaborate with open source tools and technology. I'll show you a visual of how it works. Like I mentioned, there's the housing repository in R and there's Puget in Python. These are both open source repositories, so anyone can look them up and contribute if they like. There are some useful cleaning functions, for example, in housing, and we're now trying to build testing into the housing repository, so that's a work in progress. We also have our AWS tools. That's in a private repository, mostly because we're working with PII data and we wanted to keep everything secure and private. It essentially pulls the scripts it needs from Puget and housing and produces that final data table that we use. So at the end, it produces this analysis data table. This is reproducible and version controlled: we can go from raw data all the way to the data table we use for analysis. On the ground, it looks like this: we have the data we got from our pipeline, and initially we were using Jupyter, plus maybe some R scripts and Python scripts that created tables and visuals, which were then put into a document that was shareable with our stakeholders. In this case, that was either slides or a Google Doc, for example, and that was how our stakeholders would get the information. Luckily, we maintain a very good relationship with them.
We actually meet with them twice a month, and we talk about what they see in the report, what other things they want us to look at, and if they see some weird data, like number errors, they let us know. And that's good. It was working, but we needed something a little better. That's when we started leveraging R Markdown. We wanted a way to version control and reproduce the reports they were seeing on their end. R Markdown works seamlessly with R; it's essentially a format for creating markdown documents using R, and this way we can version control not only the code chunks that create the tables and visuals, but also the text that goes into these reports, so we can keep track of it all in a more version-controlled way. We were happy with that, but we thought: okay, that's great, but what if we actually put these reports into their own repository? Also private, because like I said, these are not meant to be public reports. We wanted a way for our stakeholders to interact with this open source, version-controlled pipeline and leverage the tools we're using. So we created a repository for the reports, gave our stakeholders access to it, and then held some trainings and talks about how to use the issue system in that repository. So now they have access to the report, they read it, and if they see something that maybe didn't come up during our meeting, maybe they read it again later, they create issues about things they see in our reports. They tag people: our team is six people, and we're all working on different questions that they ask.
We each work from different remotes of this repository, so they're able to tag the people working on their specific questions, and they can tag the lines where they have questions; maybe they see a table and think, these numbers don't make sense. Then we can go back to our pipeline and iterate on that. We push a button and it spits out the updated report. Well, maybe not just push a button: you pull, add, commit, push, and then it works. And it's a work in progress, like I said; our folks on the ground aren't used to interacting with these interfaces, so we need to bring them up to speed, and we take responsibility for that. We see it as our responsibility to make sure they're able to interact with these tools. And so there are, I guess, double-sided arrows on this diagram. So, final thoughts. There's no final report that we hand out, because of the nature of the data and the way it's iterative and really alive. There's a lot of turnover in the agencies, so new people come in with new questions, and we're happy to incorporate new questions. Communication is key and we take it very seriously. Our stakeholders provide the content expertise and the context of the data; we have a lot of technical skills, but we might not have the context or content expertise to really know what the data is about, and we take that really seriously. Education and training are a component of our collaboration. It just has to be, and we're happy with that. So, for example, last April we convened the group of analysts who are supposed to work with these data, and we adopted Software Carpentry lessons and turned them into data science lessons for administrative data.
We tweaked them to address some of the issues we had seen in the data, and that was round one. The Carpentry lessons were great because they're introductory. Now we're moving into more advanced, more focused lessons, also borrowing from that Software Carpentry lesson style, where we invite our collaborators and really focus on the training and the tools, so that they can contribute and use all of this when our grant is over. Like I said, our approach is a work in progress. And of course, there's the great team I work with. Bryna Hazelton is a research scientist at the eScience Institute; her background is in physics. Ariel Rokem is a data scientist, and his background is in neuroscience. Then we have Tim Thomas, who's a postdoctoral fellow at the eScience Institute in sociology; Breon Haskett, a PhD student in sociology; and Luke Rodriguez, a PhD student at the iSchool, the information school. It's a pretty amazing team, actually; I really learn a lot from everyone every time we meet, and we meet weekly. So this is my information. Thank you. Any questions?