I'm really excited to talk to you today about Hackweeks, but I know we've all been sitting for a while, so if you feel like standing for a moment, I'm going to take a page out of Alex's playbook and say, we've been sitting on those hard chairs for a couple of hours now. I also have this photo up here to kind of mentally cool you, given the heat of the day. Okay. And you can stay standing while I do my little intro here. I'm Micaela Parker. I'm the program coordinator for the MSDSEs, the Moore Sloan Data Science Environments, but I'm also a first-timer to csv,conf, so I want to say thank you to Danielle and the other organizers. This is a really fun meeting and I have learned a ton. I've met some amazing people and I'm just really grateful for being part of this group.
So if you are not familiar with the Moore Sloan Data Science Environments, we are a partnership that started in 2013 between two funding agencies, the Gordon and Betty Moore Foundation, represented by Chris Mentzel, and the Alfred P. Sloan Foundation, represented by Josh Greenberg. Those two foundations funded data science initiatives on three campuses: at UC Berkeley, the Berkeley Institute for Data Science, commonly known as BIDS; at the University of Washington, the eScience Institute; and at NYU, the Center for Data Science. The motivation for this partnership was really the recognition that as data is increasing in all fields and in all forms, even some of the very best researchers, especially on university campuses, were really struggling to generate knowledge and insight from these data. So there's essentially this gap between university domain research and data science best practice, and it was this gap that the partnership sought to address. So the mission was really to build bridges that would enable researchers to learn, use and teach data-intensive research practices. And the idea was really to create a feedback loop.
So as we bring these data science practices to the university researchers, this would enable them to make new discoveries. But that would also spur new questions and really push for new methods development to occur, and these would feed back and then enable even greater discovery. So the goal here was kind of this feedback loop.
Those of you that are familiar with the MSDSEs are also very familiar with this figure, and some of you may be sick of it by now, but I think it still illustrates really nicely the bridges that this initiative was trying to push through. They were loosely grouped into six themes: career paths and alternative metrics, education and training, software tools and environments, reproducibility and open science, which is something I know is important to many of you here, working spaces and culture, and ethnography and evaluation, and this last one was renamed data science studies. And today I'm going to talk about a program that really is at the heart of these three: education and training, software tools, and reproducibility and open science.
But the real authors and the real workhorses behind Hackweeks are on this slide. I'm just kind of presenting their work today. And a lot of the take-home messages can be found in the paper that I've cited at the bottom here, Huppenkothen et al. 2018 in PNAS. The authors' photos are up here. Daniela was the lead author, and there were representatives on this paper from all three of the MSDSE institutions, and Daniela herself is actually a cross-pollinator. She started as a postdoc at NYU and is now at the University of Washington with an affiliate position at the eScience Institute. I'm also highlighting, on the right-hand side, Christina and Nicoletta. These are the leads of the latest Hackweek to join the group, the Water Hackweek. And I'm going to be talking a little bit about how they've kind of evolved and taken some interesting approaches to change the Hackweek structure.
So, Hackweeks. This is essentially community building within domains. Many of you are familiar with hackathons or hackathon-style events. Hackweeks borrow from this concept, but they add a really strong component around learning and pedagogy. So in this respect they lie somewhere between a summer school and a traditional hackathon. The components of Hackweeks include, as I said, because of the learning, lots of tutorials on both introductory and state-of-the-art methodologies. But also, like more traditional hackathons, there is a component of participant-driven project work in a collaborative environment. And finally, as I alluded to, there is this emphasis on peer teaching and peer learning. And I think this is really what fosters the community in Hackweeks, because it enables conversations across technical abilities, across career stages, and it really catalyzes the group in this learning environment. And also, similarly to hackathons, Hackweeks take advantage of shared language and shared scientific objectives. Some of the goals of Hackweeks are to foster data analysis literacy within research domains, cultivate best practices, for example around reproducibility and open science, and of course develop resources for an existing domain-specific community. And the end result is that the folks that go through these Hackweeks end up building a network that lasts far beyond that week, and very often establish longer-term collaborations.
So the Hackweek as a model for teaching and learning really came out of the astronomy field in 2014. It was then picked up a couple of years later by the neuroscience community in the form of NeuroHackweek. And then the geospatial sciences also created a Hackweek, GeoHackweek. But since then, these Hackweeks have really grown and evolved. So AstroData Hackweek, as it used to be called, is now AstroHackweek. Last year it went international; it was hosted in the Netherlands.
This year they're going to be at Cambridge University in the UK. NeuroHackweek secured funding from NIH to become a two-week summer school called NeuroHackademy. And just last year, the oceanography community joined the fold with the first Ocean Hackweek. They're hosting another iteration this year. And this year, as I mentioned earlier, the latest domain to join Hackweeks is Water Hackweek. They focus on freshwater resources globally.
And the important thing here is that the Hackweek structure is really flexible and can be adapted to the needs of the different communities. So here I'm highlighting, I hope you can see this, the schedule for day one from GeoHackweek. In this particular domain, they were really interested in getting right into the project work. So the very first day, they spend some time on tutorials in GitHub, Jupyter Notebooks, and working in the cloud environment, but they're already pitching projects to work on that afternoon. In contrast, at NeuroHackademy, which admittedly is two weeks long, the entire first week is nothing but tutorials. They don't even start project work until the middle of the second week. And unfortunately for the format here, they're listed vertically instead of horizontally, so I hope you can read some of those. But you can see that there's definitely introductory material on the first day. Again, GitHub, but also introduction to Python and R. They've borrowed from the Carpentries, but then they quickly jump into more advanced topics. And as I mentioned, Water Hackweek started taking a slightly different approach. They recognized that their community had really diverse and also very specific and technical types of tools that they wanted their participants to have a grounding in before they began. So what they offered was a series of training seminars as pre-work. The dates are at the end of those lines. Water Hackweek was held at the end of March.
So this is almost two months of opportunities to learn and get a grounding in some basic tools before the community met and worked through the projects during the Hackweek.
So now I want to spend just a moment on participant diversity, because this is something that's come up a lot in this conference, and I think it's something that we all share a real passion for. As these Hackweeks grow, there are often more applicants to the Hackweek than there are spots. It's really important for Hackweeks to maintain a small community; 40 to 60 is about the ideal size. But some of these Hackweek organizers are receiving applications on the order of 100 to 200. And so thinking about how you choose participants when you want to maximize diversity in a Hackweek becomes kind of a hard cognitive load. So Daniela Huppenkothen, the one who's also the first author on the PNAS paper, wrote this tool called Entrofy, and it's available on GitHub and really well documented. And unfortunately, when I told her I was going to talk about this today, she said, oh, I've got to get the paper written. This is great. This is great incentive. And I felt terrible, because I was pushing her now to throw a paper out as fast as possible. She even asked me what time my talk was today. And I checked; I don't have the arXiv link yet, but she did submit it, and it should be under the Computers and Society section. So the idea with Entrofy is to take participant selection and treat it as a discrete optimization problem. Again, this really relieves the organizers of that cognitive load of trying to juggle all the different attributes and factors of the people you might want to invite to your Hackweek. Daniela also wanted me to make sure to say that Entrofy doesn't make participant selection fair. It just lowers the cognitive load. It's still on the user to define the attributes of the community that you want to include.
For example, junior versus senior researchers, experts from various disciplines, race, ethnicity, gender. And then the user also sets the target proportions that they want to include.
So I'm going to give you just a real quick example, and I don't know how well you can see that. This is a plot showing the output from Entrofy when it was used for a conference called Python in Astronomy. And again, there's a GitHub link there and it's all very well documented; all of the participant selection is completely transparent. In this case, the plot is just showing geography. So where did the applicants apply from? The organizers decided to group geography into three loose bins: North America, Western Europe, and all others. I realize you can't see the y-axis, but it's a fraction. The blue bars are all of the applications they received, and the green bars are the recommended population that Entrofy returned. The black line along the top there is what the organizers set as their target. And again, I know you can't see that, but they arbitrarily set it at 33%. So they wanted a third from North America, a third from Western Europe, and a third from all other countries. And you can see Entrofy did a pretty decent job. Very often, though, it won't hit those targets exactly because, of course, it's also trying to optimize on a lot of other attributes. And so the user can also say, to me, it is more important that we have equal gender diversity versus some other attribute. So the user not only selects the proportions they want in the final population, but also the weight of importance of those attributes.
So that brings me to choosing the right targets. Third, third, third from North America, Western Europe, and others may seem kind of arbitrary, which is why I really like what Water Hackweek has chosen to do.
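To make the idea of treating participant selection as a discrete optimization problem a little more concrete, here is a minimal sketch. It is not Entrofy's algorithm or API: the function names (`mismatch`, `greedy_select`), the squared-gap objective, the greedy strategy, and the applicant data are all assumptions made purely for illustration. It takes user-defined target fractions and per-attribute weights, as described in the talk, and picks a cohort that tries to match them.

```python
from collections import Counter


def mismatch(selected, targets, weights):
    """Weighted squared gap between selected fractions and target fractions.

    Illustrative objective only (an assumption, not Entrofy's objective).
    """
    if not selected:
        return float("inf")
    total = 0.0
    for attr, target in targets.items():
        counts = Counter(person[attr] for person in selected)
        for value, goal in target.items():
            frac = counts[value] / len(selected)
            total += weights.get(attr, 1.0) * (frac - goal) ** 2
    return total


def greedy_select(applicants, targets, weights, k):
    """Pick k applicants one at a time, each time adding the applicant
    that leaves the cohort closest to the targets."""
    selected, pool = [], list(applicants)
    while pool and len(selected) < k:
        best = min(pool, key=lambda p: mismatch(selected + [p], targets, weights))
        selected.append(best)
        pool.remove(best)
    return selected


# Hypothetical applicant pool: two attributes per applicant.
applicants = (
    [{"region": "North America", "stage": "junior"}] * 5
    + [{"region": "North America", "stage": "senior"}] * 3
    + [{"region": "Western Europe", "stage": "junior"}] * 2
    + [{"region": "Western Europe", "stage": "senior"}] * 2
    + [{"region": "other", "stage": "junior"}] * 2
    + [{"region": "other", "stage": "senior"}] * 1
)

# The user defines the target proportions for each attribute...
targets = {
    "region": {"North America": 1 / 3, "Western Europe": 1 / 3, "other": 1 / 3},
    "stage": {"junior": 0.5, "senior": 0.5},
}
# ...and how much each attribute matters relative to the others.
weights = {"region": 2.0, "stage": 1.0}

chosen = greedy_select(applicants, targets, weights, k=6)
print(Counter(p["region"] for p in chosen))
print(Counter(p["stage"] for p in chosen))
```

As the talk stresses, a procedure like this only lowers the bookkeeping burden. The result is only as fair as the attributes, targets, and weights the organizers choose.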
And they say here that accelerating scientific research and innovation on complex global water challenges requires a workforce with the diversity of the global population. I thought that was really cool. So they essentially set all of their attribute proportions based on the global population. So I encourage you to check out Daniela's GitHub repo and try that tool yourself.
Okay, so now I want to move on to how successful these Hackweeks are at their goals. Those original three Hackweeks, AstroHackweek, NeuroHackweek, and GeoHackweek, ran a series of identical exit surveys. I'm going to show you some of the results. This first set of results focuses on how much, from the participants' viewpoint, the Hackweeks helped them in their science. So I'm going to ground you in these plots first. The questions were phrased so that there's a statement along the top, and then the participant would either agree or disagree. And I can see you squinting, so I'm going to explain. On the far left are folks that strongly agree with the statement. Far right is strongly disagree. And then there is one box at the very far right that is "I don't know." The y-axis is the fraction of responses. And I'll go ahead and read them across, but I wanted to ground you in these plots first, because I'm actually going to show you the data for each separate Hackweek, but what I'd rather for this talk is that you just kind of get a sense of the general trend. So for example, in the first box, "I hacked on topics, tools, or methods that were very new to me." In general, most of the participants kind of agreed with this. There were a few that disagreed. In the center plot, "I believe the Hackweek helped me be a better scientist." There's a strong positive signal there. And then, "I feel like I learned things which improve my day-to-day research."
So again, there's a strong positive signal there, suggesting these Hackweeks are helping the researchers in their actual science, in doing their work.
But in addition, one of the goals of the Hackweeks was best practices, and in particular, we're curious about working open. So in this case, the plot on the left says, "I'm embarrassed to put my code and data online." And there's a fairly even distribution here. So as you can see, there's still a sentiment that people are genuinely embarrassed, as many of us may be, to put their code and data online. Similarly, in the middle plot, "I am afraid that if I put my code and data public, I will be scooped." Again, a lot of the responses disagreed with this statement, but there are still some that agreed with it. But now what's interesting is if you contrast that with this question, "I feel scientists have an obligation to make their code and data public." There's strong agreement with this sentiment. So the researchers in these Hackweeks genuinely feel like it is their obligation to make their code and data public, but they're still hesitant about perhaps how to do it, or they're embarrassed. And so I think there's an opportunity here to help shepherd this community toward those reproducible and open science tools. And during the Hackweek, they definitely got some of those best practices and that help. So in the left panel here, "I put code and/or data I created at Hackweek up on GitHub or another public repo." Definitely many of them did this. "I feel like Hackweek made me more comfortable with doing open science." Strong agreement there too. So not only are Hackweeks helping researchers with their actual research, doing better science, but we are also slowly incentivizing them and convincing them to work openly and to adopt some reproducible best practices. As I mentioned earlier, all of these data and more are available in this paper in PNAS.
And before I wrap up, I want to also talk about a flip side to this, which is community learning across domains. Hackweeks, as I mentioned, are really about bringing a community together and learning within a domain, with shared language and shared scientific objectives. XD communities, XD for across domains, are opportunities for researchers from all different domains to come together and work on similar problems, together identifying common principles, algorithms, and tools. So it's a much more methods-focused kind of community building. XD communities largely exist at Berkeley, but they have been hosted elsewhere. The XD communities and groups host seminars and write blogs, but importantly they also run workshops based heavily on the pedagogy from the Hackweeks. They're shorter, two to three days, but they also include, just like Hackweeks, tutorials, make sessions, and in this case also talks by experts.
So in the inaugural Hackweek, sorry, I'm used to saying Hackweek, in the inaugural ImageXD, which is image processing across domains and which was in 2016, 50 researchers from 11 different institutions came together to work specifically on image processing problems. I just listed a number of the different fields that they came from, and you can see there's everything from computer vision to earth science, neuroscience, and astronomy. So they're coming together around the tool rather than around the domain. And this concept is also expanding. There's now TextXD for text analysis, and GraphXD. Both of these were held more recently. Some of the example outcomes from these workshops include blueprints for open source image processing, in the case of ImageXD, training sets for ML applications, and of course analysis projects.
So the takeaway here is that informal, intensive, community-driven learning, which is a lot of words, like Hackweeks and XD workshops, quickly and effectively brings data science to campus researchers. And importantly, it emphasizes that data science is for and by everyone. It is not owned by computer science, or computer science plus statistics. Anyone from any field can contribute to the advancement of data science.
So my last slide is a plug. I'm the coordinator for the Moore Sloan Data Science Environments, and that partnership is starting to wind down, so I'm starting to think, what is the next thing? I continue to be really focused on and heavily invested in academic data science, and I think community building for academic data science still serves a great need. And shamelessly, I actually just want an excuse to continue seeing all these wonderful smiling faces behind me, which are the folks who attended the last MSDSE annual summit. So for my next initiative, which is currently tentatively called the Academic Data Science Alliance, or ADSA, I'm hiring a community coordinator and a program assistant. You can ask me about that by going to adsupply at msdse.org. I also left job descriptions on the table outside. They're not posted online yet, but they will be soon. So shameless plug, I am hiring. Come talk to me if you want to work in this space. And I'll take any questions. Thank you.