 Thanks very much for coming. My name is Alex Storr and I'm the Director of Data Analytics and Research Computing at the Stanford Graduate School of Business. I'm Christina Maymon. I'm the Research Data Services Lead with Research Computing Services at Northwestern, and we're here to talk to you today about a data science support service: what it entails, why social scientists need such a service, and what we've learned about making this type of service successful. So to start with, we're going to talk about an example researcher who we're going to call Sarah, and she has a research goal. She wants to compute political polarization between parties in the North Carolina State Legislature before and after the Civil War. She knows that you can compute political polarization if you can get roll call voting data from the legislature, right? Who voted yes or no on which bill? It's a standard political science methodology. So she sets out to get this data. After a good amount of research, she finds it. It's in the records of the State Assembly: each year is a separate PDF, just sort of this long discourse of everything about the legislature for the year. She looks through the files and sees that the actual voting data, interspersed in a lot of text, will sort of be a paragraph with a list of mostly last names, the people who voted for a given bill. So the data she needs is there, but what she needs is a spreadsheet, right? To feed in to get the political polarization scores and figure out where each of the members of the legislature fits on a political dimension. If Sarah and her RA try to extract the votes from the files into a spreadsheet by hand, it's going to take a really long time, right? And they're probably going to make a lot of mistakes. If she outsources the work to a transcription company, you're usually talking about five cents per vote, per line in the spreadsheet. So that's $50,000 in this case.
So a researcher like Sarah knows that there are ways to extract text from documents like this, but she doesn't really know where to begin. She's got a background in quantitative methods and knows how to model the data, but her programming experience is limited. So should she take the time to learn how to do this herself? Should she try and find an RA with the technical skills? Should she get a grant to pay for the transcription? Is this project even worth it, when she doesn't know what she's going to find and it's clearly going to take a lot of time and effort in some form? And this really is the problem that we're here to talk about. Working with big, diverse collections of data is just hard. It's hard in a general sense, not just for Sarah and not just for social scientists. It requires specialized skills to do well. Social scientists aren't the only ones who need support; there are researchers in other fields who need data science support as well. But social science is really in a transformative period where there's a lot more data available than there ever used to be, and there are still really big barriers to incorporating that data into research. Interesting, innovative projects are not getting done because of these technical obstacles. And this, fundamentally, is why we need data science support available to researchers as a standard service. If your institution wants to support innovative, impactful social science research, social scientists need a reliable way to get assistance with the challenges of working with data. So what does this mean? First, what do we mean by data science? It's a term that gets used in a lot of different ways. We like a broad definition of data science that includes all phases of working with data. It starts with data gathering, preparation, and exploration: collecting the data, getting it into the right format, starting to see what's there. What does it look like? How messy is it?
Data representation and transformation includes deciding how data is stored and how different pieces of data are related to each other. This can include deciding what type of database may be needed and determining the database schema, but it can also include deciding how to encode or represent information. So if you have text data, how are you going to transform that into variables and features that you can summarize and model? Computing with data includes a wide range of activities, from writing scripts to accomplish other tasks, to choosing the appropriate computational resources for the model you're running, to developing software packages to share your algorithms or your functions. And you often need multiple different programming languages for different parts of the process. Data modeling includes traditional statistical modeling, where you care about the estimated coefficients and what the model looks like. It also includes predictive modeling, usually referred to as machine learning, where you care more about the outcomes and how well the model can predict on a separate data set. In data science in general, we'd also include things like data visualization and building tools to work with data. So we have the data science part. What does this look like as a service? Thanks, Christina. So if we go back to Sarah, and we think about what it would be like for Sarah to engage the service on her North Carolina voting project, we'll see how it starts. First, she has to get in touch with the service. So in this case, she might email us, or she might fill out some sort of intake form, so that the service and the people who staff it know what the request looks like. Then the project will get assigned to somebody, and how that works will depend on how many people are on the team, what their skills are, and how full their plates are.
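To make the "text into variables and features" step Christina mentioned concrete, here's a tiny stdlib-only sketch of one common representation, a document-term count matrix. It's an illustration, not the approach used in any particular project; in practice you'd likely reach for a library vectorizer.

```python
# Turn raw documents into a document-term count matrix that standard
# models can consume. Pure stdlib; just shows the shape of the transform.
from collections import Counter

def vectorize(docs):
    """Return (vocabulary, count matrix) for a list of text documents."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    counts = [Counter(doc.lower().split()) for doc in docs]
    rows = [[c[word] for word in vocab] for c in counts]
    return vocab, rows

vocab, rows = vectorize(["the bill passed", "the bill failed the house"])
print(vocab)  # alphabetical vocabulary across both documents
print(rows)   # one row of word counts per document
```

Once text is in this kind of matrix form, the researcher can hand it off to whatever summary or model their field expects.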
Then the staff member who's assigned to this project will write a clear description of their engagement in the project. This way, there can be an understanding of what the service will and won't provide, a general timeframe, and an indication of what's still needed in order to start. So now I'm going to walk through some of the steps for this particular project, kind of under the hood, to see what had to be done to get to the finish line. We started by writing code to download these documents. And the North Carolina State Assembly webpage is a little bit tricky, so you have to sort of reverse engineer how to download those documents. Then once you have them, you have to figure out how you're going to store them. If you're downloading 100 documents, it's kind of no big deal. But if you're downloading a million pages, you really want to think about how they're going to be stored so you'll be able to do some sort of computing with them. And when you download a million things, they're not all going to download successfully. So you have to keep track of that and make sure that you end up with a complete data set to work with. Then once you've downloaded everything, you have all of these pages, but not every page has the data you're looking for on it. So you have to identify the pages with the data, and then which part of each page has the specific roll call vote text on it. But then that's not enough either, because you have to create a clean data set in a rationalized format. So Christina showed you the snippet of the roll call votes, but each one of those names is not actually a unique identifier for a legislator over a hundred-plus-year period. So you have to build a database of what's in this data: the votes, which bills they were on, the legislators, who they are, and which sessions they served in.
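The bookkeeping for that download step, keeping track of which of a million fetches succeeded so you can retry to completeness, can be sketched roughly like this. The `fetch` function is injected as a stand-in so the logic is shown without touching the real State Assembly site; all names here are illustrative, not the project's actual code.

```python
# Try every URL once, recording successes and failures separately so the
# failed batch can be retried later until the collection is complete.

def download_all(urls, fetch):
    """Return (results, failed); `fetch` is any callable that may raise."""
    results, failed = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:
            failed.append((url, str(exc)))  # keep the reason for the log
    return results, failed

def fake_fetch(url):
    """Toy stand-in for a real HTTP download."""
    if "bad" in url:
        raise IOError("HTTP 503")
    return b"PDF bytes for " + url.encode()

ok, failed = download_all(["a.pdf", "bad.pdf", "c.pdf"], fake_fetch)
print(sorted(ok), failed)
```

The same separation (results here, failures there, with reasons) is what lets you verify you really do have a complete data set before moving on.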
And then once you have all of that data cleaned up, you're able to use this legacy FORTRAN software that's used in the political science field to compute political polarization. So in this case, you have your clean data, but the way it has to be represented to go into the software is non-trivial, especially if you're not used to working with some of these more esoteric data formats and you don't like reading documentation for FORTRAN code. Once we get to the end of this project, the service is going to deliver a professional summary of what was done. That's going to include the data, the code, and the results. It's also going to highlight any potential issues that came up along the way: things that might have been mistakes, things that weren't able to be downloaded, reasons the data set isn't perfect. And then there's an opportunity to give feedback on the service to make sure that we're able to provide a high level of support. For this whole process, engaging with us, it took about 120 hours of human time. There was a lot of work in terms of the computer doing stuff, but that sort of doesn't count. And if you compare this to using undergraduate RAs or outsourcing it, we're able to provide this sort of data set a lot more quickly, probably more accurately, and definitely for less money once the service is employed. So we've talked about Sarah's case and why in this one circumstance it's valuable for her to engage a service. But in general, I think it's really valuable for such a service to exist, so we're going to step through some of the reasons and make it really clear why we think that is. One value to the institution of having a service is that we can tackle similar problems again more efficiently.
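To give a flavor of the "esoteric input format" problem, here's a rough sketch of serializing a legislator-by-vote matrix into fixed-width text, assuming NOMINATE-style vote codes (1 = yea, 6 = nay, 9 = missing). The column widths and ID scheme are assumptions made for the sketch; the real layout has to come from the legacy software's own documentation.

```python
# Serialize clean roll-call data into a fixed-width layout: one line per
# legislator, a padded id, a padded name, then one vote code per bill.
# Widths and codes here are illustrative, not the actual legacy spec.

VOTE_CODES = {"yea": "1", "nay": "6", None: "9"}

def to_fixed_width(legislators, votes):
    """legislators: list of (id, name); votes: {(leg_id, vote_idx): 'yea'/'nay'}."""
    n_votes = 1 + max(v for _, v in votes) if votes else 0
    lines = []
    for leg_id, name in legislators:
        codes = "".join(
            VOTE_CODES.get(votes.get((leg_id, j)), "9") for j in range(n_votes)
        )
        # id right-padded to 5 chars, name left-justified in 12, then codes
        lines.append(f"{leg_id:>5}{name[:12]:<12}{codes}")
    return "\n".join(lines)

legs = [(1, "ALLEN"), (2, "BARNES")]
votes = {(1, 0): "yea", (2, 0): "nay", (1, 1): "yea"}
print(to_fixed_width(legs, votes))
```

The point is less this particular layout than the habit: one small, tested serializer between your clean database and the opaque legacy tool, so format bugs surface in your code rather than in FORTRAN error messages.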
So instead of having dozens of researchers all solving the same problem on their own over and over again, we have staff who are building expertise. We're using less time overall to do this, and using less researcher time, when researchers are supposed to be teaching classes and writing, you know, grants and papers. Alex and I work at different schools, but when we discuss a project, we can quickly see the same types of needs and challenges that we've seen in other projects before. They have common components. So for example, one step in the North Carolina project was the data gathering stage. And this is a general class of problem that includes web scraping, collecting data from APIs, or systematically downloading data from databases. It's one of the most common types of problems that we help people with. These projects require similar skills and steps regardless of the subject matter. We've applied these skills to diverse sources of data, ranging from financial filings about public companies, to movie reviews, to newsletters that members of Congress email their constituents. They all have the same form of problem. Similarly, another step in our process was converting the text into data in a format that the researcher can analyze with the methods of their field. This involves identifying the relevant parts of the documents with the information of interest, extracting and standardizing that data, and then converting it to a format the researcher can use. Again, this applies across many different types of data. There are always challenges, unique cases, and specific details, but understanding the general form of the problem, the likely issues, and the steps you have to go through makes it more efficient for somebody who's done this type of problem before to tackle a similar one in the future. Right, so if you have this sort of high performing professional service, you can really expand the scope and ambition of researchers and what they're able to do.
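The "identify the relevant parts of the documents" step can be sketched with a simple pattern match. The regex and the sample passage below are illustrative assumptions about how roll-call paragraphs might read, not the actual patterns from the North Carolina project.

```python
import re

# Illustrative assumption: roll-call passages say "voted in the
# affirmative/negative ... are" followed by a comma-separated name list.
ROLL_CALL_RE = re.compile(
    r"voted in the (affirmative|negative)[^.]*?are[,:]?\s+(?P<names>[A-Z][^.]+)\.",
    re.IGNORECASE,
)

def extract_votes(page_text):
    """Return (side, [names]) tuples for each roll-call passage on a page."""
    results = []
    for m in ROLL_CALL_RE.finditer(page_text):
        side = m.group(1).lower()
        # Names are mostly last names separated by commas and "and".
        names = re.split(r",\s*|\s+and\s+", m.group("names").strip())
        results.append((side, [n.strip() for n in names if n.strip()]))
    return results

page = ("On the passage of the bill, those who voted in the affirmative "
        "are Allen, Barnes and Caldwell. Those who voted in the negative "
        "are Davis and Eaton.")
print(extract_votes(page))
```

A pattern like this also doubles as the page filter: any page with zero matches is one you can set aside, which is exactly the winnowing step described above.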
So we can come up with these effective solutions, we can do it pretty quickly, and we can deploy them on large data sets. What that means is that if you are trying to code Facebook ads, for example, instead of hiring an undergraduate to watch Facebook ads and code them (are there people in this frame? you know, does this have Donald Trump in it or not?), you can instead download all of these ads and figure out how to send them to a cloud service, or do some sort of machine learning on them locally, to extract the information you need from all of them. So it can be much more efficient. And then, of course, when you figure out new things as new challenges come along, you've integrated those skills and you can deploy them really quickly. Christina mentioned briefly that the job of social scientists is not to be data scientists; it's to be social scientists. There are all sorts of things that a good social scientist has to do. They have to keep up with the literature in their field. They have to, you know, advise. They have to be part of committees. They have to be good scholars, and part of being a good scholar is not going to be debugging FORTRAN code, unless that's your field of interest. So it's really nice to have a service that can take this aspect of your research problem that maybe isn't that relevant to the substance of what you're researching, and extract the data in a way that makes it easy for you to pick it up and move on. Having a professional service also means that the researcher doesn't have to try and hire a magical RA or postdoc or staff member who knows all of the necessary skills for their project. They don't have to become an employer, right? They don't have to write a job description, advertise it, and interview and vet candidates for skills that they themselves don't have, which is a difficult task. And they don't have to manage those people when they show up.
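The Facebook ads example boils down to running a classifier over every frame and aggregating up to an ad-level code. Here's a minimal sketch with the classifier injected as a stand-in, since the real model or cloud service would vary by project; the toy detector and sample data are assumptions, not real ads.

```python
# Replace manual coding with an automated pass: classify every frame of
# every ad, then aggregate to an ad-level label ("does any frame contain
# the target person?"). The classifier is pluggable on purpose.

def code_ads(ads, classify_frame):
    """ads: {ad_id: [frames]}; returns {ad_id: True if any frame is positive}."""
    return {
        ad_id: any(classify_frame(frame) for frame in frames)
        for ad_id, frames in ads.items()
    }

def detector(frame):
    """Toy stand-in: frames are strings and the 'model' looks for a tag."""
    return "trump" in frame.lower()

ads = {"ad1": ["crowd", "Trump at podium"], "ad2": ["beach", "logo"]}
print(code_ads(ads, detector))
```

Because the classifier is just a function argument, swapping the toy detector for a local model or a cloud vision call doesn't change the coding pipeline at all.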
And they don't have to pick up the pieces when that student or postdoc leaves in a year or two and try and figure out how to transition the project to somebody else. They're engaging with a service with reliable, experienced staff instead. Beyond not having to hire people, it also means that researchers can get help with a wider variety of projects. It doesn't have to take the form of a project where you need one person to support it. It can be a smaller project, where all of that hiring overhead just wouldn't be worth it. It could be a larger project, where multiple people might be needed over a short period, and that might make a difference. Yeah, so we've talked about why having a service in general is really valuable, but I think it's also really useful to have a team. So let's walk through some of what we see as the valuable pieces of having a whole team of people rather than just an individual. By having more people, you can have people coming from different backgrounds and perspectives. Part of that is that you might bring different kinds of technological expertise. You can have people who are great at optimizing code. You can have people who are wizards at GIS stuff. But you can also have people who come from domain backgrounds that can prove really relevant. So Christina was just talking to me about a project where she had to collect Russian politics information. If there were a Russian scholar on the team, that would be super valuable. Obviously you can't have everybody, but having a team of people broadens the scope of what people have seen and what they're able to do. The other side of not needing one person who knows everything is that there isn't just one person who knows everything.
You know, what happens if that one person working on your project is sick, or leaves, or is just busy? When you have an entire team supporting the service, and you have procedures in place for documentation and sharing information about your projects, that isn't as much of a concern. Having a team also means that there are people around to collaborate with and get feedback from. This is really important. As a data scientist working in an office all by yourself, you can get stuck on a problem for a good week if you don't have anybody else to work through it with, or to ask what their experience has been with a similar issue before. Collaboration leads to higher quality projects, and more efficient ones, because you don't have people just getting stuck with a problem on their own. When there are multiple people in the same role, this also usually means that there's a position for a team leader, a manager. Some of this is just pure logistical benefit, right, of having somebody who is managing the folks directly so they aren't working for the researchers. But it should be more than that. A good manager is going to make sure that the team is pursuing the right strategies and making smart decisions about projects. They're going to ensure that those people are developing, that they're getting the technical skills and the professional skills that they need to provide better service. And there's going to be somebody who is developing networks and relationships within the university, right, who's connected to other services and people, and who's looking ahead to make sure that what the service is offering matches the needs of the researchers as they evolve. Okay, so we talked about why the service is valuable, and why it's nice to have a bunch of people on your team who are able to staff this. But there are still lots of questions about how it's going to be implemented, and this is going to vary on a number of important dimensions.
So there's a question of what this service does, exactly. You could have a service that trains people, to try and elevate skills where you have experts in certain things. You could have short consultations where people can come and ask for help. You could be part of a project for kind of a short period of time, or you could be on a project for a pretty long period of time. All of those fit into the scope of this kind of professional data science service. In terms of who the clients are, there are questions about whether you help every single person at the school. Do you help students who are having problems in their classes? Do you help graduate students who are working on their research projects? Do you help postdocs who are really committed to a single project for a long period? Or do you only help faculty members? Furthermore, there are questions about who you want to make this service available to in terms of domain. It might be available to one department, it might be available to one school, or it might not be bounded by that and it might be for a whole discipline. You could imagine a GIS consulting service which is available to everyone with GIS problems, but isn't just for somebody in the geography department. And then of course you could make it general, available to everyone at the entire university. Then there's the question of who pays for this service, and there are options here. It could be centrally supported: somebody at the library, or in research computing, or in some other organization could foot the bill. It could be a fee-for-service model, where you charge the researcher back based on the length of the engagement. Or the service could be written into grants to help support it over time. So then you have to find people, and there are questions about who those people are going to be.
So they could be professional staff members who work full time on this sort of data science team. You could engage contractors for short periods if you have to expand what you can do. Or you could have students who are involved, especially for these short-term consultations, as part of their growth and education and also to provide those skills. So I work at the graduate school of business, and for us, we mostly do consultation-style projects. We want our projects to always have a clear beginning and a clear end. We try not to be involved in anything for all that long, so that we can keep providing this consultation service rather than getting bogged down, given the resources that we have. Our mandate is just to help faculty members. Of course this gets really fuzzy when faculty members say, oh yeah, help my graduate student, it's my project. That's a different question. And we help only people at the graduate school of business. There are 125 faculty members at the GSB, in all fields in the social sciences, so it still feels pretty broad to us. But if you have to support the whole university, I think it feels pretty narrow. Our funding is centrally supported, one of the benefits of being at a business school. And in terms of our staff, we have almost all professional staff, with some temps and outsourcing that we use when there's some extra data collection that needs to happen or something like that. So Kellogg, Northwestern's business school, has a research support department similar to the Stanford business school's, but I work with the university-wide research computing group, which is part of IT, and we provide support to the entire university. Because we are providing support across all of these fields, and from students to faculty, our model looks different. Like the business school, our medium and sort of long-term project support, anything more than about eight hours of effort as a rough cutoff, is for faculty.
But we do consultations and training workshops and online training for students, postdocs, anyone. And again, we're focused on researchers here, because we're research computing, so we're not helping people with classwork. But anyone from undergraduate students working on, you know, their honors theses, through graduate students and faculty researchers, is included. Teaching workshops helps us expand the number of people we can help with limited staff, and we find that we're in a unique position to provide training that's not available to the students elsewhere in their curriculums or in their departments. Because we're supporting a larger set of people, we do charge fees for some of our longer-term engagements or specialized projects, or we ask the faculty to find grant funding for the project. And in addition to professional staff, we have student workers who provide consultation services to other students as well. Right, so both Christina and I have been professional staff on these teams, and we've also had the opportunity to lead some of these teams. So we want to walk through what we think it takes for this service to be successful. The job itself is actually pretty hard, so there are important skills that you want people to have when they come to work on a team like this. They have to understand academic research. They also have to be able to collaborate with faculty members, which I think is a completely different skill from understanding academic research. They have to have outstanding technical skills. They have to be extremely professional; you can't just be, you know, a developer, because you have to interface with people. And they have to work with uncertainty and figure out lots of diverse things. The job changes a lot. The projects are never going to be the same, and there are going to be tons of new challenges to figure out.
So I talked about these magical right people, and we're not necessarily the magical right people, but since we've done the job we'll tell you a little bit about our backgrounds. In undergrad I did a social science research major and a computer science minor. I've been programming since I was little, but I got interested in social science in college and combined the two. I went to grad school for political science, where I focused on political methodology and statistics. And then after grad school I knew I did not want to be faculty, and I worked for a computational social science firm for about eight years before I found my way to this type of work. I did my undergrad in computer science and cognitive science, and then I got my PhD in computational neuroscience, where I learned how to read a bunch of papers, work with faculty members, and do lots of really esoteric data stuff, scraping brain websites, things like that. I built my skills out of that and then transitioned into social science afterwards, in this kind of role on one of these teams. So both of us have PhDs, and it's valuable to have a PhD, but it's definitely not necessary, and we've worked with people who have gotten this research expertise either in industry or by doing RAships with faculty members on other research projects. There aren't that many people with both the data science expertise and the research experience that you really want to have in this role, and these are skills that are in high demand. You're asking these folks to do complex work supporting cutting-edge research, essentially helping faculty with the tough problems in their research that they can't handle themselves. So if you want to attract and retain these folks, it's really important to respect them. There was a talk yesterday from Patrick here about the turnover in research computing roles; it can be pretty high. So if you get somebody good, you don't want them to go away.
This means that you really do need to provide career pathways for folks, with stable employment, you know, competitive compensation, and opportunities to grow. This is not, hey, here's a one-year postdoc, here's a one-year temporary position, and we're going to try and get you as cheap as possible, because those folks are not going to stick around. They're going to go get a data science job in industry. Moving beyond the people on the team, we've found that good project scoping and project management practices are really key to service success. It sounds really obvious, but we've found that the fundamental practices for this are often missing, and they're not trivial to get right. The details here are a whole other talk, actually, that I'm working on. But why are these practices so important? Properly evaluating and scoping a project is really important to avoid scope creep, right? Research is messy and uncertain and goes in unexpected directions, so scoping a project, and being willing to abandon a project and start a new one when the research direction changes, is important. Actively managing a project helps build trust with researchers: being clear in your communication about what you need from them, and giving them the information that they need. And tracking project metrics helps you ensure that the service is going to remain sustainable and impactful, and that you can communicate what it is that you're doing and why the funding that you're getting should continue to come your way. Right, so your service is also going to have to stay dynamic, because the work, by its nature, is continuously changing.
So by providing this sort of service in a continuously changing environment, it has to be super dynamic, so your people are going to need time and space to figure out new things, try new techniques, read papers, go to conferences, learn what is state of the art, and kind of cast a wide net. And then of course the things that you are supporting might shift over time as new techniques come out. Maybe people really want to do deep learning models and want to use GPUs to train them. So somebody on a team like this could figure out a good workflow for doing that and then reproduce it in the future. Or, you know, you want to do text mining, but what if you want to do text mining on medical records? Suddenly it's a completely different problem, and you'd like to be prepared for this sort of thing as it comes up, so that you're not doing a year's worth of infrastructure planning after the fact. So being dynamic is super important. Furthermore, your service is only successful if people are going to use it, and the only way they can use it is if they know that it exists. So you have to tell people what it is that you do, and you have to do this over and over again; it's actually really pretty hard. You have to kind of tell the story of what it is that you do, because if you talk to a faculty member and you say, oh yeah, we help with research, they're like, whatever, I don't know what you do, you're not going to help me. But you walk through a project and suddenly the light bulbs start to go off, and they say, okay, this part is really hard, you could just do that part and then come back to me? Oh, that would be really useful. So you need to have a communication plan for how you're going to get the word out about your service. Furthermore, you have to make sure that the communication goes both ways.
So if faculty members are saying, we are really looking for somebody to do XYZ and you guys aren't doing it, it'd be nice to incorporate that into your service rather than just providing a static set of things over time. Then you have to be able to engage with other people. Outside of your own unit, you have to be aware of the evolving policies and technology that are available on your campus. If you're going to do text analysis on medical records, is there a place on campus where you can store high-risk data? Is there a place on campus where you can do analytics on high-risk data? That's really good to know if you're going to have to use it, and you can only know that by communicating really effectively with the people on the ground at your school. And then if you saw the plenary opening session yesterday, you know that it's really important to be generous with your skills. I think there's a lot of leveling up that can happen on a team like this, and you can generate lots of useful stuff: useful frameworks, useful code. It's nice to be able to share that back to the academic community in whatever way is effective for your group. All right, we're near the end, with a couple of takeaways if you're going to remember a few things about our presentation. First, there's lots of data in the world, but working with it is hard. This is a challenge for social science that is only going to continue to grow. Second, having a professional service for data science support is efficient and enables innovative research, without clients getting bogged down in hiring assistants or learning specialized technology. Finally, providing the service requires that you attract and retain data scientists with research experience. These folks are looking for professional positions with opportunities to grow and use their skills. This isn't a service that can realistically be provided only by students or postdocs.