My name is Sucheta, and I am a third-year PhD student in the iSchool at Syracuse University. I'm going to present work that I did with Professor Jeff Saltz over a span of a couple of years. It's related to a risk management framework for data science projects. What timing for this presentation, right after lunch, but I will try to keep you awake. It's a pretty short and sweet presentation.
So before I delve further into the project, let's take a minute or two to understand what risk is. Let's consider a few situations that may involve risk. You want to open a new restaurant, or maybe launch a gym. You want to go skydiving or, let's say, you want to ask your manager for a hike or a bonus. Now, all these situations are disparate, but if you come to think of it, they have something in common. What do they have in common? We have the exposure to a situation, and we have the uncertainty.
In terms of the conceptualization of risk, we can say there are three elements. First is the undesirable effects, the outcomes that may be detrimental to humankind and the ecosystem. Second is the probability of occurrence, whether those undesirable effects would happen or not. And third is the state of reality that we have set for ourselves in a given situation. So we have these three elements, but these three elements stand in juxtaposition with the context we are living in. How we define risk depends upon where we are coming from. We might be coming from the finance sector, from real estate, or from supply chain management. So the definition, or the operationalization, of risk keeps on changing.
So we ask ourselves three questions here. First, how do we measure those uncertainties, or threats, or risks? Second, what are those threats? Seriously, what are those threats? And third, how do we define the reality here? What is the concept of reality? Do we see reality through the protocols that we have set in industry terms, or is it related to the project, or is it related to the people? So that also requires some sort of clarity. By and large, we can say that risk has a cause-and-effect pair. We are talking about the threat, which is actually the cause, and then, when the threat is realized, we have the consequence, or the impact.
Now, we wanted to understand what those threats are in data science projects. Right off the bat, we could think of some of the popular, commonly occurring risks we see in the market. One example we could think of was the Target incident. A woman shopper goes to Target, and she is pregnant. With the help of big data predictive analysis, Target predicts that this woman is pregnant, and they send marketing material to her family residence, and the woman had not revealed this news to her family members. So with this example we can gauge that there is a privacy risk, there is a reputational risk, and then there is a personal risk. Then we have data inconsistencies: with the influx of structured and unstructured data in the market, we see a lot of data inconsistencies. Then we have the regulatory risk. Now we have the GDPR and the CCPA, which become a hindrance when we have subsidiaries all across the globe, if we are domiciled in the United States. And then, lastly, we have the data privacy of consumers, which I just touched upon. Could you please?
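The cause-and-effect pairing described above can be sketched as a small data structure. This is a minimal illustration in Python, not something shown in the talk: the `Risk` class, its field names, the category labels, and the `by_category` helper are all hypothetical, and the entries only paraphrase the examples the speaker mentions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Risk:
    """One cause-and-effect pair: a threat and the consequence if it is realized."""
    threat: str                  # the cause
    consequence: str             # the effect once the threat is realized
    categories: Tuple[str, ...]  # hypothetical labels, paraphrased from the talk


# Entries paraphrase the examples in the talk; the wording is illustrative only.
RISKS = [
    Risk(
        threat="predictive analysis infers a shopper's pregnancy",
        consequence="marketing material reveals it to her family",
        categories=("privacy", "reputational", "personal"),
    ),
    Risk(
        threat="influx of structured and unstructured data",
        consequence="data inconsistencies",
        categories=("data quality",),  # label assumed for illustration
    ),
    Risk(
        threat="GDPR and CCPA apply across global subsidiaries",
        consequence="regulation becomes a hindrance to the project",
        categories=("regulatory",),
    ),
]


def by_category(risks: List[Risk]) -> Dict[str, List[str]]:
    """Group threat descriptions under each category label."""
    index: Dict[str, List[str]] = {}
    for risk in risks:
        for category in risk.categories:
            index.setdefault(category, []).append(risk.threat)
    return index


if __name__ == "__main__":
    for category, threats in by_category(RISKS).items():
        print(f"{category}: {threats}")
```

Keeping the threat and the consequence as separate fields mirrors the speaker's point that a single incident, such as the Target example, can carry several kinds of risk at once.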
So, okay, we talked about risk. Now let's talk about the risk management framework. What do we mean by a risk management framework? The definition presented here has been extracted from the articles I went through a few days ago, but I want to talk about my own definition. I would say that a risk management framework would enable me, first, to identify the project risks and, second, to identify the measures that would help me mitigate or manage those risks. We can never really reduce the risk to zero percent, right? We can always manage it; we can always mitigate it to a certain extent. Now, some of these bullet points are international standards we have been using in the industry. COSO is an enterprise-level risk management framework, we have NIST, which is for cybersecurity, and then we have ISO and we have COBIT. In the interest of time I have just jotted those down. Let's move forward. And I can't move forward. That's happening. Okay.
So we wanted to know whether, especially in the context of data science projects, we have any standard in place, any risk management framework in place. We did a thorough systematic literature review, searching through six repositories with various keyword combinations. We could figure out that the literature does talk about standards, but those standards are either generic, like the enterprise-level ones, or they are talking about cybersecurity, or cloud computing, or supply chain management. There is no such standard that we can take as a motivation to help us frame our own framework, maybe an international one, for the company we are working with, especially for data science. So here, as you can see, the count of risk management frameworks specifically for big data science projects is zero, and that paved our way to the next phase of our study.
Now we wanted to go to the market. We wanted to speak to the data scientists and hear their thoughts on how they are managing the risk, how they are monitoring the risk, and how they are processing the risk elements. So we developed two research questions for our study. The first is: what are the risk management processes for data scientists, and how are they deployed during the execution of the project? The second is: how are these risk elements identified, monitored, and mitigated? So we filed for an IRB, and we decided that we are going to do selective sampling and find a few participants. As you can see, the sample here is pretty diversified. We had set our criteria to interview only those participants who had more than two years of experience because, given the intricacy and the sensitivity of risk management frameworks, we wanted to interview only people who had substantial years of experience in the industry. From the industry column you can see that this sample is very diversified in nature. We did not restrict ourselves to private organizations or public organizations. We went to government, we went to healthcare, we went to IT, we went to management, finance, manufacturing, entertainment, so on and so forth. So we had a total of 16 participants whom we interviewed for this study.
Okay, now, we applied inductive analysis. We had two sets of matrices: one was the similar themes, and the second was the dissimilar themes. What we did is that we created bigger themes and then subsumed the smaller themes underneath them. So we essentially applied the inductive analysis approach, and these are some of the findings that helped us understand how data science risk is actually managed by these guys. Let's take a look at the first one, ID 1. Participant one is talking about how project management is used as a skill to manage the risk. Now, ID 1 is
from a defense agency, and he says that the team works like a venture capital firm: they would gather together, they would brainstorm the questions, and then they would apply a project management framework to manage the risk. They have their proprietary risk management framework. The second one is talking about the seven-step success criteria. He works for a conglomerate, and he is saying that they are document-heavy. They always rely upon documentation, and they have their brainstorming questions to start from when they think of executing a data science project. And then so on and so forth. So we had a total of 16 risk management processes for data science projects.
Another theme we could think of was to create two categories: the first encompassed the actions they take to identify the risk, and the second set of actions related to how they minimize the risk. As you can see, identifying risk was pretty much about asking questions. They always asked questions. Right at the commencement of the project they would sit together and brainstorm, and that would pave their way to the next step of the project. Everyone talked about asking questions except ID 14, because this guy was a sales manager and they had launched their data science project only a few months ago, so they have an ad hoc risk management framework going on in their team. And then they have their own ways of minimizing risk. They talk about the project timelines, they talk about the cost risk, they talk about the opportunity risk, budget risk, so on and so forth.
These are some of the quotes that I extracted from the transcription. The first guy is talking about how much they are dependent on documentation, because then they are careful and mindful of the errors that they made in earlier projects, so the next time they have a similar project they would not commit the same mistakes. But still, again, come to think of
it, they are unaware of the unknowns. They can only document the things that they know and can see; they cannot write about something that they haven't encountered yet. And that's where we find the need for a framework that would create a cushion for the unknowns. The second one is talking about the frequent discussions that they have at the beginning of the project, which is again context-specific, because the questions cannot arise until we know the specifics of a project. So those were not standard questions or a gold standard. And one of the participants talked about data privacy and the GDPR, how important it is for people to understand that we have a legal entity of a company, but then we have subsidiaries in other parts of the world. So who are we giving access to the data? Who are those people? Do we have the permission? Are we following all the regulatory constraints? So when we talk about data science, we also have to talk about the regulations in place.
Then another finding that we could come up with was related to structure. Some of the data scientists were talking about an ad hoc structure. Some talked about a structure that pretty much resembled the standard international frameworks we have in place for the industry. Some were centralized, some were hybrid, and some were decentralized. These are some of the quotes I extracted through the interview analysis. With a custom framework, they basically take bits of the international standard, then they curtail it, and then they customize the framework based on the requirements of the project. Then they have the centralized, hierarchical framework, where the subsidiaries form some sort of child-parent relationship with the legal entity. And then they have the decentralized and the hybrid. The other finding was related to COVID, how the data scientists were transitioning through COVID.
Initially they were working at work, going to the office, but later on they started working from home. So what were the changes they could see in themselves during that transition? One data scientist talked about enrolling in various online courses through LinkedIn. Some of them talked about the fatigue that data scientists are facing, because they are working all the time and are not getting any breather. Some of them talked about being prolific on these professional networks: they would share their own ideas, and then they would get feedback from the rest of the data scientists who are part of their connections. So that's how they were creating a sort of collective action with each other.
In terms of the limitations and the future steps, one of the things that we figured out is the unavailability of women participants for this study. We tried, and we contacted them, but somehow it did not really work out. So my question would be: why did they not participate? Was there any reason? And then what I would like to ask is how risk is perceived differently by gender, how a woman data scientist would perceive risk vis-a-vis how a male data scientist would. That would maybe be my future research to look at. Another limitation, I would say, is the small sample size. I mean, we tried to diversify it as much as possible, but we just had 16 participants. So as a future step we would like to expand the pool. We would like to invite more and more participants who can share their opinions with us regarding the risk management framework. As a part of the future steps, we are trying to create a survey instrument. We now have the findings in terms of the themes. We would create a list of those themes, send it to the participants, and seek their advice with different scenarios. We would like to ask them: if we change the scenario a little bit, do you think that you would still resonate with
the existing risk management process? If yes, then why? If no, then why? And then, with those interviews, we are going to suss out the common elements and try to see if we can create a prototype of a common, or enterprise-level, risk management framework for data science projects. And then, maybe later on, we are planning to have a focus group study with the data scientists. So this is pretty much it, the initial findings of my study. The floor is open for questions.
There was one that came in. I think it's been asked to the other speakers as well, but what do you think are the most important skills that new data scientists should have or, if they don't have them, should look to acquire?
Um, hmm. So I would actually resonate with what Neeraj mentioned, that a good know-how of data ethics is extremely important. Many a time, and this was one of the things that a data scientist who was a hiring manager mentioned, students, in the hustle to get a job, would mention some skills as keywords on the resume so that those resumes are picked up by the recruiters. But at the time of interviewing them, the hiring managers figure out that they just have superficial knowledge. So I would like to say that one thing you have to keep in mind is that having hundreds of keywords would not really build up or embellish your resume. Be very truthful to your skills. You have to be honest when you are building up your resume. Second of all, data literacy. If you actually look at the domain of data science, you would find that there's a scarcity of the right skills. So data literacy is something which is extremely important: how much you speak with the data. Skills can be acquired, but the name of the game is to actually analyze the data. That's where the entire gamut of data literacy comes into play. So that's important. Um, what else? I think those were some of the things I
would suggest. Any other question?
Yeah. The day-to-day activities of a PhD student would be: start with reading and end with reading, and maybe in the middle go for a break, have a nice meal with your friends, and be thankful and have a lot of gratitude. That's all. And, yeah, I mean, besides the philosophical answer, depending upon the aspiration that you want to go for, it's very important for you to build up connections. That's how I built up connections. When I was working with Jeff, I was also at the same time looking out for internship opportunities. Although I come with 12 years of experience in the industry, it was actually through one of the projects I was doing that I got the internship opportunity. So that's something you need to build up, depending again upon your aspiration, whether you want to go to industry or to academia. That's important.
So Brutal is asking how I keep in touch with the industry while being in academia. Did I get your question right? So how do I keep in touch? I would say that, for me, it always works out well when I am in touch virtually with all my ex-managers. I'm still in touch with all my ex-managers back in India, and even with the connections I built up here with Bloomberg. I always try to find opportunities to collaborate with them, even if it is just a very silly little thing, but it's important to let them know that you are in the game. So even "happy Thanksgiving" or "happy new year" or "happy birthday", it doesn't matter. You just need a way to keep that sort of connection with them. So I do that. And secondly, there were a few informal groups, like abilities groups, queer groups, and women in tech groups, that I was in touch with. I was also looking out for cross-border opportunities within the company, which kept me connected with teams other than my own core team. So that actually expanded my network. I just try to keep in touch. Any other question? Yeah,
go on.
Um, yeah, I was actually managing a team at Deutsche Bank prior to joining academia. I was managing a team of four in Bombay, and what I figured out is that even though we had the COSO framework in place, there were still loopholes that we did not recognize, possibly because we live in denial. We think that there is a framework, we are doing our job, so it's working out very well for us. But there is an unknown that we don't know about. And the interesting thing is that these data scientists had the same sort of mental map. They are talking about what we have right now, in the present, but they are not thinking of the unforeseen, unknown risks that might come their way. So that actually became my motivation.