Hi, thank you for being patient, and sorry about the technical difficulties. I'm Catherine, and thank you for having me here today. I'll talk about this a little later, but I've been to 40 countries, and Spain is one of my favorite places to visit, so I'm really excited to be here speaking in Madrid at Big Data Spain. Today I will be talking about building a data science team from scratch. I grew up in New York, so I picked up a little bit of Spanish, so my alternative title for today is how to start a data science team when everyone is still using Excel. I'm glad Excel translates, okay. So a little bit about me. I put my Twitter up here and some hashtags if you want to tweet along to the conference. I am based in New York City, and I am a senior data scientist and manager for Codecademy, which is a startup based in New York City. For those of you who don't know it, it's an online platform for people learning how to code. I've heard from a few people here who use it, which is always nice to hear. I was also the company's first data science hire. Before that, I had been doing statistical programming for over 10 years, which is really weird to think about, but six years of that has been in a professional capacity, and for the past four years I've been doing technical hiring and recruiting. Another thing I didn't mention is that I recently switched jobs. I started at Codecademy a little over a year ago, so for my talk today I want to talk about lessons learned from both sides of the hiring table, as a job applicant and now as a hiring manager. We have a lot to cover today, but my goal for the talk is that hiring managers will be able to create a good job application process and attract the right candidates for the role, and that job applicants will learn how to market themselves and how to crack the data science interview.
I think there's been a lot of literature about cracking the coding interview, but for data science it's still fairly new, and a lot of companies are still trying to figure out how to effectively screen data scientists. So let's get started.
So the first lesson learned. I think everyone in this room wants to hire data scientists, or has worked for a company that's trying to hire data scientists, or is trying to be hired as a data scientist. But the number one lesson that I've learned over the past few years is that no one can come to an agreement about what the job title and the job descriptions look like. I don't know how many of you have seen this article, but back in 2012, Harvard Business Review published an article that called data scientist the sexiest job of the 21st century, and it quickly went viral. So I pulled some data using the Google Trends package in R and plotted it in ggplot, and you can see the spike over here around the end of 2012 and 2013, right after the article was released. That's the inflection point when suddenly everyone was like, I need to become a data scientist. And you can see that data analysis has been fairly stable, if not kind of declining over time. A lot of people go back and forth between data analysis and data science, and as a data scientist, and I'm sure this is true for a lot of you, a question that I hear a lot from people is what's the difference between a data analyst and a data scientist. And I think it's all very relative. On job titles, I recently discovered Data Janitor is a job title. I really want to get my title switched, because I think it's an accurate description of my job. But because data science is such a new field, there isn't a lot of uniformity. A data scientist in one organization might not be a data scientist in another, and even within one single organization you have data scientists doing a lot of different things.
So one of the other trends that we've noticed, and I missed Paco's talk, but I think he talked about this too, is that hopefully we're going to trend towards more companies standardizing job titles. What we're seeing in the US is a lot of the bigger tech companies doing this. Lyft just announced that they're turning all data analysts into data scientists and all data scientists into decision scientists. And I believe they did the same at Facebook, Spotify, Etsy, a lot of big companies. So I think that, with all these big companies doing it, it will kind of guide the rest of the industry towards standardizing job titles. But for now I'll talk about the types of data scientists. I really like this description that the director of data science at Stitch Fix gave in a Quora answer, years ago actually, but it's still highly relevant, where he kind of generalized it into type A and type B data scientists. Type A stands for analyze. It's pretty easy to remember. Type A data scientists, I think, are the most commonly needed across organizations. These are data scientists who analyze, right, they make sense of the data, and they don't really push models into production. Most companies don't really need to do that to begin with. But they still have the statistical knowledge to make statistical inferences from the data. They also know a little bit of SQL, and a little bit of R or Python to clean the data, which isn't really taught in a traditional statistics curriculum. You're generally dealing with large data sets, and no matter what kind of data scientist you are, there are some standard foundational skill sets that everyone has. So data visualization, maybe eventually you specialize, and one of the biggest things is just communicating your findings. Once you come to a finding, you summarize it in a way that makes sense. Type B data scientists build.
So again, there's a lot of overlap with type A, so maybe everyone can be a little bit of both, but the data scientists who build are the ones that typically work more with ML models. It's a lot more common in the tech industry because they're pushing models into production. And the data that they deliver ends up in the product. You see that with Netflix recommendations, the Uber pricing algorithm, and so on.
So the second lesson that I learned is that no matter how badly you want to do machine learning, the stage of your organization will determine what type of data scientist you can and should hire. A little bit more about my background. Before Codecademy, I worked for JetBlue. It's an airline in the States. They don't fly to Europe, so they're basically like the Vueling or the Ryanair of the US. And they're the fifth largest airline in the US. It's pretty big, but basically over the course of a year and a half, I went from working at a company with over 17,000 employees to working at a small startup with about a hundred people, building out their first ever data science team. That was a huge transition for me. And I learned that companies at different stages have different data science needs. When I was working in corporate for six years, I felt like I was doing a lot of the same tasks over and over again. I was also stuck with a lot of legacy infrastructure and software that served as blockers and meant I couldn't innovate as much. Whereas once I moved to a startup, a lot of the infrastructure wasn't there. So whatever I wanted, I had to build. I had to work with software engineers and data engineers. I had to request everything and basically start from scratch.
So I think before you even start recruiting for a data science team, or building a data science team, you have to think about what type of data science work your organization needs in the immediate term, and just be honest about it. What are we actually looking to achieve? There are the different stages of companies that I mentioned. Companies in the early stages of data science will focus on data collection. A lot of it is designing the schema, designing the database, making sure ETLs run, maybe aggregating summary statistics overnight so you don't have to run those during the day. You're really just building out that foundation for analytics. In the past, an alternative job title for this was business intelligence or reporting. You're just making sure there's basic visibility of the numbers, and it's a lot of counting. Then when companies hit a growth stage, your data warehouse maybe still needs maintenance and work once in a while, but it's in good enough shape that you can produce insights. I would say that Codecademy is in the growth stage. We're no longer truly a startup, we've been around since 2012. The data warehouse is in pretty good shape, and we're capturing event data. It's pretty clean. It could be better, and things can be slow, but we can run experiments and have them reach significance. And companies that hit scale, that's when you get to deploy models and scale up all the analysis that you were doing in the growth stage. So maybe in the growth stage you were building models about lifetime value or customer churn, but you were doing it more to understand the business, and it's fairly static. It doesn't really go into production and change what the user sees. So that's just something to note when you're assessing the state of your data warehouse.
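As a small aside on what "run experiments and have them reach significance" can mean in practice, here is a minimal sketch of a two-sided two-proportion z-test in Python. The function name and the conversion counts are made up for illustration, and this is one common way to check an A/B test, not necessarily the method used at Codecademy.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / n_a: conversions and users in the control group.
    conv_b / n_b: conversions and users in the variant group.
    Returns (z, p_value). Assumes samples large enough for the
    normal approximation to be reasonable.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical numbers: 4.0% vs 5.2% conversion on 5,000 users each.
z, p = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With little traffic, the same difference in rates gives a much larger p-value, which is one reason an early-stage company often cannot run this kind of experiment yet.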
When I say stages, it's not to conflate the age of a company with what stage their data warehouse is in. That's because one of the biggest things I learned from working at older companies is that a lot of these companies have a lot of technical debt. Coming from an airline, I think some of you flew here and fly pretty often, so if you go to the airport and you see the type of technology that airports work with, that was the type of technology that I had to work with at an airline. It was really frustrating. Data scientists always complain that they spend like 80% of their time cleaning the data and 20% analyzing it. For me it was like 90 or 95%. We just had this really old legacy software and old infrastructure, and it was prone to breakage, and it made it really hard to analyze anything, much less push models into production.
So lesson three. Now that we've gone over type A and type B data scientists, a lot of people ask, well, I'm hiring my first data scientist, what kind of data scientist should I get? Should I get someone who's into, like, natural language processing? Should I get someone who's more of a generalist who can do a little bit of everything? Where should I focus my efforts? And the biggest thing is that when you're starting out, it makes sense to go with a generalist and not specialize too early. Of course that might be different depending on the goal of your business. If you're strictly doing NLP, then it might make sense to get a specialist early. But a lot of this is because data scientists are expected to do all these different things, so to have a little bit of expertise in all these domains. In reality it's impossible for someone to possess all of this knowledge. At best you hire people who are stronger in different areas, and you collectively build this repository of knowledge where you're sharing your code, doing code review, and sharing ideas.
I borrowed this from the head of data science at Airbnb. I thought this was a really good summary as well. Data science analytics and data science inference would kind of fall under type A data scientists. A data scientist doing analytics might do more data visualization but not claim any causal relationships from the statistics. They might have a background in, maybe, customer insights, and usually you don't need as much of an advanced degree. For data scientists who do statistical inference, you want stronger statistical knowledge. If people are drawing conclusions and saying that the math backs it up, they have to have sound knowledge of that math. And then the middle one, data science algorithms, would fall under our type B data scientists. This is where you're building algorithms, and the algorithms are the product. The algorithms are what they're producing.
So I'm going to switch tracks now to technical interviews. This is the part where, if you're a hiring manager or involved in technical interviewing, you can learn how to screen candidates, and if you're looking for a job, you can learn how hiring managers are thinking about this and what they're looking for. The fourth lesson, and the first lesson that I learned about technical hiring, is that the hiring process for data science kind of sucks. It's not that standardized. Compared to software engineering, where there's a whiteboarding session and kind of general things that everyone does, a lot of data science interviewers are still trying to figure out how to even do that technical screen. And the thing that I realized as I was looking at resumes is that everyone claims to be an expert: expert Excel, expert SQL, expert, I don't know, expert everything. And some of these people are just out of school, so I'm like, how can you be an expert at anything? So this is kind of what our hiring process looks like at Codecademy.
Since we're small, it gave us more time to look through resumes, admittedly, but if you're getting started with hiring data scientists, I would recommend that you invest more time in it. We start with the resume review, and I try to screen out candidates whose skills maybe don't really make sense together. If someone has a lot of experience in Excel, then they're more of a business analyst than a data scientist. It also helps if someone actually specifies what R or Python packages they work with. A lot of people put down R or Python because they learned it in school, but if they can list what packages they actually use, NumPy, pandas, tidyverse, data.table, that tells me that they actually use R or Python. Once people get past the resume review, we move on to the phone interview. Throughout this I'm screening for two things. There's the technical skill, but then there's also the business sense, or really just common sense: you come up with these numbers, but are you looking at things in a way where you know that there's something that we can do with them? During the phone interview there are two questions that I really like to ask. I mentioned the first one: what's your favorite R or Python package, and why? If someone mentions a different coding language, I can ask it for that as well, but I'm really just looking for a quick answer that gives me a better sense of how quickly someone can come up with an example of a workflow or a way that they actually use R or Python. Question two is where I ask for a case study: tell me about an analytical project you worked on recently. This is where I'm looking for someone who can explain things in both a technical and a business way and talk about it in detail. Phone interviews are about 45 minutes long, I forgot to mention.
So in the last 15 minutes, if the candidate passes the two questions and I get the sense that they might be a good fit for the role, we'll move on to a pair programming exercise. You can't see the screenshot very well, but this is what CoderPad looks like. A lot of software engineers use it as well for remote pair programming exercises. Two people or three people, however many people, can be in the window at the same time remotely, and as you type you can see each other type, so it's a really great tool for pair programming. We use it for SQL because that's kind of the lowest common denominator that we look for in data roles. I'm mostly testing for advanced SQL queries, aggregates, joins, and making sure people can get through it pretty quickly, because we have to write a lot of SQL during the day. A second thing that I screen for, but that isn't always necessary, is whether the SQL queries are optimized. If people are doing SELECT * all the time, then I know that they maybe haven't worked with very large databases. I don't discount people for it, because I don't think SQL is that hard to learn once you pick it up, but it's a bonus point. I definitely recommend that data scientists get better at SQL, because I've seen a lot of very bad, poorly optimized SQL code from data scientists. After the phone interview and the pair programming, when people pass that, we send a take-home project by email. We give them a toy data set from our company data, so it's actually relevant. We want people to be excited about the questions that we analyze, and we want to see how they handle a real-life situation.
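To give a rough idea of the kind of join-plus-aggregate query this SQL screen is about, here is a minimal sketch using Python's standard-library sqlite3 module. The tables (users, lesson_events), their columns, and the rows are all made up for the example and are not Codecademy's actual schema or interview question.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        plan    TEXT NOT NULL,
        country TEXT NOT NULL
    );
    CREATE TABLE lesson_events (
        event_id  INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL REFERENCES users (user_id),
        lesson    TEXT NOT NULL,
        completed INTEGER NOT NULL  -- 1 if the lesson was finished, else 0
    );
    INSERT INTO users VALUES (1, 'free', 'ES'), (2, 'pro', 'US'), (3, 'pro', 'ES');
    INSERT INTO lesson_events VALUES
        (1, 1, 'sql-1', 1), (2, 1, 'sql-2', 0),
        (3, 2, 'py-1', 1),  (4, 2, 'py-2', 1),
        (5, 3, 'sql-1', 1);
    """
)

# Name only the columns the question needs instead of SELECT *,
# then join and aggregate: events and completions per plan.
query = """
    SELECT u.plan,
           COUNT(*)         AS events,
           SUM(e.completed) AS completions
    FROM lesson_events AS e
    JOIN users AS u ON u.user_id = e.user_id
    GROUP BY u.plan
    ORDER BY u.plan
"""
rows = conn.execute(query).fetchall()
for plan, events, completions in rows:
    print(plan, events, completions)
```

On a five-row table the SELECT * habit costs nothing, but on a wide event table with many columns, selecting only what the aggregate needs can reduce how much data the database has to read and ship back.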
So we give them a week to send it back, and I like to tell the candidates in the email, don't feel the need to work more than three or four hours on this. We give you the week so it can be flexible. Data science is such a competitive field that you want to make the process as smooth and flexible as possible, but you also want it to demonstrate the actual work that they'll be doing. The findings should include business conclusions, and that's really what we're looking for, just sound conclusions about the data. We recommend that people include supporting data visualizations or models, and people usually do. When someone passes that, they come in for the final onsite interview, and this one is a work in progress. We do in-person whiteboarding as well. Since I already tested people in SQL, and I can usually see someone's Python and R code in the take-home project, for the final onsite I usually work with them to talk about designing a database or schema. So let's say we're launching a new product and we need to log the clicks for it: what would you want it to look like? We just work through that to see if they come up with an answer that's realistic and makes sense, and that if they're making assumptions about the data, they're able to defend them and explain why. Then they go through the typical one-on-one meetings with the rest of the team, the meet and greets, just to make sure it's a good fit for everyone.
So lesson five. I mentioned this during the resume screen, but something that I think a lot of people do that's a huge pitfall with technical hiring is limiting your hiring pool by focusing on job titles. We looked at the graph of the boom of the term data science, and that only happened in the past decade. I've seen job descriptions where people are like, we want at least 10 years of experience as a data scientist, and that person doesn't exist.
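Going back to the onsite schema exercise for a moment, here is one possible answer a candidate might sketch for logging clicks on a new product, written with Python's standard-library sqlite3 module. The table name click_events, its columns, the index, and the sample rows are all assumptions made for illustration. There is no single right answer, and the interesting part in an interview is defending choices like these.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per click: who clicked, in which session, on what, where, when.
    CREATE TABLE click_events (
        event_id   INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        session_id TEXT    NOT NULL,
        element    TEXT    NOT NULL,  -- e.g. a button identifier
        page       TEXT    NOT NULL,
        clicked_at TEXT    NOT NULL   -- 'YYYY-MM-DD HH:MM:SS', assumed UTC
    );
    -- Assumption: most analyses filter by user and time range.
    CREATE INDEX idx_click_events_user_time
        ON click_events (user_id, clicked_at);
    """
)

# Hypothetical sample rows.
sample_clicks = [
    (1, "s1", "start_button", "/new-product", "2018-11-13 09:00:00"),
    (1, "s1", "next_button", "/new-product", "2018-11-13 09:01:00"),
    (2, "s2", "start_button", "/new-product", "2018-11-13 10:00:00"),
    (2, "s3", "start_button", "/new-product", "2018-11-14 08:30:00"),
]
conn.executemany(
    "INSERT INTO click_events (user_id, session_id, element, page, clicked_at) "
    "VALUES (?, ?, ?, ?, ?)",
    sample_clicks,
)

# The kind of question the log should answer cheaply: clicks per day per element.
daily_clicks = conn.execute(
    """
    SELECT date(clicked_at) AS day, element, COUNT(*) AS clicks
    FROM click_events
    GROUP BY day, element
    ORDER BY day, element
    """
).fetchall()
for day, element, clicks in daily_clicks:
    print(day, element, clicks)
```

Storing one narrow row per event keeps the raw log flexible, and summaries like the daily counts can then be aggregated separately rather than at query time.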
So I really recommend that you don't focus on job titles and that you assess candidates based on their skills. A lot of industries haven't even adopted the data science title yet. I think government and some older industries have been a little slow to adopt it. It's common in the tech industry, but like I said, if you limit yourself to looking for data scientists in the tech industry with 10 years of experience, you're going to be disappointed. Some alternative titles that you can look for are analysts, engineers, quants, specialists, and as we saw before, even data janitors. For job seekers, I think the same applies. When you're reviewing job descriptions, don't focus on the job title. I know a lot of people really want the data scientist title, and it does provide a lot of mobility. But a title is a title, and if you're not getting work that challenges you to actually grow as a data scientist, then it's not going to help you grow in your career. So don't just focus on the job title. Also, a lot of the time job titles are negotiable anyway. Look at the responsibilities. One of the things that I did instead of searching for job titles was search for the tools that I worked with. So I looked for SQL, or Python, or clustering. These are some examples of things that you can put in if you want to be more specific, although companies aren't great about listing these yet. Sorry about that, more technical difficulties. For interviewers, it's the same thing: just don't focus on the title too much, and write truthful job descriptions. One of the things that I highly recommend to hiring managers, or HR, or if you're involved in the hiring process in any way, is to be really specific. Think about whether, if you were applying to this job, you would get a sense of what you'd actually be doing out of this. So list different projects and tools. Make it representative of the actual job.
So lesson six. I really like this GIF.
I don't know where it came from, but I thought it was funny. I also don't really know what's going on, like who's who in the relationship, but it's SQL, Python, and R. I think the biggest lesson about technical hiring, if you want to expand your pool, is to be flexible about what tools you're allowing people to use, and the same applies once people start. If Python isn't a strict requirement, then in the job description put down Python, R, maybe even SAS or SPSS, and allow job applicants to use their preferred stack. And for job seekers, use the tools that you're comfortable with to showcase your skills. I made this mistake when I was applying to jobs, where I was learning something new and I was really excited about it and I wanted to show it off, but I wasn't that good at it, and I didn't get that job. Learning is great, but when you're interviewing for a job and it's a job you want, now's not the time to experiment with things that you're not super familiar with. This is the time to really showcase what you're good at. And for interviewers, be flexible. As I mentioned, I communicate explicitly to candidates in the email, feel free to use whatever you want. Even though I made fun of Excel at the beginning of this talk, in the email I even say, if you need to use Excel for some reason, feel free. Sometimes that's okay. It depends on the role, but for some of our entry-level roles, people are still weaning themselves off of Excel, and if they're enthusiastic about it, that's okay. They can learn on the job.
So, final lessons. Lesson seven: even though Harvard Business Review thinks we are the sexiest profession, data science isn't always the sexiest job. As an example, I originally wanted to do a different talk, about modeling and time series analysis, and I came up with this great example of Google searches for Croatia in Spain. You can see how it peaks every other summer, when Spaniards are looking to travel to Croatia.
I don't know why it's every other year, though, but I don't know that much about Croatia. But there is a sudden spike. There was an outlier this year because of the World Cup final. So this is a fun toy data set to look at, and anomaly detection was something that I was using at work and thought was exciting to talk about. But on a day-to-day basis, my typical workflow looks a little more like this, where I feel like I'm constantly putting out fires. I'm sure a lot of you feel like this as well, where the CEO or one of the product managers comes up to me and is like, why does this number look like this, is something broken? I'm doing a lot of that, and managing expectations. But a lot of this work is really important: the putting out fires, the managing of the day to day, and also just planning out the roadmap for the team, like I mentioned today.
So, as a final takeaway for job seekers and hiring managers: make sure you're prioritizing finding a good mutual fit. In the job market in the States, you basically graduate from college and the starting salary for data scientists is like six figures. So how can you compete with that, right? You have to keep your data scientists happy, and you have to make sure that you're offering your data scientists challenging problems and offering them visibility. You're letting them do good work, work that they're proud of and interested in. When you're looking for a job, really think about what type of role you want. Do you want to be a generalist or a specialist? For me, I like working on a bunch of different things. I specialized for a long time, so I really wanted to go back to generalizing. And do you want to work at an early-stage or a late-stage company? All of those are personal choices that you should think about before you start your job hunt. And just be really clear about what skills you bring to the table. Now's the time to showcase what you can do.
And for hiring managers, I think just be realistic about what your company needs for maybe the next year. Don't think, five years from now we want to be doing this type of modeling, so we should hire this person. That's good for no one. It'll take you a lot of time to find that person, and that's not realistic. So really think about the skills that are nice to have versus the ones you need immediately. For skills that are nice to have, screen people based on their potential. Data scientists are really curious by nature. We learn really quickly. I have a lot of people on my team who learned so much on the job, and it's been amazing. So again, ensure that you're offering mutual growth and satisfaction, and keep your data scientists happy and interested. I'm going to post these slides online, and there are some additional resources in there. And that's it. Thank you for having me, and I hope to see you all soon. Let me know if you have questions. I think we have a few more minutes. Wait, do we have time for questions? Oh, okay. I think we're out of time for questions, but I'll be around. There were technical difficulties, but you can catch me on this side. Thank you.