 Hi, this is Brandon Rohr with some thoughts on data scientist archetypes. In other words, what kind of data scientist do you want to be? When you first come to the field of data science, even if you've been in it for a while, you can be easily overwhelmed by the number of tools and the vast range of skills required. Blog posts about what it takes to become a data scientist devolve into laundry lists of skills that you don't have yet and tools you haven't learned yet. This is discouraging and it's counterproductive. My goal here is to give you a map for navigating the sprawling terrain of data science. It's to help you prioritize what you want to learn and what you want to do so that you don't feel lost. These ideas are not mine alone. They're based on a series of interviews with the data scientists listed here who are from several different companies, as well as contributors from academic data science programs. The underlying ideas are captured in a public document offered as a resource to academic data science programs from industry data scientists, an attempt at a virtual industry advisory committee. Any insights you get from this material that I'm presenting here, I credit to this group, but any errors or inaccuracies are probably due to me alone. Overall, this is a compilation of a lot of people's ideas who have spent time thinking deeply about it. The real reason data science feels so big is because it's no longer a single field. There are actually three separate pillars. There's data analysis, which is taking raw information and turning it into knowledge that can be acted on or that can drive a decision. There's data modeling, using the data that we have to estimate the data that we wish we had. And then there's data engineering, which is taking these analyses and modeling activities and making everything work faster, more robustly, and on larger quantities of data. So these three areas are all distinct. They all require different skills and different sets of tools. If you feel overwhelmed when you approach data science, it's because you're actually approaching three different fields at the same time. Now let's dive a little deeper into each one of these. There are several aspects of data analysis. The process of taking data and turning it into information that we can use to make a decision. There is domain knowledge. This is the ability to translate a business need or a practical concern to a question. It involves making trade-offs between how accurate the answer should be and how much time and money you want to spend finding it. Domain knowledge is also absolutely necessary for interpreting the data, for understanding not just the numbers and labels presented, but what they reflect in the real world, and what we can realistically assume about their accuracy. Another important part of translating information into actions is research. This can be as straightforward as performing the right Google search and reading a few documents. Often it involves instrumenting processes that are already in place, like logging code, or adding sensors to a production line, or conducting surveys. For some applications, research involves careful design of experiments. It requires using sophisticated statistical tools to plan what information will be gathered, how it will be gathered, and how it will be analyzed after the fact. Finally, an important part of data analysis is interpretation. Given a large collection of data, how can we summarize it, aggregate it, visualize it, and use statistical summaries to go from a sea of bits to a nugget of knowledge? In particular, I want to call out the art of visualization. Turning numbers into a picture that is easily interpreted and conveys useful information is an art as much as a science. It involves an intuitive sense of human visual perception and understanding. So the pillar of data modeling also has several major subcategories. This is often referred to as machine learning, but it's no more or less than the creating of a simplified description of your data that you can use to make estimates for data that you haven't measured. Either something that you didn't have a sensor for or something that's going to happen in the future. The three main subclasses of machine learning are supervised, unsupervised, and custom model development. Supervised learning uses labeled examples to discover patterns. These can be categorical labels. For example, did this customer revisit our website within the month after making a purchase, yes or no? Examples can also be labeled with numbers. For example, how many views did a video receive? Using a large collection of labeled examples accompanied either by their category or value, supervised learning algorithms can distill out the underlying patterns. If the labels are categorical, this is called classification. And if the labels are numerical, this is called regression. Also, sometimes the goal is not to make a specific prediction about future examples, but just to determine whether they're in line with what's been seen previously. This is called anomaly detection. A second major area of machine learning is called unsupervised learning, indicating that the data doesn't come with labels. We don't know the outcome or the right answer, but in unsupervised learning, the goal is to discover patterns in how the data is distributed. In clustering, groups of data points that are nearer to each other than to the rest of the data set are identified. These are the natural groupings in the data. For example, this can be done to find shoppers that buy similar items at the grocery store. In dimensionality reduction, groups of variables that tend to behave similarly, or in a coordinated fashion, are identified as well, such as the set of genes that work together to produce a desired trait. In dimensionality reduction, it's not the data points that are being clustered, but the variables. It's not the rows of the table, but the columns. In applied data modeling, it's very common to have to do custom algorithm development. Both supervised and unsupervised learning methods are general. They begin by assuming very little about the source of the data, what it means and what domain it resides in. General algorithms get away with this by sheer statistical power. They're at their best when you have a large collection of data. However, in practical applications, it's often the case that we don't have nearly as much data as we would like, and we don't have enough to rely on naive machine learning techniques. When this happens, we get to do custom algorithm development. We build in the information that we know, or feel comfortable assuming, about the domain. For instance, one important modeling problem in the agriculture industry is to predict the yield for a field of corn. That's the total weight of the harvested grain. This is an unfathomably complex system with soil and texture chemistry, genetics, temperature, and moisture all interacting. And it takes a year to collect a single data point. There's no feasible way to learn a complete crop model using naive machine learning methods. However, geologists, chemists, biologists, meteorologists, and geneticists have all investigated and described the pieces of this system. And so by assembling what is known and making some reasonable assumptions, we can create a tractable crop model. Moving on to data engineering, this is the third pillar of data science. It has a few major areas of its own. Data management is the exercise of storing, moving, and handling data. It's trivial for data sets of a few thousand or even a few million data points. But when individual data points become very complex, like an individual's genome, or the number of data points gets into the billions or trillions, data no longer fits inside RAM on a laptop. And you have to use an entirely new set of tools. Another big task that data engineers have is taking code that works well in prototype and making it ready for operation in the wider world. It is one thing to build a model in Jupyter Notebook, but to be able to use that to process and to filter every incoming customer request. The code needs to be made compatible with the rest of the company's code base. And it needs to be able to ingest data and publish results in the appropriate formats. Any glitches or failures need to be handled gracefully, and the system needs to be made secure against mischievous or malicious users. The gap between a smoothly running prototype and production code is a big one. And part of data engineering is the process of closing it. In both data management and production, software engineering plays an important role. That's the act of building things out of code that not only do what you want as fast as you want, but also can be shared and maintained across your current team, adapted to future changes, and scaled up or down as needs evolve. This is a craft, and it comes with a large collection of tools and specialized skills. There's also a surprise pillar number four, data mechanics. This is the dirty work that everyone needs to do, but nobody likes to talk about. It includes things like data formatting, making sure your types are consistent, getting information you want out of strings, making sure the strings in use are valid, handling errors gracefully. It also includes value interpretation, handling dates and times in an interpretable and consistent way, responding correctly to missing values, making sure units of measurement are consistent and documented. And it includes the fundamentals of data handling, querying, slicing, and joining. This lets you pull the subset of data that you want for an operation from the great lake of what you have available. Now we have our four pillars in place, data analysis, data modeling, data engineering, and data mechanics. What does this mean for you as a data scientist? These areas are all very different. I propose that a data scientist is someone who has, at the very least, exposure to all these skill areas. That means that you're not necessarily an expert or even good at them, but you know what they are, and you know the right questions to ask, and the right Google searches to run. A beginner data scientist might have just a little bit of experience in each of these pillars. With this as a base, we can walk through some data scientist archetypes. It's my hope that these will be helpful to you in planning out your training and career trajectory. One path forward to becoming a data scientist is to become a generalist, to develop a solid mixture of proficiency and analysis, modeling, engineering, and mechanics. A nice balance across the board gives you a broad view of problems of varying scopes. It may mean that you'll need to collaborate with specialists for some projects, but it sets you up very well to be a technical leader or a data executive. Just a word of caution. There may be a tendency to neglect data mechanics, or to assume that someone more junior will take care of it. The person who does this is a data scientist, Diva. This is an anti-pattern. This is someone you do not want to be and you do not want to hire. Mechanics can get gritty, and some days it'll occupy the majority of your time, but you never outgrow the need to do it. Embrace mechanics tasks, learn the skills you need to do them adequately, and get on with it. Now there are three specialist archetypes as well, emphasizing each of the three main pillars. If you choose to emphasize data analysis, you can become a detective, a master in discovering the right data and using it to draw conclusions. A detective is familiar with modeling and engineering methods and uses them in a lightweight way, but analysis is their focus. Another data scientist archetype is the oracle, someone who has mastered the skills of modeling and the tools of machine learning. They will have had exposure to analysis and engineering, but their focus will be on modeling, either general machine learning models or custom domain specific models. Finally, you can choose to emphasize data engineering and become a maker. This is a person who helps everything work fast and reliably. Their work will still require some analysis and some modeling and definitely some mechanics, but they'll be focused on engineering tasks. Makers are the ones who transform good ideas into concrete machinery. Finally, there is the unicorn. This person is a master in every pillar. This is the data science ideal, except of course that it doesn't exist. Becoming a master of one pillar takes a long time. Becoming a master in all three is essentially stacking three careers on top of each other. It's not impossible to do, but it's a long road, and the path down it passes through one of the other archetypes as a waypoint. The archetypes then that are the most useful for career planning are the generalist, detective, oracle, and the maker. Another way to represent these is in a 3D plot with axes for analysis, modeling, and engineering. You can see the detective, oracle, and maker each lying close to an axis. The generalist is a balance of the three sitting in the middle of the space. I want to be completely clear that I am not implying any hierarchy or value judgment between the archetypes. They're each invaluable and they serve different functions. Let your preferences and interests guide you as you choose between them. There's no commonly accepted taxonomy for data science specializations. When looking for a job, I recommend ignoring the job titles. There's no reliable way to translate between job title and archetype. I also recommend not worrying too much about the specific tools listed, languages, programs you ought to know. Instead, look at the skills implied. Read between the lines to discover which archetype they want. Does the posting sound more like a maker? A generalist? A detective? Based on this, you can decide whether this particular posting is a good match for you. Also, when posting a job, ask for specific skills. Don't ask for all of them. And don't ask for a unicorn. You won't find one. What I hope you take away from this, more than anything, is that you don't have to be an expert at everything to be a great and successful data scientist. It's enough to choose one direction and dive in. Good luck. If you haven't already picked your path, I hope this helps you plan it, and I wish you lots of success along the way.