Hello and welcome everyone. My name is Eric Fransen. We would like to thank you for joining us today for this webinar, a production of Dataversity with our speaker, Charlie Vollmer of ML Engineering. Today, Charlie will be discussing stepping into data science. Just a few quick points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. We will, however, be collecting questions in the Q&A module box in the bottom right-hand corner of your screen. You may see a chat box as well. That's very useful for reaching me, Eric Fransen, your host. I will be happy to help you out with other issues during the webinar, but if you have content questions about the webinar, please do put those in Q&A. We will reserve time at the end of the hour to get to as many of those as we can. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information that may come up during the webinar.

And now a few words about our speaker. A Minnesota boy at home in the mountains, when Charlie Vollmer is not teaching his six-year-old how to ski powder, he's using mathematics to tell computers how to discover patterns in data. He believes anyone can do machine learning and that by sharing information on computer science, we are all better off. He thinks that if you only give him the chance, he can teach you any statistical concept and that you'll walk away actually thinking positively about math. I, for one, already feel smarter. Please welcome Charlie Vollmer. Charlie, are you there?

I'm here. Thank you very much, Eric. And I want to say thank you to everyone out there attending, and thank you to Dataversity for hosting this webinar today as well. I take it that I can be heard and everything is all right?

We hear you loud and clear.

Great. So this is going to be kind of a whirlwind stepping into data science today.
I like to get to know my audience, and this webinar setup kind of hampers that: I can't really tailor to my audience so much because I don't know your backgrounds and I can't really reach out too much. So please leave any questions and tailoring to the Q&A at the end. But other than that, I'll introduce myself. My name is Charlie Vollmer. I am a consultant with ML Engineering. We're a boutique consultancy firm based out of Colorado, but we work with anyone anywhere, essentially. I am a statistician and a mathematician. I consider myself a strong programmer, but I want to point out to everyone and really stress that if you have the desire, you can take the steps that I'm about to lay out for you in your free time, even say 30 minutes a night sitting in bed or on your couch, and you can take these steps to become a data scientist and break into data science as well. And I'll try to present to you and convey how I believe it's important across all industries and all disciplines, for people all over. So let's get started. First I want to get on the same page with everyone as to what data science is, and what data science is to me, so I'll go over what data science is. Then I'll get into a more technical dive into how to actually become a data scientist and lay out concrete steps. We'll go into case studies where we actually see data science in action, if you will, and you will leave here with all the resources needed to begin your journey down the data science path. So what is data science? Well, I define it as turning data into insights, making it useful in various ways. And you may say, well, that's not new. Isn't that what we're doing already? When I go out and consult, that's not what we see. So I'm going to give you a few things to keep in mind about what data science is not.
It's not BI in the traditional sense, where we're kind of looking in the rear-view mirror, right? It's not just operational reporting. It's not just slicing and dicing warehoused or siloed data. Data science is really forward-looking. Okay? So there's a focus on predictive power. We want to be able to make statements about the future: about future customers, about future products, about future interactions within your domain, whatever that may entail, right? It also has a focus on data mining, or pattern recognition. Not all the information contained in the data comes with pre-canned questions, right? And the patterns that can be found actually inform decisions. You use data science to discover the questions to even ask, in many circumstances. And there's also kind of a new-age twist on data analytics, which is real-time analytics. Data science really enables those data products that can inform us immediately, where traditional analytics and BI approaches and tools are kind of ineffective at delivering this type of approach. And if we look at what data science is, it's really a process. Here, we're understanding and collecting. We're exploring and cleaning. We're modeling and validating those models. We're deploying and communicating and telling stories from our data. There's really no clear beginning and no clear end to this picture here. And we're actually learning at every stage. So at every stage, it's not kind of like, oh, what is the answer, right? Because this is actually a continuous, iterative process. Iteration is really the foundation of what data science is. So I can define what a data scientist is as well. I would say it's people with skills in both coding and statistical modeling. All right? Those are the foundational skills that a data scientist has.
But if we look back at this process here, there's actually a lot of exploring, communicating, understanding, and storytelling that the traditional coder or mathematician doesn't quite encapsulate. So I also add to that an intimacy or familiarity with the business or problems that you're addressing. And if we look at it from this perspective, it's not really a data scientist, but rather a team. And I want to get that point across. It's a collaborative effort from the business, IT, and statistical sides. People with more business acumen are the ones better able to ask questions and form hypotheses; they're perhaps more intimate with the business processes and the goals of the business, and aligning that with the data is vital. We need people that know databases and can clean and munge the data. We also need the mathematical, statistical side that can perform the pattern-finding, data mining, and predictive model building to actually make use of it. And this is a nice little chart here, showing the mixes of skills that different people within data science actually have. So we see that a data business person has more business acumen, but still needs familiarity with statistics and machine learning to understand how we can make use of those tools to produce actionable insight, whereas there are other roles within data science which are heavier on programming skills, heavier on the machine learning and statistics side. But in general, there's a vast spectrum of skills and people, and that is really what makes up a data science team. And here are some everyday examples so that we can see and be on the same page. Things like Amazon's product recommendations. Not only can they recommend products to you, but they're using data science to beat competitors, using machine learning for their distribution centers. They're better able to put products near the people who are actually going to buy them. Facebook and LinkedIn connection suggestions.
This is something that a lot of people are familiar with: here's a job description that fits your profile well, or here's someone maybe you want to connect with. Targeted advertising. You can think of this in the traditional sense of, you don't see kids' toy commercials during primetime television, but rather on Saturdays when they're actually putting cartoons on the TV, right? Well, there's a new twist. Now Facebook knows exactly your demographics, and they can target advertising even more: micro-targeted broadcasts to the individual. Fulfillment centers. Walmart is ahead of the storms; it's putting the things that it needs where it needs them to be. Amazon, the micro-organization of distribution. Product personalization. If you're a financial company like Wells Fargo: will Charlie Vollmer get upset if Wells Fargo sends me 24 e-mails with 24 different products? Yes, I will. But if Wells Fargo has the ability to look and say, oh, your data shows that you seem to be spending lots on groceries out of this account, maybe you want to try this credit card product which offers 5% cash back on groceries, right? And then a quick question: do you think that data science is in your organization? Well, just ask yourself. In my organization, is data something that's merely managed, or is it something that's actually profited from, harnessed, employed, and capitalized on? And you'll see that in a lot of organizations, the ones that we consult for, we definitely find that managing is essentially all that they seem to be doing. And to compete tomorrow, you need to be able to harness and profit from the data that you have. It's valuable. Okay, how to become a data scientist? Well, I'm going to tell you that there are five easy steps. They're right here. You need to learn the tools of the trade, right? You can't possibly do data science without computational tools, and these are the tools that you need. You need to know probability and statistics, okay?
And nowadays, with the Internet, you can actually learn probability and statistics and sharpen your own skills in those domains using the tools above. You need to get your hands dirty. Not only do you need to learn these skills, but you need to try to apply them in new domains that you haven't seen before, with new, messy, real-life data, right? Then steps four and five. These are perhaps not the first things that come to your mind, but I'm going to argue that they're equally as important as knowing the skills and having the experience, because they connect you into this rapidly changing and growing community, as well as keep you thinking like a data scientist, okay? Which keeps you a valuable person. So, learn the tools of the trade. I'm going to argue you need to know R and Python. I'm going to say Jupyter Notebooks, and definitely you need to learn the SQL language, okay? Why is this? Well, not only are these the tools, these are the programming languages. They're open source. They're free to everyone. They're fast. They really enable the data science process, and they're used everywhere. Everyone understands that SQL is pervasive, right? Relational databases are pervasive everywhere, and that's for good reason. All these other tools are the same way: R and Python are what data scientists use all over the world, across all organizations. There are always new tools coming through, so it's kind of hard to argue for specific tools, right? As you change companies, you change vendors, and now you have to learn new tools again. So I'm not trying to say learn these tools because I think that they'll be around forever. They're definitely used now, but the skills that you learn will also be applicable in the future, whether or not these tools are specifically used at a particular organization. And it's easy to say, why not COBOL or C++ or SAS and Oracle? And my response will be, well, it's 2015, okay?
These tools are being studied in universities, so if you're going to start a project in your organization, or if you're going to join a new organization that's in line with your own goals and what you're doing, you need to be in alignment with them. So here's another thing: developer time versus CPU time. C++ is very fast, it's blazing fast, right? Well, take the Facebook motto, move fast and break things. R, Python, the notebooks, and SQL allow for very fast, quick iterations, so we can actually keep this process of data science moving quickly and actually gaining insight. Licensing fees, I won't touch on this too much, but sure, getting vendor support is a good option under some circumstances; still, developing in-house expertise is just a plain good business decision for the long term, right? So how do I learn R, you ask? Well, I'm going to tell you: learn R using Swirl. So here's your first resource, swirlstats.com, okay? Swirl is a fantastic resource for you to learn R programming and data science interactively, at your own pace, and it's right in the web browser. This is the kind of example where you can take 30 minutes tonight, December 10, 2015, when you're sitting in your bed or you're sitting on your couch, type in this URL, and I guarantee you, the people that are here, you're interested in data science, you will be engaged and motivated by this website, and you'll be on your way to learning the tools, okay? How about Python? Well, Learn Python the Hard Way. Here's one resource. This is a fantastic website to actually learn the Python language syntax and also computing in general, okay? And How to Think Like a Computer Scientist: Learning with Python, another interactive website where you can learn the actual tools here, the programming languages, and it's also going to teach you the fundamentals of computer science as well.
And these are tools that have been around for a long time; they're hardened, battle-tested if you will, and they work. How about probability and statistics, right? You need to have a background in probability and statistics. Well, go through courses that heavily make use of the above tools, okay? Knowing statistics is much more useful if you can actually apply it, and nowadays we have courses and resources online that are free for everyone to access. Again, 30 minutes tonight, you can start building your own acumen, your own skill set, and making yourself more valuable tonight, right? Here's one with applications in Python. Actually, did I click over? No. Applications in Python: Think Stats, probability and statistics for programmers. And here's a free resource with applications in R: An Introduction to Statistical Learning. Both of these give you more of the mathematical foundations of statistics, and you actually learn by doing. You download data sets and you learn statistics while using the tools that I just presented, which are the same tools used in companies. These are the tools that are being requested by companies, the skill sets that you need to have to be marketable. Okay, third step, get your hands dirty. Take Harvard's free online course, CS109, which is data science, okay? And now you might be thinking, whoa, whoa, whoa, Harvard? Yes. Harvard's free online course. Okay? Take it. You. You, the normal Joe Schmo, will do fantastic if you take this course, okay? It's available to everyone. It's accessible to everyone. You need to have a little bit of background, so the first couple of steps, getting familiar with R and Python and learning some statistics and probability, are a good start. They're fantastic platforms to then jump into this introductory data science course. It's free, of course. Why do I argue that it's the right thing to do?
Well, it puts it all together for you, with solutions to every problem presented, okay? This particular course goes from the beginning of the spectrum to the very end of what a normal workflow might look like on a specific project. From data wrangling, cleaning, and sampling, to get a suitable data set; to data management, which gives you access to data quickly and reliably; exploratory data analysis, to help generate hypotheses and intuition. It goes over, oops, building predictive statistical models such as regression and classification, all the way to the end: communicating the results through visualization, stories, and summaries, okay? And not only does it do this, but it does it in kind of a hand-holding manner, because there are solutions and there are lots of forums available to you, so that, since you don't have too much experience yet, you can go through with a lot of support when you're attacking these problems. And then once you have that background in you, once you have lots of experience from taking this free online course at your own pace, on your own time, now it's time to try getting your hands a little bit dirtier, and you can find data sets and problems on Kaggle.com. This is a data science competition platform. They present data and a problem, and you have to go through and try to solve the problem through whatever manner you possibly can. Now, this is a little less hand-holding than the Harvard course, because you don't per se have solutions. It's messier, like the real world and what you will actually encounter. And this is great training, if you will, for your next steps. Then I say you have to connect with the community. You need to plug into the community of data scientists. Here's a couple of blogs for you. The first one here, 538.
Again, after spending 30 minutes tonight over on swirlstats.com, learning R and going through data science using R online interactively, hop on over to Nate Silver's blog. You'll see that he takes data and makes inferences from it, in kind of a twist on pop culture things, heavily slanted towards politics in a lot of circumstances. But it's very motivating. It's very interesting. And you'll see how great data science can be, not just in your particular organization, but how people are using data science to make the world better. Hunch.net, by John Langford, is a little bit more hardcore. It's definitely a resource for people who have a strong background. Maybe you come from more of a development background, maybe you're already a database architect, and you want to see more of the nitty-gritty of machine learning and data mining. And here's the last one, Visual Complexity. It's fantastic, and you can see how people are storytelling with data. If you've ever run into the case where you have data that supports a certain inference, you can make some sort of statement from this data, but you have people that just don't want to listen, they're hard-headed, well, visualizing the data and showing it through visualization is a fantastic way to communicate it. Here are two people on Twitter. You can Google them as well later tonight when you're sitting in your bed: Hilary Mason and Ryan Rosario. Again, since this is Twitter, it's a little bit faster paced. They show you lots of little clips of very, very interesting and highly motivating examples in the real world. These folks are very well connected in the data science community, so they show you the highlights of data science happening, if you will. Great personal development. And then five, the fifth step: after you've done all this, you want to expand your skills. You want to make them even better. So I say, learn some data science electives. What would be an elective?
Product metrics. What do companies track? What metrics are important? How do companies measure their success? I'm not sure if all of you are familiar with Pinterest, but Pinterest is a fantastic online platform, and they have published, I don't want to say a PDF, but a blog post about the 27 metrics that they use to measure their own growth, so you can go and see what's important to them. And you can see how data science is used within their organization to actually change and grow and build dynamic, interesting, and relevant products for their own customers and for prospective customers as well. How about A/B testing? This is something that big pharma has been doing for decades: the design of randomized clinical trials. Here's a free course there for you to go and sharpen your A/B testing skills. How about user behavior? Who's going to buy your products? Who's going to use your new app? Who's going to buy the next service that your company is going to roll out? Well, understanding human psychology is the science behind that. And then you can also go on to learn big data technology, machine learning, natural language processing, time series analysis; these are all further down the line, after you've built up your skills, if you will. So now let's start looking into actually using data science, to understand it and to see how we can apply it in our own domain, our own specific industry, our own organization. Here's a nonprofit example, where we have a charitable organization that has data on potential donors, and they're interested in whether a person will donate, and if so, how much. This can easily be translated into your own industry, your own organization, your own domain. I'm an insurance company; I have all these products; I want to know who's going to be interested in my products, and which products. Or I give out student loans; I want to know who's going to take my student loan, and for how much. How much are they willing to pay?
How much are they going to insure themselves for? Across industries there are many ways to take this setup, this context, and translate it into your own. First we need to get on the same page. We're going to use data to build predictive models, and we need a common understanding of what a model is. Here I define a model as a simplified description of a system or process to assist predictions. It sounds kind of wordy, but if you just sit there and stare at it for a second, it's actually not that complex. It's just a description of a system to help with predictions. So describing a relationship. Keep in mind: all models are wrong, but some are useful. So let's look at a couple. Here's one. I have a bunch of data on salary and education level, and I'm going to fit a line through this scatter plot of data. This line is my model. It's describing the relationship between how many years of education you have had and what your annual salary is. This is kind of outdated data, but it gets across the point of a model. Now, this model is not correct; it doesn't predict exactly. If I tell you that John has 16 years of education, so he's got a bachelor's degree, can you tell me exactly what his salary is? No. But you can tell me that on average he's making $20,000 more than someone with just a high school degree. So this model is useful. What if a model gets a little bit more complex than just a line? Well, take a random person on the Titanic and find their probability of survival. The model that we're going to use is called a decision tree. We know the sex, the age, the number of family members on board, and whether a person lived or not. And we can actually see, right, we just take each individual and put them down the tree, and they're going to end up in one of these endpoints, what we call a leaf.
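A tree like this is simple enough to write down by hand. Here is a minimal Python sketch encoding the well-known Wikipedia Titanic tree that the slide references; the split values and leaf probabilities are taken from that figure, and the function name is made up for illustration:

```python
def titanic_survival_probability(sex, age, siblings_spouses):
    """Walk the decision tree: each question sends you down a branch
    until you land in a leaf, which holds an estimated probability
    of survival (leaf values from the Wikipedia Titanic tree)."""
    if sex == "female":
        return 0.73  # most women survived
    if age > 9.5:
        return 0.17  # adult males mostly did not survive
    if siblings_spouses > 2.5:
        return 0.05  # young boys from large families
    return 0.89      # young boys from small families
```

In practice you would not hand-code the splits; libraries like R's rpart or Python's scikit-learn learn them from the data automatically, which is the point the talk makes next.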
And we see, for example, that 36% of all people end up in this leaf here. It takes a little bit of time to see how I put these numbers here; it's taken from Wikipedia. But it's very easy to interpret. If you are male, you go to the right. If you're female, you go to the left. Then on age: if you're over nine years old, you're going to go to the left. And you actually get a probability of whether you die or survive. So we can take a random person on the Titanic and give them a probability of whether they would survive or not. It's kind of morbid, but it gets across the point of what a model is, again. Now, how do you build this tree? You don't actually even need to know exactly how these algorithms are implemented, because they're already written in R and Python for you, with lots and lots of community support around them. Okay. So we haven't built any models yet, so let's build some models. How can they be useful for us? This time we're going to start with a clearly defined problem statement. And here we quote John Tukey, a famous statistician: an approximate answer to the right question is worth a great deal more than a precise answer to the wrong question. So what was our question? We have this nonprofit organization; they have data on potential donors, and they want to know whether a person will donate, and if so, how much. This could easily be translated into data on potential customers, where we're interested in whether they'll purchase, and which product. Here's the data we have. We have some demographic data, things like gender and the number of children in the house. Then we also have personal economic data; we actually took data from the census, which is public data, to know things like the median family income in a donor's neighborhood, or the average home value in a donor's neighborhood. And then we meld that with data that comes from the organization itself, right?
This is proprietary data that they had: things like the dollar amount of their lifetime giving to date, what the last gift they gave was and how long ago it was, and what the largest gift they've ever given was. And we're going to use all this data to predict whether someone will be a donor or not, and then how much they will donate. So here we're pulling the data from a MySQL database, in this case, and I show you what the data looks like. I have a note here: if you're pulling from a database and it's very large, you can either pull it directly into a distributed file system, or you can subset it into a sample on your workstation. We'll move on; we won't get into those details too much. But you see that loading data and quickly getting our hands on it is very easy with the tools that I'm preaching here. The code that you see here is R. It's very nice and easy to learn. If you have any kind of development background, you'll pick this up instantly, and it's even easy for non-programmers to learn. Here I'm just loading in the dataset and looking at all the average home values in the entire dataset. And R allows us to quickly and easily manipulate this data. Here I'm transforming the data to look more like a bell curve. So I'm taking all of the different variables on the top row and transforming them to look more like a bell curve. This is an assumption of some of the statistical models that we'll be applying. And keep in mind, now, we want to classify each individual. We have all of this data on potential donors, and we want to classify whether or not they will donate. And there are different models that we can use to achieve this. I'll quickly go through these. Here's one. Now, instead of a scatter plot in two dimensions, we have three dimensions here, and we're finding the plane, what we call a hyperplane, that best separates the donors from the non-donors.
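The "make it look like a bell curve" step mentioned a moment ago is usually just a log transform. A minimal sketch, with made-up home values (the slide's R code would apply this to each skewed column):

```python
import numpy as np

# Dollar-valued variables like home value or income are typically
# right-skewed: a few very large values stretch out the tail. Taking
# logs pulls that tail in so the distribution looks more bell-shaped,
# which is an assumption behind models like LDA and QDA.
# log1p(x) = log(1 + x) also handles zeros gracefully.
home_values = np.array([80_000, 95_000, 120_000, 150_000, 900_000])
log_values = np.log1p(home_values)

# The transform preserves ordering but compresses the outlier's gap.
raw_ratio = home_values.max() / home_values.min()  # 11.25x spread
log_ratio = log_values.max() / log_values.min()    # only ~1.2x spread
```

The same idea is one line in R (`log1p(home_values)`), which is why the talk can show the whole transformation on a single slide.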
Here's another one, linear discriminant analysis, LDA. Again, it's trying to separate the different groups from one another. Here's another way, quadratic discriminant analysis. Instead of just straight lines, now we can use quadratic curves to help separate the different groups. Again, not too much detail here. And so the thing is, we have lots of different models that we can try to use. These are different algorithms to try to decide who's a donor and who's not. And it's actually difficult, especially in high dimensions, when we have 20 different variables on one individual, to separate these out. Here's another one, logistic regression. It's another way that actually gives you a probability of whether someone is going to donate or not. Okay. Now keep this in mind for the rest: we cut our data into two groups, a training data set and a validation data set. The training data set is what we'll actually run our algorithm on, and the validation data set is data that the model never saw when the algorithm created the model. But for both sets we know whether each person donated or not. So we're using part of that data to build a model describing the relationship between those 20 variables and whether they donated or not, and then we use the validation data set to check: does this model work well on data that it's never seen? Okay. Here I show you 23 lines of code, and it's not even all code. We're able to construct all four of those extremely powerful models: SVMs, LDA, QDA, logistic regression, all very fast and easy using these programming languages. Okay. Each model now gives us a probability of being a donor for each individual. And now I want to know, okay, what probability should I use to actually act?
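Those few lines look roughly like this. This is a sketch, not the talk's actual R code: it assumes Python's scikit-learn, and the donor data is replaced with a synthetic stand-in of 20 numeric features and a 0/1 donor flag:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the donor data: 500 people, 20 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Hold out a validation set the models never see during fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "svm": SVC(probability=True),
    "lda": LinearDiscriminantAnalysis(),
    "qda": QuadraticDiscriminantAnalysis(),
    "logistic": LogisticRegression(max_iter=1000),
}
probabilities = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of "donor" (class 1) for each validation individual.
    probabilities[name] = model.predict_proba(X_val)[:, 1]
```

Each model ends up assigning every validation individual its own probability of donating, which is exactly what the cutoff discussion that follows acts on.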
Do I want to reach out and try to contact everyone that has a 90 percent probability of donating? Probably. How about 70 percent? How about 60 percent? What about a 50 percent chance? Right? It costs me money to have my salespeople calling. It costs time and money to have them call every single person. So we want to go for just the people that have the best chances, the highest probability of actually donating. So we want to find a cutoff probability here that gives us the highest expected return, and we know a couple of things in this situation. In this particular situation it costs $2 to send a letter, but this could be 15 minutes to make a call, it could be whatever your specific domain entails. And we know the average donation from past campaigns, and we know their past donation rate, which is about 10 percent. So we can use all this information to tell us which probability cutoff we want to use when contacting potential customers, or donors here in this case. If we just mail everyone, then for every $20 we spend, so that's 10 letters at $2 per letter, we get $14.50 back on average, which is a negative expected return. So we only want to send to those with the highest probability of donating. But again, who do we send to? Just those with an 82% chance? Those with a 50% chance? Let's find a cutoff. So we use the data that we have to find the cutoff probability that gives us the highest profit, and that's right where the curve starts going back down. Sending to any more people past that is just going to give us negative returns, or less of a profit. And each model actually gives us a different max profit achieved. So this is how we determine which model we should actually implement and use on the new data, where we don't know whether they're going to be a donor or not. So far we've only used past historical data; we know who has and who has not donated.
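The arithmetic behind that cutoff is worth making concrete. A back-of-the-envelope sketch using the numbers from the talk ($2 per letter, roughly 10% response, $14.50 average donation); the talk finds the actual cutoff empirically from the profit curve, so treat this as illustration only:

```python
# Mass mailing: mail everyone and the expected return is negative.
cost_per_letter = 2.00
base_response_rate = 0.10
average_donation = 14.50

letters = 10
cost = letters * cost_per_letter                                   # $20.00
expected_return = letters * base_response_rate * average_donation  # $14.50
expected_profit = expected_return - cost                           # -$5.50

# Mailing one individual is worthwhile only when their predicted
# probability of donating covers the letter: p * $14.50 > $2.00.
break_even_probability = cost_per_letter / average_donation        # ~0.138
```

So rather than mailing everyone at a guaranteed loss, the models' predicted probabilities let us mail only the individuals above the break-even point, which is the same logic the profit curve on the slide expresses.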
But we don't know which model to use on the new data of new people that we haven't ever actually reached out to. So we see that, oh, boosting, this model called boosting, seems to be the best model under this circumstance, because it provides the highest profit. Each model says something different. Maybe the SVM says that John has a 59% chance of donating, but the QDA thinks that John has a 53% chance of donating. So they actually make different predictions. And we see that the best one is called boosting. What is boosting? Boosting uses trees, like I showed with the Titanic example, except instead of just making one tree, it creates lots and lots of trees and then combines what they say. So if I create 100 trees and 95 out of 100 of these trees predict that you would donate, I'm going to predict that you'll donate. Okay? When the trees are built randomly and averaged like that, it's called a random forest; boosting is a close cousin that builds the trees one after another, each correcting the last. These are ensemble models that you could learn about. And here's all of the code to actually perform what we've done. Everything that I've done so far is performed using these 25 lines of code, and there's a lot of white space in those 25 lines. Okay? And now we look at our validation set. My model, for each individual, is either going to predict zero for not a donor or one for a donor, right? And we actually know whether they donated or not, so we can see how well it does. 775 predicted zeros actually were zeros. 989 predicted ones actually were ones. It's doing pretty well. We can visualize the performance of the model, and this is useful for data scientists. But we also want to know not only whether they're going to be a donor, but how much we can expect from them. And models can predict this, too. We won't go over this today, but going through the same process, instead of a yes or a no for a group, the output is just a number that can take any value. $22.
$400. Okay? Now, what else can we take from these models? Well, if you just look at this tree, can you tell me which variable is the most important to your survival? Is it sex? Is it age? Is it the number of siblings or spouses that were on board the Titanic? I want to know which variables are most helpful in predicting whether you're going to survive. Just like in the nonprofit organization, I want to know which variables are most informative as to whether you're going to donate or not. And boosted decision trees tell us. Okay? So this little plot here shows the same information as this table. We see from the relative influence that the most informative variable for predicting whether you're a donor, for this particular nonprofit, was actually number of children. Household income was the second most informative, living in region two happened to be very informative, and then being a homeowner. Those are the most important variables. Now the business has a huge amount of power. They know what to go and look for in their campaign of seeking out new donors, right? They're much more informed. How about for predicting donation amount? Well, not surprisingly, amount of most recent gift, amount of largest gift, amount of average gift, and living in region four were the most informative in predicting how much people would actually donate. This helps the business know which data is most important for them to collect. Now we're back in this process again. Okay? We know what to go and look for, and this feeds back into the cycle of getting more data and using it differently. We're going to build new models with better data now, right? Not all data is equal. How about that stubborn manager who says, I think we need to target men. I just know it. I know that men are a better audience for us to target. Right? This is funny.
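The two ideas in this case study, many trees voting on each individual and a relative-influence readout of which variables matter, can be sketched in plain Python. One caveat: growing trees on random resamples and having them vote, as described above, is technically bagging (the random-forest idea); boosting grows its trees sequentially, each correcting the last, though the voting and importance readouts look the same. The one-split "stump" trees and the toy donor data below are assumptions for illustration, not the talk's actual model or data.

```python
import random
from collections import Counter

def train_stump(sample):
    """A one-split 'tree': pick the (feature, threshold) pair that best
    separates donors (label 1) from non-donors (label 0) in the sample."""
    best = None
    n_features = len(sample[0][0])
    for i in range(n_features):
        for x, _ in sample:                      # candidate thresholds
            t = x[i]
            acc = sum((1 if xs[i] >= t else 0) == y for xs, y in sample) / len(sample)
            if best is None or acc > best[0]:
                best = (acc, i, t)
    return best[1], best[2]                      # (feature index, threshold)

def predict(stump, x):
    i, t = stump
    return 1 if x[i] >= t else 0

def vote(forest, x):
    """If e.g. 95 out of 100 trees say 'donor', predict donor."""
    return 1 if sum(predict(s, x) for s in forest) > len(forest) / 2 else 0

random.seed(0)
# Toy data: feature 0 loosely drives donation, feature 1 is pure noise.
data = [([random.random(), random.random()], 0) for _ in range(50)]
data += [([random.random() + 0.7, random.random()], 1) for _ in range(50)]

# Grow 100 stumps on bootstrap resamples of the data, then let them vote.
forest = [train_stump(random.choices(data, k=len(data))) for _ in range(100)]
print(vote(forest, [1.5, 0.2]))   # a clear "donor" point

# A crude "relative influence" readout: how often each feature was the
# one the trees chose to split on. The informative feature dominates.
print(Counter(i for i, _ in forest))
```

The importance tally is the same idea as the relative-influence table in the talk: variables the trees keep reaching for are the ones most informative for the prediction.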
In both the donor and the donation amount case, gender was the least informative variable. It was the least informative for predicting whether you're going to be a donor, and the least informative for predicting how much you were going to donate. So even though gender was the least informative, knowing that is itself informative for the organization, right? That's an actual insight we can take away and use in the organization. Okay. So: we've gone through what is data science, and what's a data scientist. We've gone through actual steps you can take tonight. I'm Charlie Volmer, I'm sitting and just looking at Facebook; I'm going to shut my Facebook, go to swirlstats.com, and go look up Hilary Mason on Twitter, right? We've gone through that. We've seen use cases of actually applying data science to the data within a particular organization, and the types of insight it can glean from the data it has, and even data it doesn't have, like public census data, right? And now I want you to leave here thinking like a data scientist, okay? There are a few bullet points I want you to take away. One: see data as a tool to improve consumer products. This is the whole question of, do I just manage data in my organization, or do I actually harness it? Do I profit from it? Do I capitalize on it? It makes our world better, right? Self-driving cars could eliminate some 200,000 highway crashes and deaths every year. Two: convince others of what's important. Those stubborn managers who are hard to communicate any kind of insight to? We don't just want to use subjective insight and intuition, which a lot of times is very helpful; we want to be able to convince people using objective, data-driven decision-making. Three: know the limitations of your tools. I want to throw that in there as well.
As you go on your data science journey, you will see that a lot of the speculation around machine learning and artificial intelligence, especially advanced things like deep learning, is overblown; it can't do everything, and there isn't one all-encompassing solution, right? That's why it takes a team of people to really understand the inferences drawn from the data, and a team of people to build products that make consumer products better for all of us. Four: receive news with a skeptical eye. Okay. By going through blogs like Nate Silver's, you can see cases where an institution wants to show you data in a certain way; you should now have the tools and the skills to be skeptical of what people say, even when they're using data. Five: satiate your curiosity through data. I don't need to convince you all that data is important, but I really want to stress that you can satiate your curiosity through data. You have some sort of question; you can use data to answer it, the answer is actually well founded, and other people will find it interesting. When you start connecting to those blogs and the Twitter accounts of people doing data science, it will really open your eyes and motivate you in your own personal life, as it did for me. So all of the resources that I gave you, please utilize them. I'll also be speaking: ML Engineering will be at Enterprise Data World giving a short course, elaborating on the concepts outlined here today as well as walking through examples. You'll leave with working code and with skills that are actually marketable today in organizations across the world. I'm happy to have anybody reach out to me personally. This is my personal email.
You can feel free to contact me with any further questions or inquiries, and I'm happy to help guide you on your way stepping into data science. So thank you, everybody, thank you very much for attending today, thank you to Dataversity for hosting us, and thank you, Eric, for helping with this. And yes, I'm open for questions from the audience, please. All right. Well, Charlie, thank you for all that information. Folks, please do drop questions into that Q&A module in the lower right. We do have a few already in there to get us going. Charlie, for the first one, would you please navigate back to slide 25? This is one of the early slides where you were talking about some of the expertise areas someone might need. The question is, and I think particularly in light of your stressing the need for the data science discipline to be addressed by a team: is it better to become an expert in one of these areas, or to have general knowledge of all three or four? Okay. What I would say to that is you definitely want to have an understanding and a grasp of what these tools are, what they provide, and some of their capabilities. But I will stress, let me slide back a couple more slides, like this: each person within the data science team has a unique role, and they provide unique skills and attributes that the team really needs in order to be successful. When we go into a company and try to help them build data science teams, we see that we don't just want a really strong computer programmer, we don't want just a mathematician, and we don't want just the business analyst; they all have their own biases and their own blind spots. When we bring them all together and teach them the ways each of them can support one another, that's where a data science team can really take off and be successful in its organization. Okay. The next question is about a specific tool. Of the programming languages you suggested,
Do you know which one will work best with Salesforce? Okay. So R and Python will both work equally well with Salesforce. Salesforce is a great tool, and I would also recommend that people be versed in Salesforce, just because it's becoming more and more pervasive out there. But these tools play with Salesforce in the same manner, so learning one is not going to be advantageous over the other. Okay. Next question: what is your current favorite data store for ad hoc data projects that require maximum speed and flexibility, but also proper security? Okay. That's a fantastic question, and I'm going to answer it using the fundamental theorem of statistics, which is: it depends. It really is context dependent, and I'm sorry to put that out there. You need to have context, and you need to have certain constraints, and you will have them. So when you are deciding which particular technology to use, you need to understand your constraints and your context to decide which one is best for you. On top of any data store you can build a security layer to whatever extent you need in your organization; it's not as if one will provide it and one will not. But when it comes to speed, it really depends on the type of data you're accessing, the amount of data, how often you're accessing it, even the types of access: are you just writing to the data store, or are you going to be reading from it a lot? Do you have a web application, and BI tools that your managers need to be pulling from your data store all the time? So I'm sorry, but it comes down to: it depends. Well, I think that's a perfectly fair answer, and I appreciate you walking through some of the things one would have to consider in making that determination.
Are there any other things you think someone should really be sure to think about, any other gotchas or red flags in the evaluation process? Definitely. The biggest thing I would say is: don't get caught up in buzz terms, and don't get caught up in the hype. Just because something is perfect for one organization doesn't mean it's right for yours. We see it all the time: we go into an organization, and usually some hard-headed managers are very set, they know exactly what they want, they know exactly what they need, and they tell me, we need this tool. And we tell them: look, please step back a second, breathe, and just tell me, what are your constraints, what is the context here, what type of data, what kind of access, and so on, and then we will actually evaluate. So don't get caught up in, oh, they're using this, so I need to use it too, because that doesn't mean it's going to work best for you. And yeah, let's leave it at that. Okay. The next question: someone said they're fascinated with machine learning and have been listening to Katherine Gorman and Ryan Adams' podcast. What do you think are the current limits of machine learning, and do you think there are limits with data consumption we should impose internally, intentionally, sorry. Okay. This is interesting; I'll answer the first one first. What do I think are the current limits of machine learning? This is hard; I interpret it in two ways: what are the current capabilities, what is possible with machine learning, and then what could be possible from the current technology. That second one especially is the most intriguing to me, because I think anything is possible. Computers have already started to be better than humans at many, many things. And I am not one of those who is afraid of artificial intelligence, even though I think the technology can be used harmfully.
I think that artificial intelligence in and of itself, or machine learning in and of itself, is not a threat to us. Rather, we need to develop systems that help augment human needs, right? When we make our cars smarter and able to lock and unlock better using data and technology, it's better for everyone: fewer cars are stolen, and it's easier for us to get in and out of our cars. But that technology can also be detrimental if used incorrectly. The second question was about data consumption within the organization, correct? Yeah, sorry: do you think there are limits with data consumption that we should impose intentionally? Sure. I think this is touching on the tendency to just collect, willy-nilly, everything we possibly can. And if that's the case, well, it's very cheap and easy to collect and collect and collect. But consider how much energy and resources you want to put into that: if you need to develop and architect a lot of new infrastructure within your organization, if you have firewall constraints and proprietary concerns, or you work in some highly regulated business, then perhaps high levels of resources are being demanded to collect data that may not be that useful. You can do little pilot experiments to see whether the data is useful or not, but it's definitely the case that not all data is equal. Some data you have to really squeeze to get any kind of use out of; other data is just filled with insights, right? So if you can put your resources into getting that smart data, well, that's money much better spent. And a follow-up question on that one: the questioner is now asking, isn't there a moral issue around a person's data? I would say, of course there is. And a lot of people are concerned; they think, oh, Google is watching me, it's giving me suggestions, you know, it's creepy that Amazon is giving me this suggestion.
They think that someone is sitting behind a computer looking at their personal information. But that's definitely not the case. These algorithms really aren't looking into your personal life. Yes, it is a personal recommendation, and it's using your personal history, but it's an algorithm doing this, not a person sitting behind it. And I think that as it becomes more and more useful and helpful in our lives, we will learn to accept some level of intrusion, which it really is. But we need to be talking about this. What is our comfort level? What is okay? And we need to easily be able to say: this is not okay for me. I don't want Facebook to default to that; I want it to default to something else, right? So yeah, that's how I would express that. Fair enough. All right. Well, I'll hold on here for another minute while we wait and see if any other questions come in. While I do that, Charlie, would you please navigate to the last slide? I believe a number of our attendees would like another look at your contact information in case they have questions, and it sounds like you won't be overly protective about your own data. Just a quick reminder, since a question that has come up several times in my chat thread today has been regarding the recording of this session: yes, we are capturing today's webinar as a recording. We will also post the slides alongside the recorded webinar at dataversity.net within two business days, and all of the registrants of today's live webinar will receive an email letting them know when these resources are available. Another question here: should an undergraduate student do a master's in data science? Do you have an opinion on the value of that sort of graduate degree at this point? Sure. I think that's a fantastic and very interesting question, actually.
And it's very interesting to see that nowadays we have the technology, and the resources are out there, such that you can actually choose the path of your own education, which really wasn't so available before. So it's a great question to stop and think: should I get a master's in this? Because there are other manners of studying and learning, and I've presented quite a few of them. Especially in the tech industry, or anywhere in data science, if you can show projects that you've worked on, and you can show that you have the skills and the acumen, well, then showing the degree is a little bit less of a concern. You can still make good money, and a master's degree in data science will cost a lot of money. Now, learning on your own is a skill and a pedagogy that doesn't work for everyone, and so a formal classroom setting and a formal education can be very appealing and very helpful to others. So it's important to ask yourself: what is the best way for me to learn? What are the advantages and disadvantages of these different routes? I want to stress that there are different routes, and you need to earnestly ask yourself which route would be best for you. Wonderful. Charlie Volmer, thank you so much for a lot of information. I know it was a compacted version of what you will be delivering at Enterprise Data World in April; I really appreciate you giving us the time today. Very, very helpful stuff. To remind everyone who is listening: we will again be posting the recorded webinar and the slides to dataversity.net, and you will receive an email on how to access that material. Thank you to all of those who asked questions, and thanks to all of you who attended today. I hope you have a wonderful day and a very happy holiday season. Thanks, Charlie Volmer. Thank you very much, Eric, and thank you to the community.