Data science using the most popular computer language, Python, brought to you by the School for Data Science and Computational Thinking at Stellenbosch University. Data science is a collective term for extracting knowledge from data using modern computer tools. The abundance of data and the rapid growth in compute power, together with powerful free open source software, have all combined to democratize the use of data.
Welcome to this course on data science using Python. My name is Dr. Jean Clapper and I'm a research fellow at the School for Data Science and Computational Thinking. I'm the creator and instructor for many online courses, the largest of which has more than 100,000 participants from all over the world. I've also authored a textbook on statistics. One thing is for sure: I'm passionate about learning from data and about showing others the beauty of extracting knowledge from data. I hope that this passion shines through during this course. I really want you to be passionate about it too.
A week is a short time in which to learn any subject. In deciding what to put into this course, I wanted to be cognizant of the audience. We are all domain experts in our fields, or at least learning to become experts. And for most of us, there is just so much data from which we can extract knowledge. I want to leave you at the end of this week with a full understanding of the power of data science. This means that this is not an easy course. Learning a new topic takes time, effort and practice. A computer language is also much like a spoken language. You don't just pick up a new language and become fluent in it in one week.
This course then aims to be both a reference work packed with information and a personal guided journey. I have created video lectures, multiple PDF documents, practice exercises and solution sets, and we are also going to have online sessions. I really want to support you during this start of your journey.
There's no way that I want to leave you with some superficial, useless start. So I need you to stick with us. Remember, there's no pressure during this week. There's no expectation to understand everything. We are not aiming to have a deep conversation on the meaning of life in Latin after this week. The aim is rather for you to have a full understanding of the potential of data science. I want to guide you towards a future where you can join the massive community of domain experts employing the power of data science in their field.
Python has become the leading language in data science. It is an easy-to-learn language with a very clear syntax. Once you know a little bit of Python code, you can pretty much guess what a new piece of code should look like. Answers to Python-related questions are also everywhere. Any quick Google search will show pages and pages of links to tutorials, videos, discussion boards and so many other resources that will answer all your questions. You will pretty much never get stuck when you need an answer about Python.
A computer language needs a program into which you type your code. In this course, we are going to make use of Google Colaboratory, or Colab for short. Colab is very similar to Google Docs. Once you have a Gmail account, you have access to Google Docs. A Colab notebook looks like a Google Doc. It is just a blank web page in which you can write normal words and sentences, format titles and subtitles, and add pictures and videos. You can also enter Python code, though, and see the results of your code. Using Google Colab means you don't have to install Python on your local machine. It is completely possible for you to do so, though, if you want to. If you have questions about this, we will discuss it during the live sessions.
I mentioned the word notebook. This is what a Colab file is called. There is a series of notebooks that you will have to upload to your Google Drive. Later in this video, I will show you how to do this.
You will also have access to a set of PDF documents that can serve as references to read if you are not into watching lectures. The course is structured so that you need to watch the required video lectures and/or read the documentation each morning, or the evening before if you want to. There is a set of exercise notebooks that you can then attempt after each lecture. At a specified time every afternoon, we will have live online sessions where we work through these exercises and discuss related topics.
There are 14 chapters in this course and I want to tell you a little bit about each of them. In chapter 1, I define modern data science and talk about the software and tools that are available to us. I walk you through Colab notebooks and show you how they work. There is also a small demo using actual data about data scientists from all over the world.
In chapter 2, I talk about data and data types. This includes some of the terms and definitions used in data science. I also touch on the important idea of tidy data. Capturing and cleaning up data is probably one of the biggest and most important tasks in data science.
Chapter 3 introduces Python. We start with simple arithmetic, which is a beautiful and easy introduction to coding in Python. It is a compact chapter that brings you up to speed with the language, with enough information to get you started for the rest of the week.
Chapter 4 is all about importing tabular data. That is data saved as a spreadsheet file, which is the most common way to import data into Python. I introduce you to the pandas package, one of the biggest reasons for the success of Python in data science. So what is a package? Well, Python is a core language with new versions coming out all the time. Because it is an open source language, anyone can add functionality to the language. Thousands and thousands of developers from all over the world add new functionality in the form of packages.
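To make the idea of simple arithmetic and of importing extra functionality concrete, here is a minimal sketch. It is only an illustration, not material from the course notebooks. The `ages` values are made up, and the `statistics` module, which ships with Python, stands in for a third-party package such as pandas so that the sketch runs anywhere. Packages are imported with the same `import` statement.

```python
# Simple arithmetic: Python follows the usual order of operations.
total = 3 + 4 * 2      # multiplication happens before addition
ratio = 7 / 2          # true division always gives a decimal number
whole = 7 // 2         # floor division discards the fractional part
power = 2 ** 10        # exponentiation

# Importing functionality into the active session.
# `statistics` is part of Python itself; it is used here as a stand-in
# for a third-party package, which is imported in exactly the same way.
import statistics

ages = [23, 35, 31, 42, 29]        # hypothetical values for illustration
mean_age = statistics.mean(ages)   # functions are reached via the module name

print(total, ratio, whole, power, mean_age)
```

After the import, everything the module offers is available under its name, which is the same pattern used later with pandas.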
We import those packages into an active session of Python and this greatly expands the capabilities of the language. Pandas is a package designed to work with data. We use it to clean data, manipulate data and extract the parts of the data that are useful to our required analysis.
Chapter 5 is about summarizing data. You know all the basics about means, medians, standard deviations, variance and the like. As humans, it is very difficult for us to extract meaning from massive amounts of numbers and text. Instead, we summarize the data to start understanding what the data is trying to tell us.
Chapter 6 builds on our understanding of our data by visualizing it. Data visualization is core to data science analysis. It allows us to communicate our results and work with others. Python is great for data visualization. There are many packages designed for creating plots and figures. In this course, I'm going to use Plotly. It creates both static plots for printing and interactive plots for the web.
Chapter 7 starts our deeper dive into data science and is about randomness and sampling. I discuss probabilities, random variables and distributions. It is much easier than you think, and I use this chapter to build intuition rather than diving into mathematical equations.
Chapter 8 is about hypothesis testing, the bedrock of the scientific method. I start with two examples using proportions and means. The examples illustrate the concepts of hypothesis testing and how our research questions must be tailored to the use of data to solve our problems.
Chapter 9 is all about the comparison of means. We simulate test statistics using resampling based on our hypothesis. This allows us to understand the likelihood of our results.
Chapter 10 develops your understanding of uncertainty. Uncertainty is a key concept in data science. I discuss bootstrap resampling and confidence intervals, tasks and calculations that are very easy to do in Python.
Chapter 11 introduces linear modeling.
Modeling is key to the work of many data scientists. We use data science to predict an outcome. In linear regression, that outcome is a number. Once again Python comes to our rescue, as the scientific Python package SciPy and other packages such as statsmodels and scikit-learn make the creation of models a very simple task indeed.
Chapter 12 sees a shift in our direction, where I introduce you to the wonderful world of machine learning, the modern approach to artificial intelligence. In chapter 13, I introduce the k-nearest neighbors machine learning algorithm. I go through the whole process involved in a machine learning project.
In the last chapter, I show you how to use a random forest as a machine learning architecture. It has gained much traction lately and is great for tabular data. It is a very interpretable machine learning approach. Now, the cutting edge is the recently released decision forest architecture by Google, using the Yggdrasil C++ library. And that is where I'm going to leave you, right at the cutting edge of data science.
Remember to read the descriptions down below, which will give you an update of which chapters are covered on which days of the course. Before you start on this journey, let's go inside so I can show you how to prepare for this course.
We are indoors. It's a lot warmer, cozier and of course there is much less wind. So I've opened a browser and I'm at www.google.com. It's as simple as that. On the top right-hand side we see Sign in. Once you've clicked on Sign in, this page will appear. Of course, if you do have an account, you can simply sign in. If you don't have an account, click right down here on Create account and follow the simple instructions. Once you've signed in, you'll see these little dots at the top right. When you click on them, you'll see all your Google apps. One of them is your Google Drive. You can simply click on that and it will open your Google Drive. And here is my Google Drive. For this course you'll have access to this file.
Once you've downloaded it to your computer, as you can see, there is a zip file. You can right click on it and, here on a Mac, we go to Open With and Archive Utility, which will extract that file. On a Windows machine you can also right click and choose Extract All. That is going to give you a folder which you can simply drag into your Google Drive. You can also click on New and choose Folder upload. Select the whole folder to upload, not the individual files. Once you're done, you'll have a data science folder in your Google Drive.
Let's right click on it and change the color. I'm going to choose orange. I like to change the colors of my folders so that I know exactly where my folders are by just quickly glancing at the screen. Let's double click on the data science folder, and these are the files you're going to see: Colab notebook files. If we right click on the first one, we go to Open with and Google Colab. And there you go, a Google Colab file. As you can see, there's text and there are titles. It is just a lovely coding environment. There are a few lines of code. We see the results of the code and a beautiful plot.
Inside the data science folder, by the way, and you can see it right up there, we're in My Drive and then in data science, there's a data folder which contains all the spreadsheet files that we're going to work with. There's the exercises folder, which contains the files that you can use just to hone your skills a bit, and then there's also a folder for images. And it really is as simple as that. I can't wait to see you in the course.
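As a small taste of what working with one of those spreadsheet files might look like in a notebook, here is a minimal sketch. It assumes pandas is available, as it is in Colab. The table is a made-up stand-in typed directly into the code, not one of the course data files, and the column names are hypothetical. With a real file you would pass its location to `pd.read_csv` instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical stand-in for a spreadsheet file saved as comma-separated values.
csv_text = """name,country,years_experience
Ana,Brazil,3
Ben,Kenya,7
Chen,China,5
Dina,Egypt,9
"""

# read_csv accepts a file path or a file-like object; StringIO lets us
# treat the text above as if it were a file.
df = pd.read_csv(io.StringIO(csv_text))

# Summarize the data: size of the table and two measures of central tendency.
n_rows, n_cols = df.shape
mean_years = df["years_experience"].mean()
median_years = df["years_experience"].median()

print(df.head())
print(n_rows, n_cols, mean_years, median_years)
```

The same few steps (import the package, read the table, summarize a column) carry over unchanged to larger files.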