This is our first lecture. We're going to take a closer look at Python and at the coding environment that we're going to use, and we're going to see a small data science project. This is a Google Colab notebook. As you can see, it's very well organized. On the left-hand side we have our table of contents, and we have a document that we can scroll down. This Colab notebook lives inside my Google Drive, just like any Google Doc or Google Sheet. If I highlight these, there seem to be little cells, and that's exactly what they are; we're going to get to those. Inside these cells, as you can see, I can put titles, some bullet points and a nice little image. It's very neatly written and looks almost like a Google Doc, but when we scroll down we'll see that there's also some code in there.
Before we start looking at Python, though, just a little word on data science. Data science is all about getting at the information that is hidden in data. We generate so much data these days. We truly live in the information age, and we have access to hardware and software that give us the ability to find the knowledge hidden inside our data. Python has become the leading language as far as data science is concerned, and many leading universities have campus-wide courses on data science using Python. That's exactly what we're aiming for here as well.
So data science is really a science that brings together an array of approaches to using data. It's about the generation of data, the capturing of data, the storage of that data, the verification of that data and the manipulation of that data so that we can get to the story, and then even statistics and modern storytelling with that data through visualization and modern algorithms such as machine learning algorithms. We bring all of that together in data science. Fortunately for us, data science has many common techniques, so it doesn't really matter what real-world situation you use it in, or in what faculty or department you're using it, whether that is astronomy, cosmology, physics or biology, and you see a long list there: healthcare, economics, politics, engineering. It really doesn't matter; we can get access to the information inside data through data science.
The only barrier to entry these days really is access to the internet, because this course is run in Google Drive. You don't have to install anything on your local system; it all runs free of charge in the Google Cloud. What we really have, then, is this idea of putting our voices together to say that access to the internet has become a basic human right, and all governments must, I think, work towards that. It is this freedom of access to information, and to the knowledge inside that information, that can solve so many problems: poverty, hunger, climate change, illness and many of our current and future challenges. We can solve the problems of our species this way, living, as Carl Sagan said, on a mote of dust suspended in a sunbeam. So if you think about it, this is very powerful and exciting stuff.
Now, the tool of data science is of course a computer language, and the language that we've chosen here is Python, but there are many others: the Wolfram Language, R, Julia. There really is an embarrassment of choices.
Python has taken the lead, though, because it is a very powerful language that is easy to learn and to understand, and many contributors throughout the world give their time, mostly free of charge, to expand the language. That has made it very powerful. Python is quite old. It was created by Guido van Rossum way back in the late 1980s, and it is really in the last decade, probably because of its use in data science, that it has become so popular. As I've mentioned, we expand the language through packages, modules and libraries that we can import to extend its capabilities.
Now, to write the language we need a coding environment, some program in which we can write the code. If you want to write a letter, you use Microsoft Word or Google Docs; you need software to write something in. The same goes for a computer language, and in most cases that software is called an IDE, an integrated development environment. What we do with Google Colab, though, is use the web browser as a development environment; we can code right inside a web browser. You can install Python on your local system and code in the browser on your own system, or you can log into your Google Drive and do it all online, as we're going to do here.
Google Colab is built on something called a Jupyter notebook, this idea of a notebook environment. What you see here really is a notebook, inasmuch as when we do some data science or some research we can write normal English sentences, as in a Word document, but we can also write code and look at the results of that code.
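As a small illustration of how an import extends what the language can do, here is a minimal sketch that uses the statistics module that ships with Python. The choice of module and the sample numbers are assumptions made purely for illustration; the notebook in the lecture imports third-party packages instead.

```python
# Core Python has no built-in function for the mean or the standard
# deviation, so we import them from the standard-library statistics module.
from statistics import mean, stdev

# Hypothetical sample values, invented for this illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

average = mean(values)   # arithmetic mean
spread = stdev(values)   # sample standard deviation

print(f"mean = {average:.2f}, standard deviation = {spread:.2f}")
```

Without the import line, the names mean and stdev would simply not exist, which is the sense in which a module expands the language.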
What you have to do is sign up for Google Drive. If you have a Gmail account, you of course already have access to Google Drive. You can always just click on New, and a little drop-down list is going to appear. You can add a new folder, a new Google Doc or a new Google Sheet. You might not immediately see Google Colab there under the More section; you might have to go to this little website here, colab.research.google.com. If you open that, it'll open Google Colab for you, and the next time you open your Google Drive, Google Colab will be there for you.
So let's have a little look at Google Colab and what is under the hood. You've already seen these nice sentences; I can just write them. I'll highlight this one, and you can see it gets highlighted. I'm going to double-click on it, and there you can see it's a cell, and there's all the text that I typed in. I also get the ability to format my text just as in a word processor: I can do bold, italics and underline, and I can put in a link, an image, bullet points or numbered lists. On the right-hand side you'll see that I can move a cell up or down, I can comment on a cell just as you can in Google Docs, and I can delete it. But you can also run the cell, and the way to do that is to hold down Shift and hit Enter or Return. Then that cell gets executed, as we call it. This one is just normal English text, so there's really nothing to execute, and what we see here is a code cell.
Now, if I hover just between two cells there, you see + Code and + Text. That allows me to add a new cell in between these two. If I click on Text, it is now a text cell and I can write my text there. Let's delete that. Or I just hover in between the two and click on Code, and now I have a cell in which I can enter code. It's as simple as that,
just those two things. What we can see here is just some Python code, really simple code; don't worry about it now. You can see what's generated here at the bottom. As the Google Colab notebook stands at the moment, it is just being served up to my web browser; it's not really connected to any Python back end. What we have to do is click on that little Connect button there, and once we click on it, it's going to spin up an instance for us way up in the Google Cloud so that we can use Python right inside our browser. Our browser speaks to Google, the code gets run on Google's side, and we just see the results on our side. Of course everything here is saved, so the next time I open this notebook everything is still going to be here.
So now I can run this. I've already told you about Shift and Enter, or Shift and Return, but there's also this little play button. If I click on that play button, you see the little spinner start. That means my code is sent to Google, it gets executed, and the results get sent back to me. All we've done there is create a little spreadsheet; you can see group A and group B there, and you can see some statistics: how many samples there were, the mean, the standard deviation, the minimum, and then our quantiles and the maximum. I can simply say something like df.boxplot and I'm going to get a box-and-whisker plot of the data that we have. It really is that simple: we just write lines of code, and in between we can put normal English sentences. If I generated this plot and I want to show it to my collaborators or talk about it a little bit, I can either add a comment or write a text cell with some normal language in it. Very, very useful.
So let me open this little cell here, and you can see there's some Markdown in here. Markdown is a very simplified computer language. I want to call it a computer language, but it's not really a
computer language; it just formats text, and under certain circumstances, if you have the right application open, you can format text with this language. You can see there that I've put an underscore just in front of the word italics and another just after it, and that's going to turn that word into italics. Of course I can also do it with these little buttons, but if you get used to Markdown it's actually quite quick and easy to do. Here you see I've used double underscores before and after the word bold, and that turns the word bold. If I hold down Shift and Return, or Shift and Enter, you can see the italics there in my normal English sentence, and the bold. Really easy.
Now, if you work with certain scientific programs, you'll know about LaTeX. LaTeX, or TeX, is really a language that allows us to do mathematical typesetting. You can see an equation there, equation one: the normal distribution. So how did I do that? It's very neatly written and printed out to the screen. Let's double-click on this little cell. What we have is two little dollar symbols to open and two little dollar symbols to close, and what goes inside them is some TeX or LaTeX. That allows for the creation of the nice little mathematical equation that we can see there. Don't worry, we're not interested in those equations at the moment; it's just about showing you what the notebook can do.
Which reminds me: if we go right up to the top, you see our little logo there. If I double-click on it, there's the little line of code, and that's just some HTML code that I popped in there. This image is saved on my Google Drive. I simply got an open link to it, copied and pasted that link, and made a tiny little change to the code; you've got to insert that little bit. That way you can display images right inside your Colab notebook as well, if you have a static image that you want to import.
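For reference, the density of the normal distribution can be typeset between two pairs of dollar symbols in a text cell. The following is only a sketch of what such a cell might contain; the exact source in the lecture's notebook is not shown, so the symbol choices here (mu for the mean, sigma for the standard deviation) are assumptions.

```latex
$$ f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}} $$
```

When the text cell is executed with Shift and Enter, the notebook renders this source as a formatted equation instead of showing the raw symbols.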
So if we have a look around this Google Colab, here on the left-hand side we see a table of contents. There are actually four little icons there. There's a search icon as well, and then there's this little code snippets icon. Now that's very nice: Google has very kindly put a few code snippets there. Let's say I want to import directly from a Google Sheet. If I click on that, there's the code to do it, and in an empty cell you can just hit Insert and that's going to insert that code for us. Very neat indeed. And then of course there is a little file explorer, so you can see what is inside this folder at the moment. I like to keep the table of contents open, because I can just click on it.
So how is this formatted? How does the Colab notebook know what goes in here? Well, that just depends on what I do with these cells. You can see that's normal text in this cell, but this cell, Data science survey example, is quite large text, and if I double-click on it, that's how it's done. This is also a bit of formatting. A single hashtag or pound symbol like that gives the largest text. Two is slightly smaller, then three, four, five, six; you can go up to six, and that'll be the smallest sub-sub-sub-sub-subtitle, if you want to do that. So two would be the second largest. You've got to put a little space between the hashtag or pound symbols and the first letter, and that's going to do it for us. What the table of contents is going to do is pick up on all of those things that are larger than normal text. If I go in there, you'll see there's no pound symbol in front of it. By the way, there's also that little double-T button there; if I click on it, it's just going to add one, two, three, four of those little symbols for us. But we're not interested in that, so that's simply how you do it. Well, let's just take that away. We most definitely don't want that to be a title; that's just a
normal paragraph. So what else can we see here? At the top we have our normal File menu, where you can locate the notebook in Drive, create a new notebook, open a notebook or upload a notebook, all the normal things. Edit has copying and pasting, then there's View, and Insert to insert a new cell. I almost never use those, because we can control what we have to do right here while we're working with the cells. Runtime we would sometimes use. Change runtime type is most important in a course where we use deep neural networks, machine learning or AI: there we also have access to a GPU, a graphics processing unit, which allows us to execute our code much faster. We're not going to be using that in this course, which is just on data science. At the top right there's also a little Comment button, so if I'm in a cell I can leave a comment. If we share this document with each other, we can see those comments just as we would in normal Google Docs, and of course we can share the notebook as well.
So let's do this little data science survey example. This is survey data that's available on the internet. You can download it yourself; you get access to it through a website called Kaggle. If you get into machine learning, the Kaggle website becomes very important in your life. Other than running competitions, in which the competition is really fierce and there's some very big prize money put up by companies that want a problem solved using AI, there's also a lot of data. Once a year Kaggle sends out a survey to people who use Kaggle, most commonly data scientists, and they answer these questions. At the end of every year there's this massive amount of data about all these data scientists. We've downloaded that set of data, and we're going to use it in this little example of what is possible when using Python for your data science. I don't want you to be too concerned about the code here. I just want to show you what is possible, and by the end
of this course you'll be able to do all of this. As I mentioned, there are many packages that allow us to expand what Python can do, and in this first cell we import one of those. It's called SciPy, as you can see there, and it has many modules. One of its modules is the stats module, so I'm going to write the code from scipy import stats. That's going to make all the functionality inside stats available to me, which would not be available in just core Python or base Python. There are also many packages in Python that allow us to create beautiful plots and graphs, and one that I particularly like to use is called Plotly, because it gives us interactive plotting. We'll see what that is all about.
Now, this next little cell says percent load underscore ext, that is %load_ext. The percent sign marks that keyword as a magic command, and we'll see one or two magic commands. This one is very specific to Google Colab, so if I were running a Jupyter notebook on my local system I certainly wouldn't use something like this. All this magic command is going to do is render tables very neatly to the screen, so I always like to use it when I use Google Colab.
Then there's a little module called drive inside the google.colab package, and we're certainly going to use that, because our dataset is also saved on this Google Drive. For security reasons we of course don't want just anyone to have access to all of our work, so you have to give this notebook special permission to access data on your Google Drive, and that's what we do here. I'm going to run this, which will mount my drive, and it is going to ask me to authenticate myself as a user. A new tab is going to open up, and I'll have to sign in with my credentials and give this notebook access to my drive. A little key will come up, which I have to copy and paste into
this notebook. For security reasons I'm just going to do this, and you'll only see what the results are. As you can see, there's a little pop-up there. I can click on that, which takes me to logging in to my Google Drive again, giving permission and getting a key, our authorization key, which I copy and paste into that cell. And there we are: the drive is now mounted.
What I can do now is use one of those little magic commands again, %cd, and that will change the directory. Just as on your computer, where you can go through your directory or folder structure in some explorer or finder, we can do the same here. In this instance, here's the address on this Google Drive of mine where the data is located, and I'm changing directory to that location. That's what that cell does. Inside this whole folder structure, that's where my data is.
The data is in a CSV file. We're going to talk about CSV files, comma-separated values files; that's just a spreadsheet file. That's the name of it, and I'm going to import it so that it's now inside this notebook and I have access to it. Now I'm just going to delete some of the columns, because we certainly don't need them. What we usually do then is print a couple of rows to the screen, and this is what the data looks like. You can see it's very much a spreadsheet that I have imported, and because I used that original magic command it prints very nicely to the screen.
Very importantly, we always want to know how big our dataset is. There's a little attribute called the shape attribute, and it shows me there that I've got 20,036 rows of data. So more than 20,000 people, data scientists from all over the world, responded to this latest survey. The 355 is the number of columns inside the spreadsheet. So let's have a little look at this. Let's look at the qualifications of the respondents. I might be interested to know what qualifications these data scientists from all
over the world have. That's a little line of code that I have to write, and it shows me that the largest group of them have master's degrees, 7,859, whereas 6,978 had bachelor's degrees and 2,302 had doctoral degrees. So most data scientists really do have tertiary education: bachelor's, master's or doctoral degrees. I might be interested in plotting that out, because just looking at numbers is never fun. Looking at a spreadsheet full of numbers is never fun; it's much easier to create this nice little plot. There we can see that the largest group has master's degrees.
This is why I love Plotly: it's interactive. When I hover over these bars I see a little pop-up with information, and we can change what that pop-up shows. The other nice thing about Plotly is that I can zoom in, and then, you see at the top, I can reset the axes and we're back where we were. I can even save this as a PNG to my hard drive, so if I'm writing a report I've got this nice static image.
So it's all about Python and R. Let's have a look at the different languages. I've just written some code there to show us the languages and the total number of people who responded saying which language they're using. You can see Python is by far the most commonly used language there, but we also see R, C and then SQL, which is of course Structured Query Language, for database coding.
Let's look at the coding environments that most people use; that was certainly one of the survey questions. Let's plot that out, and we see that Jupyter notebooks, of which this Google Colab is a version, are the most commonly used coding environment as far as data science is concerned. So if you learn how to code in this environment, you are doing what most data scientists do.
Let's have a look at income. There are these income brackets, which I've just typed out there from the data,
and we're just going to plot that out so that we can have a look at what income these data scientists have. The vast majority of them are there in the 0 to 999 bracket. I think that is about students at universities, who are not really getting a salary for their work, and I think that's where the majority are. But you can certainly see this little spike here at the 100,000 to 125,000 dollars per year mark, and that's certainly a good salary for data scientists.
Now, there's this perpetual problem of women being underrepresented, and of minority groups, by definition, being underrepresented as a minority, which is a real problem and which must be actively addressed. What we've done here is just look at the high-income group. That is everyone above 100,000; that's just the choice that we've made. If they are above 100,000 we'll call them high income, and all the others will be not high income, just to see how many there are. Then, purely for the sake of simplicity, we're just going to compare men and women here, so it's a very simplistic analysis of gender. Let's have a look at a little cross table. So question two, then, would pertain to being in the high-income group or not. If we print that out as fractions, we'll definitely see a difference between men and women as far as the people who answer yes are concerned: 6.6 percent of all the men are in the high-income group, and a much smaller fraction of the women. We can do a chi-square test for dependence between income group and gender, and we see that it is a significant result.
I also took just two countries, the United States of America and South Africa, and isolated those two. Let's have a look at them as far as income for data scientists is concerned. Let's go down there and print this out as a table, and we can see, of course, that there are many more respondents in the United States,
at 2,200, versus 141 in South Africa. When it comes to being in the high-income group, there are of course, proportionally, many more in the United States who are in the high-income group, so salaries for data science are definitely higher in the United States.
Now, we can use a bit of machine learning, a bit of artificial intelligence, to predict which income bracket someone is going to be in. I've selected some of my variables to serve as predictors, the feature variables in my machine learning model here, and we're now going to use random forests, which are built from decision trees. So let's just install that. Sometimes you'll see that Google Colab has not installed all the packages that we commonly use, and sometimes, when there are new ones such as these decision forests from Google's TensorFlow, we have to install them before we can import them. That's also very easy, as we can see here with pip install. As I said, don't worry about any of this code; this is just an example. So let's run that.
Because we've now installed it, we can import it. We've got our classes of high income versus not high income, so we see our two target classes there, No and Yes, and we're just going to separate those out. We're going to do a little split of our data, keeping 30 percent as a test set so that we can see how well our model does. That is what machine learning is all about. At the end of this course we're going to look at machine learning, and we're going to look at random forests and k-nearest neighbors as two examples of machine learning.
What we can see there at the moment is that 92 percent are not in the high-income group. So if I were just to suggest that any observation that comes up is in that group, I'd be correct 92.488 percent of the time, because there's this big class imbalance. What we'd really need to do is work on that imbalance, but we're not going to do that here. All I'm going to try and do is beat that 92 percent with my model. So I'm
instantiating my model there, designing and compiling my model there, and then we're just going to run that model. That's the machine learning: it's going to learn from the data, and we're going to see just how well it does. We split the data initially, taking 30 percent of it out, because that is data that the machine has not seen and could not learn from, so that we can use it in our evaluation, which we run there. It's now evaluated, and we can see that there are some keys available: our loss function, which we'll talk about at the end of the course, and how accurate the model was. If we look at the values, we see that we just about beat the baseline, at 92.789 percent. So a very simple little model was able to perform slightly better than the baseline. What I want to show you here is that machine learning is something complex, but with a few lines of code we can run a machine learning model.
Now, this is a random forest, so it's very interpretable. We can actually plot out how this random forest worked: how did it decide to put someone in the low- or high-income group, as far as that prediction is concerned? We can also ask which of the variables that we chose to go into our model were more important than others in the prediction, and you see that question three did that for us.
So this, in brief, is just a little example, but also an introduction to the language, to data science itself and to the coding environment that we're going to use in this course.
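The idea of beating a majority-class baseline can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the TensorFlow decision forests code from the lecture's notebook. The labels and predictions are invented toy values, and the helper names majority_baseline_accuracy and accuracy are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Toy test set (invented for illustration): 9 "No" and 1 "Yes",
# mimicking the kind of class imbalance described in the lecture.
y_test = ["No"] * 9 + ["Yes"]
# Hypothetical model output: here it labels every observation correctly.
y_pred = ["No"] * 9 + ["Yes"]

baseline = majority_baseline_accuracy(y_test)
model_accuracy = accuracy(y_test, y_pred)

print(f"baseline accuracy: {baseline:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
print("model beats baseline:", model_accuracy > baseline)
```

With heavily imbalanced classes the baseline is already high, which is why a model's accuracy is only meaningful when it is compared against that baseline, as in the 92.488 versus 92.789 percent comparison in the lecture.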