Hello everybody, and welcome to chapter one of Python for Everybody. I'm Charles Severance. I'm your instructor, and I welcome you to this class. The basic goal of this class is to teach everybody how to program, regardless of your background. You don't have to be a math whiz. You don't have to be a computer expert. No matter how old you are or what your background is, we want to teach you how to program. So welcome to the course, and welcome to chapter one. The first thing to understand is that the reason to learn to program is that computers want to do things for us. They are built and designed, and their hardware is set up, so that they basically ask us, "What do you want to do next?" If you grab your phone, your phone sort of does nothing until you tell it what to do. It's just waiting for you, and all the computer technology around you is generally waiting for you. We can use this for useful things. We could play video games. We could have it help navigate our cars someday.
We might even have self-driving cars. And in my mind it's really, in a sense, silly if you spend your whole life not really understanding this technology. I think it's important that we learn to tell these computers what to do rather than just let them increasingly control our lives. As we'll see, computers aren't very smart on their own. We humans are the ones that imbue them with knowledge, and we need to learn to speak their language. It is much easier for us to learn to speak their language than it is for them to learn to speak our language. Although with these cell phones we're starting to see little bits where they can begin to understand, you would be amazed at the 40 or 50 years that it has taken us to understand how to build programs that begin to understand. So I'm bringing you into something where you are going to learn the ways of programming and the ways of the computer, because it's easier to teach you how to program than it is to teach this how to work in your world, even though ultimately the goal is to get this to do work for you. So part of what I'm trying to do is move you from a user perspective, where you just look at the computer as something that someone else has constructed and you are the user of, to the point where you construct new things. Now, the first kinds of things that you're going to construct are actually things to solve your own problems. It's very popular now to work on data, and Python is an excellent programming language for data mining and data analysis, and that's a lot of what we're going to do in this course, although really it's a gateway to all kinds of things like artificial intelligence or gaming or navigation or mobile applications or entertainment. But first you have to learn to program. We have to move from using the computer as a tool to using the tools within the computer that allow us to change how the computer sees the world. So there's a couple of reasons that you might
want to be a programmer. Some of you are looking to improve your career, to be paid to work on programming. I've been a paid programmer most of my life, and I like it. It's a good job. You don't have to stand in the mud, you don't have to lift things, you have to use your brain. And I'll just say that it has been nice for my career to not be exposed to the elements, but to be able to work often wherever I want. But that's actually our secondary goal. Our first goal is to get you to write programs that solve problems that you have to solve. Maybe you have a job as an accountant or a lawyer or something else, and maybe you run across some data. Maybe there's some system that logs your time, and it's not quite giving the report that you want to give. And so you say, could I just grab the log data myself and write a program to do some analysis, to say, well, what's the average of this versus that, or the average of some other thing, right? And so that's the basic idea: you'll initially use computers to serve your own ends. That makes it a lot easier to write programs, because you don't have to worry about a million users using your software. If it works for you, then we're happy. It takes a little more training to write software for other people, or for thousands and thousands of other people. And so part of what I want to do is change your perspective. You look at this from the outside, and you click on things. I want to turn this around, and I want you to be the person inside this looking out at the world. As programmers, we are making things inside these computers for the world. And so we want to pull you into being part of this. We want you inside this, or thinking inside this. And what you learn is that if you're inside this computer, and you are taking your instructions to build programs to be used by the human (oops, almost dropped that), the human outside the computer, you have things that you need to take advantage
of. There's things like the central processing unit, the memory of this system, the network connection of this system, the disk drive or permanent storage on this system. As a programmer, you are kind of mediating between all those internal resources that this has, that are not very smart but highly powerful, and what that user wants, right? And so we take the end user, and we programmers serve the end user, but the computer serves us. So together, between us and all the computer's resources, we can serve the needs of the end user, and we do this by writing code, or programming. Okay, and what is that? Well, programming is writing a sequence of instructions, where we are giving instructions to the resources inside the computer in a way that accomplishes the goals of the end user. And remember, sometimes we are our own end user. You're not always doing a startup. You're not always writing a mobile gaming system. Sometimes you're writing something for yourself, and that's okay. So sometimes you're writing something to solve a problem. You're doing something that you could do by hand or manually, and you're making some clever little 25- or 100-line program. Other times, like when I work on the open-source learning management system Sakai, it is my creativity. I've got an idea, and I want to share it with a million users, and so I write my code for an external audience. And so code is that sequence of instructions. The computer itself doesn't know how to handle a roster, but I can write code that will handle a roster by looking at the data that's inside this computer, inside this application. And so if you think about programs, we have programs for computers and programs for humans. A number of years ago now... I'm starting... sooner or later.
This will be me showing my age. This is an example: the Macarena. The Macarena is a song that effectively is a sequence of instructions. You put your left hand out, you put your right hand out, you put it on the shoulder, you wiggle, wiggle, wiggle, and you spin around. This is a program for people. And so I want you to take a quick look at this and see if you can find anything wrong with this particular program. So look really closely. I'll show you: it's got some typographical errors in it. We as humans are really good at reading or hearing typographical errors and correcting them automatically and instantly, but computers are not. Computers are extremely literal. If it saw this "ham" instead of "hand," it would think, what's a ham, and why am I going to hit someone in the back of the head with a ham? And why would I take my left hand and hit somebody? These are all bad things, but the computer is going to take us very literally. The computer just doesn't know the difference between what we mean and what we say, so we have to be very precise, and this is one of the great frustrations that people have when they first start using computers. We have to get these little bits of text exactly right. Computers will blow up with syntax errors, and they seem to make quite a fuss when you make the tiniest of errors. But you'll get used to that.
I mean, that's not because you're bad or you're less than awesome. It just means that computers can't compensate when you make small mistakes. And so you've got to get used to the fact that the computer is sort of intellectually not as strong as you, and so it gets confused really easily, even though when it gets confused it says seemingly mean things to you. So you'll get used to that. Okay, so the first thing I want to do is throw up some text, and while this text is up I want you to count the number of each word in this text and tell me what the most common word is in this text. Okay, so here we go. Okay, so I kind of made that hard on you on purpose by moving it around and distracting you and confusing you. But even if it's not moving at all, it's a little bit tricky to do. You'd probably stare at it a couple of times, your brain going back and forth and back and forth. Text analysis is one of the great things that computers are very, very good at. They can translate text, for example, and that's because they've looked at a lot of information. So looking at text is actually something computers are really good at. And so let's take a look at the kind of programs that we're going to write to do this kind of thing. This is something that humans are not naturally good at, but computers are super good at. Now, I'm not going to have you look at this code.
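The code on the slide is not reproduced in this transcript. As a rough sketch of the kind of word-counting program being talked about here (not the course's actual code), the following counts each word and reports the most common one. The function name most_common_word and the sample sentence are made up for illustration; a real run would read its text from a file.

```python
def most_common_word(text):
    """Return (word, count) for the most frequent word in text."""
    # Build a histogram: each word mapped to how many times it appears.
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

    # Search the histogram for the largest count.
    best_word, best_count = None, 0
    for word, count in counts.items():
        if count > best_count:
            best_word, best_count = word, count
    return best_word, best_count


# Hypothetical sample text, standing in for the contents of a file.
sample = "the clown ran after the car and the car ran into the tent"
print(most_common_word(sample))
```

The same function works unchanged whether the text holds twenty words or millions, which is the point being made about computers and text.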
This code you will understand in a few weeks, but basically this is a set of instructions to open a file, read all the words in the file, create a histogram of all the words in the file, and then search through that histogram to find the most common word and tell us what the most common word is in the file. In this clown file, the word "the" is the most common, and it appeared seven times. And here's another large file called words.txt, where the word "to" is the most common. Our goal is to get to the point where you can write this on your own, so you can say, you know what, I've got a problem to solve, which is: what's the most common word in this file? I know how to start, I know how to finish, and I know how to do the stuff in the middle. We have to learn this kind of weird language, but when we do, we can count millions of words as easily as we count 20 words. So the fun of all of this is to teach you this language so that you can have the computer solve that problem. You could solve it yourself, but it's not something that you're naturally good at, and it's hard work. So up next, we're going to talk a little bit about the hardware architecture that you're going to be experiencing as you write programs.

Hello and welcome back to hardware architecture. Now, you might ask, why do I tell you about hardware architecture?
You're probably not going to build any hardware, although it's fun stuff to do. If you're going to become a computer scientist, which most of you won't want to be, it's a great thing to study. Those who build our hardware are amazingly talented individuals, and it's a really rewarding job. The reason I like talking to you about hardware is that I want to be able to use words at some point, like secondary storage or central processing unit or random access memory or peripherals or input devices, and I want you to be able to understand them. And so I'll start with a little piece of hardware called the Raspberry Pi. The Raspberry Pi is a cute little single-board computer. As we go forward, these things get smaller and smaller, and the interesting thing is that the architecture of these stays the same, but the number of components drops. So I'm going to start by giving you a block diagram of a generic computer and telling you the major parts of it. I'm going to show you some really old hardware, some really new hardware, and then some hardware that is of medium age. The medium-age hardware is probably the easiest one in which to see the architecture. The architecture is the same, okay? The basic block diagram is that the brains, if there are brains in computers, which there really aren't (the software is the closest thing computers have to brains)... but in hardware, the closest thing to a brain the computer has is what's called a microprocessor or a central processing unit. This is designed, three billion times a second, to ask the question, "What do you want me to do next?" These little pins on the back are how our instructions go in, like 32 or 64 of these pins.
Three billion times a second, we send an instruction into these things. Now, we can't sit there and talk to it, and so we store the instructions in what's called the main memory. This memory is really fast, and the memory sort of feeds this. Every time the CPU needs a new instruction, it asks the memory where that instruction is, and the memory feeds the instruction to the CPU. The CPU does it and says, "Give me another instruction." The CPU does it: "Give me another instruction." That is the basic essence of programming. This asks what's next, and this is where your program is stored, or a program you purchased or that came with your hardware. So your programs end up inside this memory, and in software you tend to program the CPU. If you had bought a desktop computer a number of years back, it would have this thing called the motherboard. The motherboard is called this because it kind of connects all the components together. If you buy memory by itself, it does nothing, but it has a place to plug into the motherboard. If you buy a microprocessor, it has a place to plug into the motherboard. And if you buy a hard drive (this is a really old hard drive), it has a place to plug in on the motherboard. So the motherboard sort of connects everything together. The hard drive is secondary storage. Now, here's how secondary storage is different from the main memory, which... there it is, I've got to unpile this stuff. This main memory is really fast, but as soon as you turn the power off, what's in this memory sort of vanishes. And so to store files like word processing files or text files or whatever, you've got to store them on something that lasts a little bit longer. And so that's the purpose of the secondary storage.
It's permanent. When the power's off, it stores it. Now, this one here is in such bad shape that it probably isn't storing anything, but it's got these little heads, and it spins around, and the heads go in and out. We'll have a video later that shows you one of these things that's not in quite as bad a shape. If you look, this has four different platters that are all spinning around, and this is just using magnetic material and electronics that magnetize and demagnetize this stuff. Physical disks are often rated in revolutions per minute, and that's how many times this thing spins around. If you've got an old desktop and you hear it spin up, this is the thing that's spinning. It's the place where your operating system, your files, and your applications live while they're stored and while the computer's turned off. Then they're loaded into this while they're running, and this CPU takes the data from the main memory, and your program runs at three billion operations per second. So let's talk a little bit about this, which is probably from the 1960s or 70s. If you're an electrical person: it has capacitors (those little silver things are capacitors), these little colored things are resistors, and that's more capacitors. And then there's wires, and wires move everything. And so when you say, like, this has millions of transistors... oh wait, that isn't a capacitor. That's a transistor.
That's a transistor. When you say that, this here has them etched on, and if you look closely at this (go look at a picture of a microprocessor online), you will see that it has millions of these. And so the difference between 1960 and today is that this circuitry of capacitors, resistors, and transistors has been miniaturized and put onto this using a photographic process, and they're making them tinier and tinier and putting more and more on. And if you think going from millions of these to one of these is crazy, the thing that's happening now, and the reason we have whole computers inside our pocket, is that all of this, the CPU, the memory, everything connected, and the storage, is being made smaller and smaller. And so this little single-board computer called a Raspberry Pi has one thing in it. It has the main memory, it has the CPU, and it has connections for things like peripherals, like keyboards and stuff. Now, it doesn't yet have secondary storage on it. The secondary storage gets plugged in right here via USB. And then if you take it one step farther, to my phone, it's got the secondary storage built right in. So this picture goes from the size of cabinets in the old days all the way down to really tiny, but at the end of the day, inside it is a highly sophisticated piece of circuitry that asks for instructions one at a time, and main memory that holds the instructions and feeds them. Okay. Take a look here: the central processor does the thinking. It runs the program. It's what's asking, "What's next?"
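To make that "what's next?" loop concrete, here is a toy model written in Python. It is only a sketch of the idea, not how real hardware is built: main memory is a list of made-up instructions (the names LOAD, ADD, PRINT, and HALT are invented for this example), and the "CPU" is a loop that fetches one instruction at a time and carries it out.

```python
def run(memory):
    """Toy CPU: fetch one instruction at a time from memory and execute it."""
    accumulator = 0   # the single value this toy CPU works on
    output = []       # what the program "prints"
    pc = 0            # program counter: where the next instruction lives

    while True:
        op, arg = memory[pc]   # ask memory: what's next?
        pc += 1
        if op == "LOAD":
            accumulator = arg
        elif op == "ADD":
            accumulator += arg
        elif op == "PRINT":
            output.append(accumulator)
        elif op == "HALT":
            return output
        else:
            raise ValueError(f"unknown instruction: {op}")


# A stored program: the answers to "what's next?", written down in advance.
program = [("LOAD", 2), ("ADD", 3), ("PRINT", None), ("HALT", None)]
print(run(program))
```

The loop itself never gets smarter; everything it does comes from the list of instructions it is fed.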
It's not really smart, but it's really fast, and so we compensate for the lack of intelligence of this thing by writing really good software that runs really fast. Voice recognition on things like phones is possible because computers have so much storage and run so fast, and the algorithms that do voice recognition are finally starting to work. Input devices like keyboards and mice and pens and whatever are how things come in. Output devices are like the screens that we see. The main memory is the fast part of the computer that stores all the programs, and the secondary memory is the permanent storage. Increasingly, secondary memory... do I have any USB sticks in here? I don't. Well, increasingly, secondary memory is flash memory, solid-state storage with no moving parts. And so in a few years you'll not even be able to see secondary memory with moving parts, but that's okay. It's still secondary memory. It's still memory that lasts. And where your place is in here: you live in the main memory. This is you. You are here. In a sense, when the CPU asks the question "What next?", it is your job to answer that, and you answer that by writing Python code. You'll write a file of Python code, blah blah blah blah blah, and then that Python code sort of gets loaded into main memory.
There's a magic translation process that happens, and then your code is actually answering this question three billion times a second. This is you. You're really out here, but you write a file, the file's loaded in, and then the file runs. That's how things work, and that's your place in the world. Now, what's actually running is not Python code. There is, as I said, a translation process: you write a Python file, and then Python itself translates this into the actual language known by the microprocessor, which is a series of zeros and ones called machine language. Someday I would love to teach you a class on machine language, but for now we're going to teach you Python, and we're going to use Python as a crutch so we don't have to talk machine language. If you really wanted to, you could learn how to write machine language, but I assure you Python is far easier to learn than machine language. So Python acts as a translator. It translates what you're doing into machine language, and then the machine language is what's sent back and forth. But even though it's translated to machine language, it is still you answering those questions. That's what a program is: it's you pre-storing your response to the "what next?" question over and over again. So here's a couple of videos that you can look at on YouTube. One is about a CPU that looks very much like this CPU that I've got with me. These CPUs run at extremely high heat. When you put your computer on your lap and it starts to heat up, that means it's thinking really, really hard. This is a little old video from a long time ago that shows what happens when you take away the cooling capability of microprocessors, and just how hot they can be. The other video that I have is of a hard disk, something like this hard disk that I have, except that it works, and they turn the power on. Some of them last for a few seconds, some of them last for a
few minutes. I must be allergic to this hard drive, or maybe it's because there's dust in this hard drive and I keep spinning it, and I sneeze. But basically some of them last for a few seconds, some of them last for a few minutes. It's not a good idea to open them up, but I'm glad somebody opened one up, did what they did, and recorded it so we can all enjoy what these are capable of doing. Okay, so that's a quick introduction to hardware, mostly so that I can use those words going forward. What we're going to talk about next is communicating in the language Python, that is, writing code and putting it into the computer so that it can execute.

Welcome to my video that shows how to get started and install Python on Microsoft Windows. It's not too hard. We're going to install Python 3, and we're going to install a text editor. So I'm just going to go into Google and search for "install Python 3," and my top link is Downloading Python, and there is my link for downloading Python 3.5.2. This version of my class uses Python 3. I have an earlier class that you may have seen that uses Python 2, but in this class we're going to use Python 3. Now, it might take you a while to download this. I've actually already downloaded it. The other thing we need is a programmer text editor, and you can really use any programmer text editor. We've used Notepad++ in the past. We've used jEdit in the past. I like Atom, atom.io, mostly because it works the same on Windows and Mac and Linux. But you can really use any text editor that you like. Just don't use Word or the TextEdit that comes with the operating system. You need a programmer's editor that doesn't mess with weird characters or weird lines or strange formats. You must have a real programmer editor. I've already downloaded this as well, so I won't waste the time waiting for it to download. Let's go ahead and do the installation.
So these things ended up in my Downloads folder, so I go to Downloads and I'll start installing Python 3.5.2. Now it's going to ask me some things. Add Python 3.5 to the PATH, and that's a good idea. Install the launcher for all users: I'm going to add that; maybe you will, maybe you won't. It's going to tell me where it's going to install it. Install. Of course, it's going to ask me for permission to do these things, and now it's running through the installation. Okay, so there we go. You could maybe click on this online tutorial and documentation, but we're just going to close this, and I'm going to start and run the Windows command line. Now, you may have all kinds of fancy ways to run Python, but I like running the command line: c-o-m-m-a-n-d. I like running the command line because after a while it's important to know what folder things are being run in. And so here's this command line, and I should be able to type python here, and so now I'm in Python 3.5.2. This chevron prompt here is the Python interpreter, where it's asking for Python commands, and I can say print('Hello world'). Of course, this is what we tend to print all the time. I can make a mistake. I can type some nonsense, right, and it'll complain to me. Now, to get out of this I can either type Ctrl-Z or quit(). In this case I'm going to type Ctrl-Z, and I'm back to the prompt. A couple of things: I can do a dir to see what folders and files I have, and there is my Desktop. And then the cd command tells me where I'm at in the folders. That means I'm in the Users directory, Dr. Chuck. Okay, so I have now installed Python. I ran the Python interpreter to verify it. I said print('Hello world'). And so now what I'm going to do is actually install Atom. I already had this downloaded, so let's go ahead and install Atom on my computer. So Atom is now installed and is kind of telling us what to do.
I'm going to actually just close all these windows, close everything, and I'm going to create a file. Let's see if I can make this bigger. So I'm going to type print('Hello from a file'). Okay, and I'm going to save this. I'm going to say File, Save As, and what I'm going to do is go to my desktop and make a folder on the desktop. I'm going to call this folder py4e. So I now have a folder on the desktop. I'll move this here, oops, and I'm going to go into py4e, and then I'm going to name this file first.py. You'll notice that when I save this, it syntax-highlighted it. That's one of the nice things about a programmer editor. It says, oh, it's got a suffix of .py, so it knows that it's supposed to look pretty with Python and make this one color and this another color. The other thing that you'll notice is I now have a folder called py4e. Let me start the command line up again; I'll show you how to start the command line again: command. Now if I do a dir, I see the folders, and one of the folders that you can see here is the Desktop folder. So I'm going to say cd Desktop, and then I type a dir command to see what folders are in the desktop. These folders are the same as these folders. These things are kind of virtual folders. py4e is py4e. Now I can type cd, which stands for change directory, py4e, and I can do a dir, and I see first.py. That's the same as if I'm diving into this folder. Here's this file, first.py. Windows hides the suffix, which is somewhat annoying and frustrating, but that suffix is there, and that file is there. One of the things you've got to figure out in Windows is how to make sure that you are in the same folder: Users, Dr.
Chuck, Desktop, py4e. That's this name in this file, and here as well. And now I'm going to run this program. I'm going to type python first.py, and you see that it ran the Python code. Another way you can do this is to type just first.py, and that's because this file association has happened in Windows. This doesn't work on Macintosh. This only works in Windows: all files with .py are expected to be Python, and it knows where the Python interpreter is to run them. Okay, and so I've got Python 3 installed, and that gets me started. I hope that this little introduction about getting things started and writing your first Python program has been helpful to you.

We're going to download and install Python 3 from python.org on a Macintosh. Your Macintosh for years has wonderfully come with Python 2, so if I type python --version, I see that I've got Python 2. What we want to do is, in addition, install Python 3. One of these days Macintosh might upgrade their distributed version of Python to Python 3, but there are so many things inside the Mac that depend on Python 2 that I'm going to expect it will always be named python3, which is what we're going to call it in a second. So here I am at the python.org downloads, and I'm going to download Python 3. You click here, and I've actually got it sitting here in Downloads already, because I always do that. So I'm going to install this. There is the installer. I'm going to say continue, continue, continue. Of course I agree; I read all that really fast. And now I'm going to install it. Okay. So now that means if I run a terminal (this, of course, is how I start and run Terminal), Python 2 is still there, but python3 is also now available there. So we should have Python 3 installed. So we installed Python 3.6. And so there we go.
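Besides checking the version from the command line, you can ask from inside Python which interpreter is actually running. This is a small sketch using the standard sys module; the helper name is_python3 is made up for this example.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3


# Show the version number of the interpreter running this file.
print(sys.version.split()[0])
print(is_python3())
```

Saved to a file and run once with python and once with python3, this should report the two different installations described above, assuming both are present.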
And that's all it takes to install Python 3 on the Macintosh. So let's write our first little Python program. I like Atom, and so I've got this Atom editor. It's atom.io, right here: go to atom.io, then download and install the Atom editor. I like it because Atom works the same on Windows, Mac, and Linux, and it has syntax highlighting, and I really like things like that. So I'm going to make myself a simple Python program, hello world, like we always do. Now, you'll notice that it's not syntax highlighting yet, but I'm going to do a File, Save... let's see, File, Save As, and I'm going to go into my desktop, and I want to make a folder called py4e, and I'm going to call this hello.py. Oh crud, got to rename it. Rename it. I ended up with two dots. hello.py, there we are. And so now I'm here, and I'm in my home folder. I can go into my desktop, and I can go into that new folder I made, Python for Everybody, and I can see the files. Now, there are other ways to run this, but I really want you to learn the terminal so that you really know what you're doing. And so here we are. We are in the folder that has the Python file, and then all we do to run it is say python3 hello.py, and there we go. Of course this is Python 3, because I'm using parentheses there instead of just double quotes. But Python 2 is still there, and of course if you just run python hello.py, it'll be a syntax error. Or not. They must have added something. Yeah, because python is still version 2, but apparently they allowed print with parentheses in the latest version of Python 2, so away we go. Okay, so again, thanks for watching. I hope this was helpful to you to get Python 3 installed on your Macintosh.

Hello, and welcome back to Python as a language.
You'll notice that I'm wearing a hat, and part of the story of the hat is that where I work, here at the University of Michigan School of Information, my office is in this building called North Quad. We call it Quadwarts sometimes, because it's sort of got a square; it sort of imitates an Oxford quad. So it seemed to me to evoke notions of Harry Potter, and when we first moved into the building, I joked in one of my classes that we should have a sorting ceremony for all the students as they come into North Quad for the first time. That was cool, and I thought that I would belong in Gryffindor. Everyone wants to be in Gryffindor, right? They're the good guys. And my students told me that I couldn't be in Gryffindor, that I had to be in Slytherin. So you'll see me drinking tea throughout the course out of this teacup. It's my Slytherin teacup. I picked that up from Harry Potter World; I went down to Florida and visited Harry Potter World. The reason that I was sorted by my students into Slytherin is also that I teach Python, and a python is a snake. If you think about the people from Slytherin, they are capable of talking to snakes, and the class where we were doing the sorting was a Python class, so it sort of made perfect sense that you would have to be in Slytherin if you were the Python teacher. And of course, your name is Charles Severance, and that sounds kind of like Severus Snape. So I just accepted that I'm in Slytherin. Okay, so you all can be in Gryffindor, but I can't. I'm in Slytherin. So I'm the bad guy, or the good guy; it depends on how you look at it, right? And so what I'm going to do now is bring you into Slytherin as well, because I'm going to teach you the Python language. Python is the language that we Pythonistas talk. It was invented over 20 years ago by a fellow named Guido van Rossum, and away we go. Now, even though I'm using this whole snake and Slytherin thing, it turns out
that Python was not at all named for Harry Potter, because Python was invented several years before Harry Potter was created, and it wasn't named for the snake. Monty Python's Flying Circus was actually the inspiration for the name Python, because Guido van Rossum really wanted to create a programming language that, while it was in its very nature a very powerful language underneath, was also fun and approachable. And that's why Python recently has become so absolutely popular. It's easy to learn, but it's also powerful, and that's sort of the magic of Python: the ease of learning it, the brevity of the programs, and the power. And so we are going to become Pythonistas.
Now, as you learn to be a software developer using the Python programming language, you are going to encounter syntax errors. I remember my first programming class. I would type on cards, and I would upload those cards to the computer, and the computer would say, you're not worthy. And I'm like, wait a sec, those are pretty good cards. How could you be so critical of me? It would say syntax error, and I really got a bad attitude that somehow this computer didn't like me. I would make cards and it would complain; I would make changes to the cards and it would still complain; I'd make more changes and it would still complain. I'm like, how can I win in this situation? And you're going to feel the same thing. You're absolutely going to feel the same thing. You're going to be struggling. You're going to be like, how come this computer hates me? Let me tell you right now: the computer doesn't hate you.
The computer actually loves you. It just is not very good at showing how it loves you, or telling you how or why it loves you. And so syntax errors are not so much Python telling you that you're bad, or that you're an inadequate programmer, or that you should find something else to do. It's really Python's admission that it doesn't understand what you're trying to say. You've got to get used to that, and it's frustrating, but you've got to get used to the fact that syntax errors are your friend. Python is saying: hey, I got to line 7, and I was doing fine up to line 7, but boy, in line 7 there's some little thing. I don't know what the word else means in this context, or you didn't indent it, and so I'm kind of confused. What did you mean? Please, please, please help me. And so it's so much easier for you to learn Python than it is for Python to figure out what you mean when you're writing code.
We have a number of different ways to encode our instructions when we talk to Python. One is that we just run Python interactively on our computer. Hopefully by now you've got it installed, and you just type python at a command prompt, whether that's a Windows command prompt, a Linux command prompt, or a Macintosh command prompt. I've got some examples of how to get this all started, get Python installed, and away you go. Now, you'll notice that when you run the Python interpreter you get the three-chevron prompt (>>>). Python is asking you, what next? It's saying, I want to talk to you. I want you to tell me some Python to do. If you know the Python language, you know what to say right here. If you know Python, you can type these statements. You can say x = 1, which really means go find a little piece of memory, label it x, and stick 1 in it. print(x) is like, go find that thing that you labeled x, bring me back that number, and tell me what I stored in there. Now, why would you want to do this? That's a different question.
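As a rough sketch, here are the statements from this interactive session collected into a small script. At the >>> prompt you would type them one at a time; the third statement is the one explained next. The comments are a paraphrase of the lecture, and the closing quit() is left out so the sketch can run as a file.

```python
# Statements you might type at the >>> prompt, collected here as a small script.
x = 1          # find a piece of memory, label it x, and store 1 in it
print(x)       # look up x and show what is stored there
x = x + 1      # take the old value of x, add 1, and store the result back in x
print(x)       # look up x again and show the new value
```

Typing the same lines interactively is a good way to check that each one does what you expect.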
These are very simple things. It's going to take you a while to get the big picture of why we're doing this, so just trust me that you want to learn these statements, and later we will successfully turn those into a program. The third line there, x = x + 1, is not as it seems in math. It basically says, hey, go grab the old value of x, add 1 to it, and stick it back in x. That's what that means, so the equal sign really has kind of an arrow to it. Then we say, hey, go look up that x thing that we just changed and print that out. And then we say quit(). So that's us talking to Python. Now, you can type just about any crazy stuff you want in here, and Python will be unhappy and tell you so. What we're going to do next is start talking about the actual language of Python, and what it is that we have to say to make Python happy when we're talking to it.
So now we're going to start learning the actual Python language. What do we say? You can think of this as almost like writing a story. We're going to start with a basic vocabulary. We're going to talk a little bit about lines, or sentences. And then we're going to start talking about how to put those sentences together to make a coherent paragraph, as it were. You just have to accept the fact that when I start teaching you this stuff, it's not going to make sense for about six or seven more chapters, so just bear with me. Accept it. I remember when I first learned, it went from me being confused, confused, confused, confused, to: holy mackerel, this is awesome. I expect many of you will go through that same thing. So just learn the first parts, accept the fact that it doesn't necessarily make sense in a big picture, and just bear with us.
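To make the earlier point about syntax errors concrete, here is a minimal sketch of Python admitting it is confused. The broken line "x = = 1" is a made-up typo, not something from the lecture. The built-in compile() asks Python to read text as code without running it, which lets the sketch catch the complaint instead of stopping.

```python
# A made-up typo for illustration: a doubled equal sign.
bad_code = "x = = 1"

try:
    # compile() reads the text as Python code without running it.
    compile(bad_code, "<example>", "exec")
    understood = True
except SyntaxError as err:
    # Python is not judging us; it is reporting where it got confused.
    understood = False
    print("Python got confused on line", err.lineno)
```

At the interactive prompt you would simply see a SyntaxError message pointing at the line; the try/except here is only so the sketch keeps going.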
Okay, so we'll start with vocabulary, we'll start to make sentences, and then we'll have little short stories and paragraphs. This is a short story, in Python, about how to count words. It's got a couple of paragraphs, and we are going to look at all of this stuff eventually. We start with a set of reserved words. What are reserved words? Well, they're words where, when you use them, Python expects them to mean exactly what Python wants them to mean. What that really means is that you're not allowed to use them for any other purpose than the purpose that Python wants. It's sort of part of the contract. It's like when you have a dog and you say, what did you think of that television program? The dog has no idea what you're saying. Then you say, do you want to wait until Saturday to go to the veterinarian? The dog still doesn't know what you're saying. Then you say, how would you like to take a walk? And the dog goes, walk! I know what that means, and heads for the door. The way the dog hears you is: blah blah blah walk, blah blah blah food, blah blah blah treat, blah blah blah walk. That's kind of how Python looks at these reserved words. When you say class, it goes, class! Oh, I know what that means. Now if I say zap, it's like, oh, zap is something that you get to decide; maybe it's a variable name. So reserved words are simply words that have a fixed meaning when you use them in Python, and there are only a few of them, like and, or, del, if, maybe pass, maybe in. A lot of these you won't end up using. These are just reserved for Python, and they're part of the Python vocabulary.
Now, when we move from words to sentences, you see that a Python program is a series of lines, a series of statements. They have an order, because the computer wants to know: what next, what next, what next? So what's next is: start at the beginning.
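If you want to see the reserved words for yourself, here is a small sketch using the standard library's keyword module. The words checked are the ones mentioned above; the value 42 stored in zap is just a placeholder.

```python
import keyword

# keyword.kwlist is the standard library's list of Python's reserved words.
print(keyword.kwlist)

# Words mentioned in the lecture: each of these is reserved.
for word in ["class", "and", "or", "del", "if", "pass", "in"]:
    print(word, keyword.iskeyword(word))

# "zap" is not reserved, so you get to decide what it means.
zap = 42   # placeholder value
print("zap", keyword.iskeyword("zap"))
```

The exact list printed depends on your Python version, since a few reserved words have been added over time.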
So I already talked about an assignment statement, one that basically says x = 2. This is not a mathematical statement. This is a directive that says: take this value, this constant 2, and stick it in a location in your memory, and remember that I asked you to name it x. x is a variable, something you made up; you chose that name, but it's Python's job to remember it. Then this next line says: go to wherever that x is. There's a 2 in there now. Pull that value back out, add 2 to it, which makes it 4, and stick it back in x. So x is 4, and print(x) says: go look up that thing that was called x and print it out. So each line has something to it. With print I'm using a reserved word. Well, actually that's a built-in function rather than a reserved word, but Python knows what it means in the same kind of way. So there are reserved words and functions, and you combine these with operators. Plus is an operator, equals is an operator; these things do things, and we'll learn all this stuff in time. Those are the basic building blocks of lines of Python.
Now, as we take these lines of Python and build them up, we end up making paragraphs; we program in paragraphs. One thing that's important: I showed you how to do interactive Python, where you just type python and then type a statement, and a statement, and a statement. That gets really tiring after about three or four lines of Python, because you start making mistakes and you have to start over. So the better thing to do, as your program gets a little larger, is to write a script. Put your Python instructions in a file, and then tell Python to read from the file and run the script as it's entered in that file. We tend to name these files with a .py suffix, and I've got a series of videos that you can watch to figure out how this all works. Like I said, you can type interactively to Python, and it's a great way to experiment and check whether a statement does what you think it does. But a script is the way to go once we are past one or two lines
of code. We write it in files and then run it separately. There are a couple of basic patterns, and it's really important to understand each of them. Like I said, we'll teach you these patterns separately and then we'll combine them together, and when you combine them together is when you say, oh, that's what a program is. So you have to suspend disbelief. We have a few different patterns. One is a sequence of steps: do this, then do this, then do this. Conditional is like skipping something. Repeated does something over and over and over again. Computers are really good at repeating stuff, much better than people; people get tired of doing the same thing over and over. And then we have stored steps that we can reuse as well.
So let's take a look at a Python program. This is a piece of code, a little script. If you typed this Python code into a file and ran it, it starts at the beginning and then it goes to the next line, and the next line, and the next line. Python executes the script in the order you wrote it. So it says: find a place in your memory called x and stick 2 into it. Okay, then go to the next one: print that out. So the program is producing output. Now go read x, add 2 to it, and stick it back in x, so x is 4; then print that. This side over here is called a flow chart. I'm not going to make you draw flow charts. I'm only going to draw them a few times, in ways that I think will help you. But you can think of it as Python finishing something and going on to the next one unless you tell it otherwise. It finishes this and goes on to the next one, finishes this and goes on to the next one, finishes this, and now the program is all done. And so that's sequential steps.
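The sequential script just described can be sketched like this. The statements follow the walkthrough above; the file name in the comment is only a suggestion.

```python
# Sequential steps: save as a file (for example first.py) and run it.
x = 2          # store 2 in x
print(x)       # the program produces its first output
x = x + 2      # read x, add 2, store the result back in x
print(x)       # the program produces its second output, then it is done
```

Python runs the four lines top to bottom, one after another, with no choices and no repetition.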
You just type them in and Python runs them. They're important, but sort of uninteresting, because you can only get so far with them. You can't really make them intelligent, because Python is always going to do the next one. So the next thing we do is what are called conditional steps, and this is where it starts getting intelligent, where you are able to encode your brain into the computer: oh wait a sec, let's only do this step if something is true. The syntax that we use here is the reserved word if. The if is like a little fork in the road. You can go one way or you can go another way, and you're asking a question. Inside the if statement right here, there is a question: is x less than 10? That resolves to true or false. If x is less than 10, that's true. If x is 10 or greater, it's false. Then, if it's less than 10, we have this indented block of code. There's also this colon, which tells us we're at the beginning of an indented block of code. What it basically says is: if this is true, run that code; if it's false, skip that code. So it can either run it or skip it, depending on the question that's being asked. Now if you look at this code, it's pretty obvious what's going on. It comes down, and x is 5. Is x less than 10? That's true, so it runs this code and prints out Smaller. Then it de-indents and comes back to the next basic sequential step, which ends up being another block: if x is greater than 20. This turns out to be false, because x is 5, and so it skips this block. So Bigger never comes out. Then it continues on and prints Finis. Oops, that's a typographical error on the slide; make that a lowercase print. So it comes in, runs this, skips this, and then finishes. Okay, so here is the last one we'll talk about: the repeated steps.
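Before moving on, here is a sketch of the conditional program walked through above, with the lowercase print fixed. The values and messages are the ones from the walkthrough.

```python
x = 5

if x < 10:             # the question: is x less than 10? True, so run the block
    print('Smaller')   # indented block: runs only when the question is true

if x > 20:             # the question: is x greater than 20? False, so skip
    print('Bigger')    # never runs while x is 5

print('Finis')         # de-indented: always runs
```

Try changing x to 25 and running it again to see the other branch taken.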
We'll get back to store and retrieve later; for now we're just going to talk about three of the four. This is another program, and the key is that we're going to use this same kind of choice where we go in, but then we're going to run for a while, and then we'll have an exit condition where we get out. So this is repeated over and over and over again, and this is the essence of how we make computers do things that are seemingly difficult, the things that are more naturally difficult for people. So how do we encode this notion that we want to do something not forever, but for a while? We do it in this way. Our statements run sequentially down to this while. while is a keyword, and it's asking another question, a true/false question: is n greater than zero? I read this as: as long as n remains greater than zero, keep doing this indented block. You have a colon at the end, and then you have two lines of code that are indented. That tells us what the loop is, and then this line is de-indented. So it comes in, and if this is true, it runs these two lines. It prints out n, which is 5, and then it says n = n - 1, which makes n be 4. Then it goes back up and asks this question again: is n greater than zero? If it is, it continues on, prints 4, and then subtracts again. It does that for 4, 3, 2, and prints out 1. Then it comes back up, and now n is zero, so n is no longer greater than zero. So it takes sort of the exit ramp, goes down here, and runs the next line. Now, we're going to cover all this again.
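Here is a minimal sketch of the loop just described. The starting value 5 and the two indented lines come from the walkthrough; the message printed after the loop is a placeholder, since the lecture only says the next line runs.

```python
n = 5

while n > 0:       # the question asked each time around: is n greater than zero?
    print(n)       # first line of the indented block
    n = n - 1      # second line: count n down by one

print('Done')      # placeholder for "the next line" after the exit ramp
```

The loop prints 5, 4, 3, 2, 1 and exits once n reaches zero.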
So I'm just trying to give you the big picture. In the next couple of chapters we're going to hit all these things again, and we're going to hit them in much more detail with a lot better information. This next program is sort of combining these. Again, I don't want you to feel like you need to know this stuff yet. You will know this in a couple of weeks, and you will see this program again. But it shows you how we combine those patterns of repeated, sequential, and conditional together. So this is a bit of sequential code. It comes in here and runs this, which happens to ask for a file name. Then it opens the file, and it creates a data structure called a dictionary. This is all sequential. Now, the for is another form of loop, so this is going to loop for a while, and then within that loop we can even have two indents, and that's another loop. So these are repeated. Then it goes down to the next sequential bit and does this. Here's another loop that's going to run, and here's a conditional inside it. And when it's all done, we print out the last thing. This is, of course, the program that figures out the most common word and prints that most common word out. So this is a Python short story. It reads the name of a file, it opens that file, it reads some data, it makes a histogram, and then it looks through for the most common word. Don't worry too much about this. Over the next couple of weeks we'll fill in the pieces so that you absolutely understand every single line of this code. So this is a quick overview of chapter one. Stick with us.
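For the curious, here is a sketch of a most-common-word program with the same shape as the one described: sequential setup, a loop inside a loop, then a loop with a conditional inside. It is an approximation, not the slide itself. In particular, the program in the lecture asks for a file name and opens the file; this sketch substitutes a short made-up list of lines so it can run on its own, and all variable names are assumptions.

```python
# Stand-in for the lines of a file (made-up text, so no file is needed).
lines = [
    "the clown ran after the car",
    "and the car ran into the tent",
]

counts = dict()                      # dictionary: word -> times seen (the histogram)
for line in lines:                   # outer loop: one line at a time
    for word in line.split():        # inner loop: one word at a time
        counts[word] = counts.get(word, 0) + 1

bigword = None
bigcount = None
for word, count in counts.items():              # loop over the histogram
    if bigcount is None or count > bigcount:    # conditional inside the loop
        bigword = word
        bigcount = count

print(bigword, bigcount)             # the last sequential step
```

Swapping the list for an opened file would bring it closer to the version described, since a file can be looped over line by line in the same way.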
You'll realize it will be chapter seven before this makes much sense. You really have to trust that you are learning important things and that it will all make sense when we bring it together, like in chapter seven, in a few weeks.
Hello and welcome to chapter two. Now we're going to continue to talk about the building blocks of Python: variables, constants, statements, expressions, etc. The first thing we have to talk about is constants. We call them constants because they don't change. They are numbers, strings, etc., and we use them to start calculations, or to say, you know, if something is greater than 40 hours we're going to do something; 40 is the constant in that situation. So we have 123, we have 98.6, and we have 'Hello world', which is a string because we enclose it in quotes. We pass each of these things to the print function, and a side effect of the print function is that we see the output. So print(123) prints out 123, and print(98.6) prints that out. So this is really just the syntax of constants, and without constants we can't write much of anything. The other foundational notion of any programming language is the reserved words. Like I said before, reserved words are these special words that Python is listening for, and they have very special meanings. So when Python sees if, it's not just any other word; it is how Python implements conditional execution. Variables are the third building block. A variable is a way that you can ask Python to allocate a piece of memory and then give it a name, and you can put stuff in it. Sometimes you just put one value in; more on that later.
We'll see, when we do collections in chapters 8 and 9, that more than one value can be put into a variable. The way we control a variable is through the assignment statement. As I said before, it's important to think of the assignment statement as having an arrow to it. So this is not saying that x for all time is the same as 12.2. What it's saying is: take 12.2, find some memory in your computer there, Mr. Python, and give it the label x. We get to choose the x; that's the variable part. We chose it, right? And then stick 12.2 in it. The same is true for 14: find another spot, name it y, and then put a 14 in there. So think of an arrow every time you see that equal sign in an assignment statement. Now, these variables hold one value. So if we have these three statements, these two run and then the third one executes. It says put 100 into x, but that wipes out the old value of 12.2 and overwrites it with 100. So we can change the variables; that's another reason we call them variables. Now, there are some rules for making variable names. You can start with a letter or an underscore. As normal programmers we tend not to start names with an underscore; we tend to reserve those for variables that we use to communicate with Python itself. So when we're making up a variable, we tend not to use an underscore as the first character. You can have letters, numbers, and underscores after the first character. Names are case sensitive, but it's really a bad idea to use case as the only differentiator.
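These naming rules can be checked with a small sketch. It uses the string method isidentifier() together with the keyword module, which is one way to test a name and not something from the lecture. The names come from the lecture's examples, except "spam.eggs", which is a made-up name to illustrate the dot rule.

```python
import keyword

# Names from the lecture's examples, plus one made-up dotted name.
candidates = ["spam", "eggs", "spam23", "_speed", "23spam", "#sign", "spam.eggs"]

for name in candidates:
    # A usable variable name is a valid identifier that is not a reserved word.
    ok = name.isidentifier() and not keyword.iskeyword(name)
    print(name, ok)

# Case matters: these are three different variables (legal, but easy to confuse).
spam = 1
Spam = 2
SPAM = 3
print(spam, Spam, SPAM)
```

The first four names pass and the last three fail, for the reasons given in the rules above.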
So in this case spam, eggs, spam23, and _speed are all totally legit. We would probably not use that last one unless we were doing it because Python told us to use that variable. 23spam starts with a number, #sign starts with a pound sign, and a dot is not a legitimate variable character. And spam, Spam with a capital S, and all-caps SPAM are different, but this is not something that you want to depend on too much. So those are just the rules for names. We tend to start them with a letter and then use letters, numbers, and underscores. Underscores other than as the first character are pretty common, and you'll see those used a lot.
Now, one of the things about variables is that we get to choose the name. We get to choose the name x, choose the name y, and so sometimes we like them short, but sometimes we want them descriptive. The notion of making variables descriptive is often confusing to beginning students. Sometimes it's really helpful: if you're going to have a line of text and you name the variable line, that's great, because the next person reading your program says, oh, that must be the line of text. But it can also become misleading, suggesting that line, the name of a variable, somehow has meaning to Python. Sometimes we'll even have singular and plural variables, like friend and friends. Does Python know about singular and plural? The answer is no. So sometimes we pick variable names that make no sense, and sometimes we pick variable names that make a lot of sense. This is just something that you as a beginning programmer are going to have to understand: we can pick anything we want. I'll try to call attention to this in the first few lectures as we go through. So here's a bit of code with two assignment statements, a multiplication, and a print statement, and you can ask, what is this doing?
Python is perfectly happy with this code. You have said, please give me this as a label, and we assign two variables. Then we're carefully pulling these two variables back out, multiplying them together, sticking the result into yet another variable, and then printing that variable out. We can figure out what it is, but you have to look really carefully, and with a single-character mistake Python is going to be pretty unhappy. Okay, so that's one way to write this program. It's hard, though, because those variable names are long and they're random stuff. It's not very friendly to anyone who might read your program. Now, this one looks a little friendlier. It's the same program, because Python just wants a correspondence. You picked a, you picked b, and you picked c, and it's really much easier for us to see what's going on. So going from here to here is much friendlier, but we can be even friendlier if we pick mnemonic variable names. This one is not mnemonic; it's short and convenient. That one is long and inconvenient. Python is happy with any of these. Here, on the other hand, is another version of the exact same program, and now you think to yourself:
Oh yeah, now I get it. 35 is the number of hours, 12 dollars and 50 cents is the rate, and then we're going to multiply the hours and the rate and come up with the pay, and we're printing out the pay. Whoever wrote this program is greatly helping us understand what's going on, and that's good. Again, all three of these are the same to Python. Choosing variable names in a way that helps your reader understand what's going on is a great thing. The danger is if you read this and think that somehow Python understands payroll, that if you name a variable hours, Python knows what hours means. The answer is that Python really doesn't care what you name the variable, as long as you use the same name consistently. So you've got to be careful. When I write my code in these first few lectures, I will sometimes write it with gibberish, I'll sometimes write it with extremely short but meaningless variable names, and sometimes I'll use meaningful variable names, and I'll call your attention to it. When you look at this third kind, with meaningful or mnemonic variable names, you'll just instinctively want to give Python more intelligence than it deserves. I guess that's probably the best way to say that.
So we've talked about constants, we've talked about reserved words, and we've talked about variables. And here we have sentences like the ones we've already done, where we set x = 2, and then we retrieve the old value of x and add 2 to it, so that becomes 4, and then we print 4 out. print is a function that's built in, and we pass in whatever we want to print.
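Going back to the payroll example from a moment ago, here is a sketch of the same program written three ways. The values 35 and 12.50 and the names a, b, c, hours, rate, and pay come from the lecture; the gibberish names in the first version are made up to stand in for the long random ones described.

```python
# Version 1: legal but unfriendly names (made-up gibberish).
x1q3z9ocd = 35.0
x1q3z9afd = 12.50
x1q3p9afd = x1q3z9ocd * x1q3z9afd
print(x1q3p9afd)

# Version 2: short, convenient, but not mnemonic.
a = 35.0
b = 12.50
c = a * b
print(c)

# Version 3: mnemonic names that help a human reader.
hours = 35.0
rate = 12.50
pay = hours * rate
print(pay)
```

All three print the same number, because Python only cares that each name is used consistently, not what the name means to us.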
So these parentheses are part of a function call. Okay, so with an assignment statement, you really have to get your head around the notion that it has this arrow nature, and that it evaluates the entire right-hand side before we change the left-hand side. You can think of it as: at time step one it does the calculation, and then at time step two it does the copy. That's how you can have something like x on both sides of an assignment statement. So if, for example, x has 0.6 in it, what happens is that Python first sort of ignores this left-hand part right here and evaluates the expression. Everywhere x appears, it pulls the 0.6 out. Then it runs these calculations, and it has the new value only after all the calculations are done. Then and only then is it going to put that back into x, wiping out the old value. At this point the right-hand side has all been taken care of and reduced down to about 0.93, and that is what's put in as the new value. Up next, we'll talk a little bit more about making more complex expressions.
So welcome back. We're now going to talk about expressions. Expressions are the slightly more complex calculations that we can do on the right-hand side of an assignment statement. One of the things about expressions is operators, and the operators in computer programming are often very much the same as the mathematical operators. But we don't have all the fancy characters that we have in mathematics, so we have to choose from what's on the keyboard. This really goes back to the 1960s and 1970s, when we used what was on the keyboard at the time to make these operators. So plus is addition and minus is subtraction. We don't have a times sign or a dot in the middle, so we use the asterisk for multiplication. For division, we can't put two things over top of each other.
So we use slash for division. Raising to a power, because we didn't have little superscript characters back then, is star star. And then there's remainder, which is what's left over when you do integer division. It's also called the modulo operator. It's the remainder, not the quotient. I've got a picture of that coming up. So here's a whole series of little examples of this. We've already seen the plus: x = x + 1. Keep remembering that these assignments are arrows; they have a direction. Multiplication: 440 times 12. Division: dividing that by a thousand gives 5.28. Here we're going to put 23 into jj, and then we're going to do modulo. That says take 23, divide it by 5, and give me back the remainder, and put it in kk. So the expression evaluates like this: divide 5 into 23, which goes 4 times with a remainder of 3, and the 3 is what comes back up here. So that is the remainder, also called the modulo operator. It turns out that picking a random number and then taking it modulo 52 is a way to pick a card randomly, so this modulo operator is actually super useful, especially in games and other things. So those are the various operators. It's important to know which of these operators goes first. That's called operator precedence. Now, normally we put parentheses in, so if I put the parentheses in here, I'd say this goes first, then this.
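Here is a short sketch of the operators listed so far. The numbers 440, 12, 1000, 23, and 5 and the names jj and kk come from the examples above; the other variable names, the 4 ** 3 example, and the card-picking lines at the end are assumptions added for illustration.

```python
import random

xx = 2
xx = xx + 2          # addition: xx becomes 4
yy = 440 * 12        # multiplication: 5280
zz = yy / 1000       # division: 5.28
jj = 23
kk = jj % 5          # remainder (modulo): 23 divided by 5 leaves 3
cubed = 4 ** 3       # raising to a power: 64
print(xx, yy, zz, kk, cubed)

# Picking a card: any whole number modulo 52 lands between 0 and 51.
number = random.randint(0, 1000000)   # made-up range for illustration
card = number % 52
print(card)
```

The card number changes from run to run, but it always stays within the 52 possible values.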
This happens first, then this happens, then this happens. But it's important for us to know the order in which these things will happen if there were no parentheses. The way operator precedence works is that parentheses are the most important thing, followed by raising to a power. Then multiplication, division, and remainder are all at the same level, then addition and subtraction, and within a level it goes left to right. So let's see an example of how this works. Take 1 plus 2 raised to the power 3, divided by 4, times 5, and print out what comes out of this. The way I did this when I was taking exams many, many years ago, when I was first in computer science, is I'd write it all down and I'd look for the highest-precedence thing. Parentheses would make this easy, but there are none, so exponentiation is the first one. That means we're going to take this 2 to the third power: 2 times 2 times 2, 2 cubed, is 8. Then I rewrite the whole thing with the 8 there, and now I look across. The power has been done, so multiplication and division are what I'm looking for next. There is both a multiplication and a division; they're equal, at the same level, so they're done left to right. 8 divided by 4 happens before 4 times 5. So the fact that it's not 4 times 5 but instead 8 divided by 4 is because of the left-to-right rule. Then this gets rewritten to be 1 plus 2 times 5, and this one multiplication is the top one.
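The same rewriting process can be sketched in code, one precedence step per line. The expression is the one from the walkthrough; the step names are made up. Because the expression contains a division, Python 3 reports the final answer as a floating point number.

```python
# The whole expression, evaluated by Python's precedence rules.
x = 1 + 2 ** 3 / 4 * 5
print(x)

# The same calculation, one step at a time.
step1 = 2 ** 3       # power first: 8
step2 = step1 / 4    # then left to right: 8 / 4 is 2.0
step3 = step2 * 5    # then 2.0 * 5 is 10.0
step4 = 1 + step3    # addition last: 11.0
print(step4)
```

If you want a different order, parentheses override everything, for example (1 + 2) ** 3.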
So that goes next: 2 times 5 becomes 10. I rewrite it again, and then 1 plus 10: addition is the lowest thing, and that's how we end up with 11. That's how I would do these problems if I ever saw one on an exam. It's a fun problem to put on exams because there is one and only one answer, and every programming class usually has at least one slide about this stuff. So like I said, the rules go top to bottom: parentheses, power, multiplication, addition, and then left to right within a level.
So we've talked about variables and computing values to put inside variables. But one thing you may have noticed as we go by is that we have different kinds of data. We call it type. Is this of type integer? Is this of type floating point number? Is it of type string? What is going on here? Python is pretty smart about the various types of data. So we're adding 1 plus 4 here, and Python knows as it looks at this that that's an integer and that's an integer, so it adds them together and makes an integer. We can also use this plus to concatenate two strings. This is 'hello ' with a trailing blank, plus 'there'. The plus looks here and says, oh, that's a string and that's a string, so I know what to do with strings: I will concatenate those two things together. It becomes another string that gets assigned into eee, and it's 'hello there' with a space. The plus doesn't add the space.
I added the space by putting it right there. So these operators are kind of smart, in that they know what they're dealing with, and sometimes they will do one thing or another depending on the kinds of variables or constants that they're working with. And sometimes type can get us in trouble. Here we have eee, which is 'hello there' because we've concatenated these two strings together, and now we're adding 1. The problem is that Python looks on one side and says that's a string, and that's a number, and says, I don't know how to do that. This is another one of those annoying errors where you think that somehow Python doesn't like you, but it's just confused. If you look at these things, Traceback always means I quit. It means I stopped; I'm quitting now because I don't want to go any farther, because I've become confused. So your program stops running, and Python says, here's where I stopped running. Because we're typing interactively, it's always line 1 here. But if you read carefully and don't get too stuck on all the detail, it tells us something: line 1, in <module>, TypeError: Can't convert 'int' object to str implicitly. So that's an integer right there and that's a string, and that's what it's complaining about, that little bit right there. If Python is so grumpy about types, then we should be able to ask it about types. It turns out that there is a built-in function inside Python called type, t-y-p-e. The syntax is calling a built-in function named type, and inside the parentheses is the parameter that we're passing to it. We're saying, hey, tell me something about the type of the variable eee. This is a function; the parentheses are part of the function call. And it says, oh, that would be of class str, a string. Then we can pass in a constant and say, hey, what about the string 'hello'? It's like, oh, that's a string too. What about a 1? Well, that's an integer.
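Here is a sketch of the type examples so far. The name eee and the values come from the lecture; the name ddd is an assumption. At the interactive prompt the bad addition stops with a traceback; the try/except here is only so the sketch can keep running, and the exact wording of the error message varies between Python versions.

```python
ddd = 1 + 4                    # integer plus integer gives an integer
print(ddd)

eee = 'hello ' + 'there'       # string plus string concatenates
print(eee)

try:
    eee = eee + 1              # string plus integer: Python is confused
except TypeError as err:
    print('TypeError:', err)   # message wording depends on the Python version

print(type(eee))               # the variable still holds a string
print(type('hello'))           # a string constant
print(type(1))                 # an integer constant
```

Notice that eee is unchanged after the failed addition, because the right-hand side never finished evaluating.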
And so we are asking Python through the type function what the type of either a variable or a constant is. And there are even several types of numbers, and we'll see booleans and others later. Like one with no decimal, that's an integer number; 98.6 with a decimal, that's a floating point number. And so, you know, constants can be both integer and floating point. And I'm just asking over and over again: what is the type of what's in xxx? What's the type of what's in temp? And what's the type of the constant one? And what's the type of 1.0? You can also use a set of built-in functions like float and int to convert from one to another. And so this basically says, I want to convert 99 to a floating point number. So this is a function and it's participating in this plus, but before it can finish the plus, it turns this into a 99.0. The difference is that 99 is an integer and 99.0 is a floating point number, and that changes this computation. As it looks to the left and looks to the right, it says, oh, I've got a floating point number on one side and an integer on the other side, and so I'm going to make my calculation overall be a floating point calculation. I can also pass a variable into the float function. I can say take this variable i, which has a 42, also an integer, and then give me back a floating point. So that'll be 42.0; pass that into f. We print it out and it is indeed 42.0, and it's a float. And so it knows the type and value in any variable: this is an integer of value 42, this is a float of value 42.0. Integer division in Python 2 was kind of weird, and it was actually one of the big things that they changed between Python 2 and Python 3. This is a Python 3 course.
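As a small sketch of the conversions just described, the lines below convert integers to floating point and show that the slash operator in Python 3 gives a floating point result even for two integers. The variable names i and f follow the lecture; the rest is illustrative.

```python
# Converting between integer and floating point (Python 3).
print(float(99) + 100)   # 99 becomes 99.0 first, so the sum is 199.0

i = 42
print(type(i))           # <class 'int'>
f = float(i)
print(f)                 # 42.0
print(type(f))           # <class 'float'>

# In Python 3 the / operator produces a float, even for two integers.
print(10 / 2)            # 5.0
print(9 / 2)             # 4.5
print(99 / 100)          # 0.99
```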
So we're not worried about that too much. What's nice about integer division in Python 3 is it always produces a floating point result. And that means that Python 3's division is more predictable and it works more like a calculator. I mean, you can go back and look at my Python 2 lectures and see how crazy it was in Python 2. So in this case, 10 divided by 2 is 5.0, and the weird thing here is these are both integers, but the division forces the result of the calculation to be a floating point number. And, you know, 10 over 2 could be 5, but 9 over 2 is 4.5, and so that is accurate. In old Python 2 that would give us back 4, which is completely unpredictable and weird. The same with 99 over 100: as you would expect if this were a calculator, you get 0.99. Actually what you get in Python 2 is 0, because it would round it down. I mean, it doesn't round at all, it truncates it. So 99 over 100 is 0.99 and then it truncates it to 0. That's Python 2. We're not talking about Python 2. There's a good reason we're not talking about Python 2. Welcome to Python 3. Of course, if there is a floating point on either side, the result is still a floating point. So integer division produces a floating point result in Python 3.0, not in Python 2.0. That is an improvement in Python 3.0, and that's why we're recording these lectures. I have a whole great set of lectures about Python 2 and now I'm going to have a great set of lectures about Python 3. Welcome to Python 3. Okay, so we've been talking about converting from integer to floating point, but you can also convert from string to integer or string to floating point. And so here we start out with a little string value. Now it only works for strings that are made of digits. So quote one two three quote is not an integer. It is a three-character string that has one, two, three as the characters in that string, which is very different than a hundred and twenty-three. We say what is the type of this?
It's a string. We say let's add one to it and it says can't convert int to string. So that blows up, right, because this is a string. It looks to both sides: string plus an integer, not good. Okay. But we can convert this. We can call the int function, which is like the float function, and pass a string in. So it says, hey, take this and turn it into an integer. So take the contents of sval, which is the string one two three, and give me back an integer representation of that, which is going to be a hundred and twenty-three. So we say what kind of thing did we get back? Well, we got back an integer. We can now add one to it and get a hundred and twenty-four. And so you have to manage the type of things, and you can convert from one type to another. Now int is not magic. If you send into it a string that doesn't consist of digits, then you're going to end up with another error: invalid literal for int with base 10, blah blah blah. So it's really complaining; it says I want these to be numbers here and you just gave me letters. So that's going to cause this to fail. Another thing that we're going to do with variables: just like the print function takes a list of things, in this case a string, comma, a variable, and then prints some output from the program, the opposite of that is input. Actually input generally happens before output. Input is a built-in function, and we pass to it a prompt, a string of text that's going to be printed out for the user, and then it stops and waits. So it says who are you, and then right here it just sits, waiting for us to type something. So we type blah blah blah and then hit the enter key. Right, we hit the enter key and then this text ends up in this variable.
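The string-to-integer conversion described a moment ago might look like the sketch below. The name sval follows the lecture; ival is an assumed name for the converted value, and the try/except (covered later) is added only so the failing conversion does not stop the script.

```python
# Converting a string of digits to an integer, and what happens when it fails.
sval = "123"
print(type(sval))    # <class 'str'>: three characters, not a number

ival = int(sval)     # ask for an integer representation of the string
print(type(ival))    # <class 'int'>
print(ival + 1)      # 124

try:
    int("hello bob")             # not made of digits, so int() cannot convert it
except ValueError as err:        # added for illustration
    print("ValueError:", err)
```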
So this is an assignment statement: Chuck is the result of the input call, and it's copied into the nam variable. So let's do that again. It's evaluating an assignment statement. Remember, it kind of goes this way, or you can think of it as do the right side first. It writes this out, then it waits, wait, wait, wait, until we hit the enter key, and takes this Chuck, and that becomes the result of this input, which is then assigned into nam. Then we go sequentially to the next line. It prints out welcome, comma, the contents of the variable nam. Now this comma here actually does put the space in automatically. So it says welcome space Chuck. There's no space in Chuck, just the Chuck. And so print can take more than one thing separated by commas. As a matter of fact, print can have comma, comma, comma inside the parentheses, as many as you like. Everything you've seen up to now is kind of one thing in the print, but that doesn't mean that print can only do one thing. So I've talked about variables. We talked about constants. We talked about input. We talked about output. And now it is time to write our first meaningful program. And so this program has to do with those of you who have traveled internationally. If you've been in the United States and you've traveled outside the United States, you notice that there is an elevator convention that is different. Inside the United States, you walk in on the ground floor and in the elevator that's one, and if you walk in on the ground floor in Europe or many other places in the world, in the elevator it's zero. So we have written a small app that we're going to put on the app store and get wealthy with, called the elevator floor conversion app. And it's going to ask us: we're in Europe and we're lost, and you say, well, what floor would this be if I was in the United States of America?
And so here we have to read the floor that we are at in Europe, and then we're going to convert it to a US floor, and then we're going to print it out. This is very silly, but it is a pure, essential program that has input, does some kind of task on that input, and then produces some output which is useful, for some value of useful. Okay, so let's take a look at how we combine everything that we've learned in this lecture: input, processing and output. It's a three-line program, but it's sort of the beginning of something that programs do. Okay, you're going to do lots of programs that do this. So here we go. The program starts and we do the input. As a side effect, it prints out this and then waits. We type in zero, that comes back here, and there's zero, which is a string. Input gives you back a string. It doesn't give you back a number. It was a little different in Python 2, but in Python 3 input gives you a string. So quote zero quote, which is what we typed here (we didn't type the quotes), is a string that gets stored in the inp variable. Then we move to the next statement, and on this right-hand side we convert that string variable to an integer. So that becomes the integer zero. We add one to it and then that becomes one, and then we assign that into usf. I've named this variable United States floor. Right, so inp is the input and usf, that's mnemonic. Python doesn't know anything about elevators; it's just that I picked a variable name that was quite friendly. And so at this point usf has the United States floor that's equivalent to the European floor, and then I just fall down and I do a print statement: print out US floor, comma, that's this space right here, and then whatever the contents of the usf variable is. And you could see that I could type in four and it would say three. I could type in say seven and it would say six. This is an amazing program. It converts floors in the European numbering scheme. Wait, actually, no, I got that wrong. Hang on.
Let me clear this. I wasn't thinking clearly. I could type in four and it would give me back five. I could type in six and it would give me back seven. See, I'm confused. I haven't been in Europe in a couple of months, and so I forgot all about the floors, but that's the idea. Now this is a super, super, super simple program, not super useful, but you get the idea that we're going to pull some data in, we're going to do some intelligent thing (soon this will be hundreds of lines of code instead of one line of code), and then we're going to present the results to our user. Now another element of most any programming language is what's called a comment. A comment is a way for you to put in a program file some text that's to be ignored by Python or C or whatever language we happen to be using. And in Python comments start with a pound sign. So what you could do is put a pound sign anywhere in a line, and Python ignores everything after that pound sign. It can be the first character. So here's our recurring example that we talk a lot about. We're not going to cover this. Remember what this does: this is counting how many times each word appears, so there were 16 of one word in that file, or six of another, or whatever it was. This is that code. We'll get back to this code. But what we've done here is I've added some comments that are really for human consumption. So this first paragraph is get the name of the file and open it, the second paragraph is count the word frequency. You know, maybe I should have said histogram here: count the word frequency and assemble a histogram. And then here I'm putting this pound sign in: find the most common word. And then I'm all done. I print the stuff out, right?
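Putting the last two ideas together, here is a hedged sketch of the three-line elevator program with comments added. The names inp and usf follow the lecture. The fixed string stands in for what the user would type, so the sketch runs without waiting at a prompt; the prompt text shown in the comment is an assumption.

```python
# Elevator floor conversion: European floor in, US floor out.
inp = "0"                 # stands in for: inp = input("Europe floor? ")
usf = int(inp) + 1        # convert the string to an integer, then add one
print("US floor", usf)    # the comma puts a space between the two items
```

With a real input call in place of the fixed string, typing 4 would print US floor 5 and typing 6 would print US floor 7, matching the corrected numbers above.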
And so all I'm saying is comments are for people to read: your next programmer, or the person who's going to change your program after you're done with it. And they're nice, and you don't have to use any particularly weird syntax or variable naming conventions. You put a pound sign in and you can write anything you want from that point forward. Okay, so we've talked a little bit about variables and types and mnemonics and how we would choose variable names, and how expressions work and the various operators, converting between different types, printing, input, output and comments. So that just kind of gets us sentences. Coming up next, we'll talk about conditional execution, where we're really starting to move up to paragraphs. So see you in a bit.
Hello and welcome to chapter three, conditional execution. In conditional execution, we meet the if statement. The if statement is where Python can go one way or another way, and it's the beginning of sort of our way of making Python make decisions for us. Sequential code:
We just, you know, do some things; sometimes that's useful. But now we can have our code check something and then make a decision based on that thing. So the conditional steps in Python are pretty straightforward. The keyword that we're going to use is if, and so if is a reserved word. And the if statement has as part of it a question that it asks, and this is asking if x is less than 10. And the colon is the end of the if statement, and then we begin an indented block of text. And the way this works in this particular thing is this line is the conditional line: if the question is true the line executes, and if the question is false the line is skipped. And you can think of it this way: x is five, ask a question, is it less than 10 or not? These questions do not harm the value of x. If it is, then we run this code, and then we sort of rejoin here, and then we test this next if, and if that's true we do this code and then we continue there. But in this case it's going to be false because x is not greater than 20, and so it just continues down here. So if we look at how this works: it runs this line, then it sees this question, it skips that line. So this line does not run, and so Smaller prints out and Finis prints out. Okay, and so that's the basic idea of an if statement and the indentation. When we are done with an if statement, we de-indent back, and there's this little block. This is one if statement and this is another if statement, and these are the two conditional lines that either run or don't run depending on the answer to that question. So we have a number of different comparison operators that we can use to ask these true-false questions that say, is this true? So again, we're kind of limited to the keys that were on computer keyboards in the 1930s, 40s and 1950s. Less than, less than or equal to: we didn't have fancy math characters, so we just concatenated less than and equals to be less than or equal to. This double equals is
the one asking, is this equal to? And so that's a little tricky: the single equal sign is the assignment operator. If I was building a language today from scratch, I would probably make assignment be an arrow and the equality question be a plain equals, or I might say question-equals somewhere. But I'm not building this language, so it's not up to me. So double equals is asking the question, is equal to. Then greater than or equal, greater than, and not equal. So the exclamation point is sort of like not, so exclamation point equals, that's how we do not equal. So we take a look at some of these in some examples. All of these are going to be true because of the way x is set. If x is equal to five, that's the question version, that's true or false; it'll execute that. If x is greater than four, it's going to execute that. If x is greater than or equal to five, it's going to execute that. Here's kind of a shorthand where, if there's only one line in this block, you can pull it up right on the same line after the colon: if x is less than six, which is true, execute that. Then if x is less than or equal to five, do that, and if x is not equal to six, do that. Now like I said, all these questions have been carefully constructed so that they're true, just to show you the syntax of those comparison operators. Now you don't have to have just a single line of text in the indented block, and this will be something you're going to get used to. So if we indent more than one line, then the conditional code is actually these three lines. So the idea is you have an if statement, you come in, you do an indent, and as long as you stay indented, you stay in that if block. If it's false, it just skips all of those. So the way this is going to execute: x is five, it prints before five. Is x equal to five?
That's the question mark, and that's true. So it's going to run all of these and then come back and continue on at the de-indent. So all this stuff is running. Right, and then it says if x equals six. Well, that was false, so that skips all of them. So none of these lines of code run. These actually don't run, and it says afterwards six. So that's a mistake; those don't run right there, okay, because x is not equal to six. Okay. So indentation is an essential part of Python. We use indentation in lots of programming languages to demarcate blocks, to show where blocks start and stop. But in Python, it's syntactically significant. You can make an error if your indentation is wrong. After an if you must indent, and you maintain the indent as long as you want to be in that same if block. And then when you're done with the if block, you reduce the indent. In this rule of indenting, comment lines and blank lines are completely ignored. So we're going to tend to put four spaces; four spaces ends up being the normal thing that we do, and you'll see all the code that I write has four spaces for each indent. If I go in twice, I use eight spaces. And we have this instinct of wanting to hit the tab key to move in four spaces. Now the problem is that it might look the same on your screen. A tab and four spaces might line up in the same place depending on how tabs are set, but Python can get confused by that. So we tend to avoid using actual tabs in files. And so in most programming text editors, like if you're using Notepad or TextWrangler, there's a place to set the tabs to say, don't put tabs in this document, but every time I hit tab, move over four spaces. And so you hit a tab, but it's like space, space, space, space. Now the nice thing about Atom, and this is the text editor we tend to recommend in this class, (a) because it works on Windows, Linux and Mac, but also because it automatically sets this up as soon as you
save your file with a .py extension. You can sort of hit the tab key with impunity and everything works perfectly. But the key thing here is that Python insists that you get this right, and if you don't get this right, you're going to get indentation errors, and they're just another syntax error. So if you're using something like TextWrangler or Notepad, run around in the preferences and you'll find something about expanding tabs, or maybe how many spaces each tab stop is supposed to be. And so you check these, and what this is really doing is telling your text editor, never put an actual tab in the document, but somehow simulate tab stops using spaces. And so here is a bit of code. It's got some nested blocks, but it gives you the sense that you have to be very explicit when you're reading Python code about whether the indent between two lines is the same, increased or decreased. Every time you increase it you mean something, and every time you decrease it you mean something, and literally if it stays the same you mean something as well. And so if we take a look at this: here we have a line, and the next line has the same indent. This is an if with a colon at the end, so we have to increase the indent, and now we're maintaining it. Okay, so these two lines are part of that if, but now we have de-indented. So where you do this de-indent affects the scope of how far this if statement lasts, right?
It lasts up to but not including the line that's de-indented to the same level as the if. Okay, so this is a de-indent. Now we have a blank line, which doesn't matter, and we maintain it. Now we have a for, which we'll learn about in the next chapter, which is a looping structure. This for runs this five times. It has a colon and it also expects an indented block. Now we have what's called a nested block, where we have an if and a colon and we go in some more. So this is like two indents, right? So these are one indent and these are two indents, and so this is a block within a block. And then we de-indent, so that means this print is not part of the if statement, but it's still part of the for statement. And then we de-indent again, and that means this print is on the same level as that for statement. So if you start thinking about this, you want to be able to think that these blocks run from the start of the block, the line with the colon, up to but not including the line that's been de-indented. So the for goes this far, right? The for goes up to but not including the line that's de-indented; the if goes up to but not including the line that's de-indented. So as you do this, you'll sort of mentally start drawing these blocks, and pretty soon you will start constructing them as blocks. It takes a while, but it doesn't take forever. But in Python, unlike other languages, this is very important and it matters. And you can have syntax errors if you get it wrong, because you're really communicating the shape and structure of your code using these indents and de-indents. We already saw a nested indent.
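The indent and de-indent walk-through above can be sketched as follows. The exact strings and values are assumptions chosen for illustration; what matters is how each change of indentation starts or ends a block.

```python
x = 5
if x > 2:
    print("Bigger than 2")      # one indent: inside the if block
    print("Still bigger")       # same indent: still inside the if block
print("Done with 2")            # de-indented: the if block has ended

for i in range(5):              # for ends with a colon and expects a block
    print(i)                    # one indent: inside the for block
    if i > 2:
        print("Bigger than 2")  # two indents: a block within a block
    print("Done with i", i)     # back to one indent: in the for, not the if
print("All Done")               # level of the for: runs once, after the loop
```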
This is a nested if. So you can put an if within an if, and you can go as deep as you want to go, like Russian dolls. And so here we have x equals 42. If it's greater than one, we indent once, and then with this next thing these are at the same level of indent, but now we see an if and it has to indent further. So this is like two in, eight spaces, and then we de-indent back, actually de-indent back two. And so if you watch this and take a look at how this works: it runs to here, comes in here. The answer is yes, x is greater than one. It prints this. Is x less than 100? Well, it's 42, so the answer is yes. So it runs this and then it continues back to there. And you can also think of drawing boxes around this. This is one if box, and then within that if box there is another if box. And again, it's the indented block up to but not including where the de-indent happens. And this here is like two backwards de-indents, so it ends two blocks. So two blocks are ended by where we place this. We could move this in, or we could move this out.
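A minimal sketch of the nested if just traced, with x set to 42 as in the lecture (the printed strings are assumptions):

```python
x = 42
if x > 1:
    print("More than one")       # runs: 42 is greater than 1
    if x < 100:
        print("Less than 100")   # runs: 42 is also less than 100
print("All done")                # de-indented twice: ends both if blocks
```

Indenting the last print by four spaces would make it part of the outer if; indenting it by eight would make it part of the inner one.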
We could have it all the way in to here, we could have it to here or here, and where we put that line determines how the ends of these blocks are going to work out. So that's a one-branch if that we just saw, but then you can also have what's called a two-branch if. And the basic idea of a two-branch if is that you're going to come in, you're going to ask a question, and you're going to go one direction if it's yes and another direction if it's no. We call this an if-then-else. It's kind of like a fork in the road, and the way to think about it is, depending on the output of this question, we're going to pick one of the two, and if we pick one the other one's never going to happen. So it's like an either-or: we're either going to go one way or we're going to go the other way, but there is no path where we somehow go through both of them. That doesn't happen. And the syntax that we use for this is what we call the if-then-else. And so the first part is a normal if with an indent, and then we de-indent, and then there's another reserved word, else, with a colon, and then we re-indent. And so this really ends up being part of a whole block here. And this is the part that runs if it's false, and this is the part that runs if it's true. The first indented block is what runs if it's true, and the second indented block is the one that runs if it's false. And so here we go: if x is greater than two, in this case it's yes, we're going to print bigger and then we're going to be all done. And so this one did run and this one did not run. So basically with an if-then-else, one of the two branches is going to run, but there's no case in which both branches run. And again, you sort of draw these blocks around these things mentally. And in this one you take from the if, and the else is really part of the block, up to but not including that print, which is de-indented back to
the same level as the if statement. Okay. Python is actually one of the more elegant languages, even though after a while, when you get too far in, this indenting gets a little bit complex. But this is a good way to visualize this, with these indents. Coming up next, we're going to talk about some more complex conditional structures.
So welcome back. Let's talk a little bit more about some more complex conditional statements that build on this concept of if and if-then-else. The first thing we're going to look at is the multi-way branch. And so the idea is it's kind of like the if-then-else where you're going to pick one of two, but now we can pick one of three or one of four or one of five. And it introduces a new concept called the elif. The elif is another reserved word inside Python. And the way it works, it's probably best to look at this here: it checks the first one, and if it's true then it runs that and then it's done. It doesn't check them all. It's not like it sees that there are two logical conditions; it actually checks the first one, and how you order these matters, as we'll see in a bit. And so if the first one is true, it runs. If the first one is false and the second one is true, it runs this one and it's done. And if neither of them is true, it falls through, and there's an else clause that is the otherwise, and it runs that. So basically it's going to run one and then skip the other two, or it is going to, you know, skip one, skip two and then run this one. But it only runs one of them in this case. But the important thing is it checks these questions in order, and it doesn't check the second question until it knows the first question is false.
So if the first question is true, you're done. You're done with the whole block at that point. So only one of these three is going to execute in that block. So here are some examples of this. If we, for example, have x equals zero, it's going to come down here: x is less than two, that's true. So it runs this code and then it skips, skips, skips down to that. And so it's like this: runs that code and then skips to the end. Okay, on the other hand, if it's five, then this is false and it skips that, and it checks this, and this is true. It runs this code and then it's done and skips to the end. It goes like false, true, run, end. And then if x is like 20, for example, it goes false, false, run the else clause, and you're done. So skip, skip, oh, else, run that code and you're done. So in this case we ran that, and we didn't run that, and we didn't run that. Again, one of them is going to run. They're checked in order; these questions are checked in order, not out of order. I mean, it doesn't look ahead.
It just checks in the order that you wrote it. You're the one that wrote that order. And so there are a couple of variations on this multi-way. You can have no else, as in this case. And this just means that it might not run any of them. In this case x is five, so it's not less than two, but then it runs this one. But if x was like 50, for example, then this would be false and it would skip, and this would still be false and it would skip, and neither of these two would run. So if you don't have an else, you're not guaranteed that one of them is going to run, because else is like the catch-all: if the other ones are all false, then the else is the one that runs. Similarly, you can have many elifs, but this is where it's really important for you to make sure you know what order they're being taken in. If this is true, it runs and it goes all the way to the bottom. If it's false, false, false, true, it runs this one and it's done. If on the other hand they're all false, false, false, false, false, then it runs the else. Right, this one has an else; this one didn't have an else. They don't have to have them. The key is you can have more than one of these elifs. Okay, so I've got a couple of little things.
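The multi-way branch and its no-else variation can be sketched like this. The thresholds 2 and 10 and the test values 0, 5, 20 and 50 come from the walk-through above; the names size and ran and the label strings are assumptions, and the for loop (introduced in the next chapter) is used only to try several values of x in one run.

```python
# Multi-way branch: the questions are checked in order, and only one block runs.
for x in [0, 5, 20]:
    if x < 2:
        size = "small"
    elif x < 10:
        size = "medium"
    else:
        size = "LARGE"       # the catch-all when every question is false
    print(x, size)

# Without an else, it is possible that none of the blocks runs.
x = 50
ran = "nothing"
if x < 2:
    ran = "small"
elif x < 10:
    ran = "medium"
print(x, ran)
```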
I'll let you pause right now and look at them. The question is, looking at the three lines or four lines of code, x equals something: are there lines of code that will never execute regardless of the value for x? And I'll let you pause and think about it, and then I'll explain it to you. Okay, hopefully you paused and thought about it as long as you like, so let me now explain it to you. So we come in here, and if x is less than two it's going to run this first thing, and if x is greater than or equal to two it's going to run this, and if neither of those is true, then it's going to run this. Well, the weird thing is, all numbers are either less than two or greater than or equal to two. I carefully constructed this to the point where it would never run this line of code. It is either going to run this one or run that one, but it's not going to ever run this one, because that was kind of like a weird dysfunctional one that I constructed. This other one is a little different. If x is less than two we do this, if x is less than 20 we do that, if x is less than 10 we do that, and if none of those are true, we do that. Well, the problem here is between these two lines. The problem is if something's less than 10, like six for example, it's also less than 20. So even though there might be values for which this is true, those also are going to have this true. So for something like six, it's going to run here, and it's not even going to look at this. That's the point.
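The second puzzle can be tried out with a short sketch. The thresholds follow the description above; the label strings, the name label, and the for loop over sample values are assumptions added so the effect is visible in one run.

```python
# The "x < 10" branch can never run: any x below 10 is also below 20,
# and the "x < 20" question is asked first.
for x in [1, 6, 15, 50]:
    if x < 2:
        label = "below 2"
    elif x < 20:
        label = "below 20"
    elif x < 10:
        label = "below 10"        # unreachable because of the ordering
    else:
        label = "something else"
    print(x, label)
```

Swapping the two middle questions, so that the check against 10 comes before the check against 20, would let a value like 6 reach the below-10 branch.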
It doesn't even look at this. And so, I mean, I could have made this more sensible if I had moved this little block of code up to there. And so this is where the order in which you choose your questions, the way you put these elifs together, matters, because it doesn't look at all of them. It only looks as long as it sees falses; then it keeps on going to the next one. But as soon as it doesn't see a false, it doesn't continue. So the last conditional structure we'll talk about is the try and except structure. And if you know any other languages like C++ or Java or JavaScript, you're like, whoa, that's kind of an advanced concept. But it turns out in Python, because of Python's propensity to throw tracebacks in situations where you would kind of like to recover, you have to use it a little more and a little earlier as you build your programming skills. So the problem is, what if there is a line of code and you absolutely know it's going to make a traceback? It's going to blow up. But you don't want to blow up.
I mean, I don't want to have code blow up. If you're using my autograder and you see a traceback in my autograder, I consider that a failure. I could put out an error like, hey, you entered blank data, or you didn't enter a number. But a traceback, that just seems like I'm too lazy as a programmer. So we as programmers are supposed to anticipate parts of our code that are going to blow up, potentially based on the user's input, and then do something about it. And that's what the try and except are for. You take this little dangerous piece of code that might break and might blow up, and you surround it with a try that says, this might blow up, and if it fails, run this code down here. Okay, so that's the try, and the except is kind of like, if you get an exception. And the problem is, if you are running code, here's a little bit of code: we put hello bob in and we convert it to an integer, and we know from past experience that this blows up. Right, you can't take hello bob and convert it to an integer. It's just going to blow up. And, you know, here we are: it says, oh, you blew up on line two. That's great. And I'm not very happy with hello bob and whatever, but the important thing is your program stops. These other lines, they don't exist. Right, it doesn't go any further. Remember, the traceback is Python saying, I'm really confused and I don't know what to do next. So Python is just going to be conservative and stop. So Python stops and your program stops. No matter how much error checking you put down here, it doesn't matter, because it's gone. It's all gone. And like I said, we take this kind of personally, because the code that you write is like you being put into the computer, giving it instructions, and if the code blows up, well, that sort of wipes you out. You're not in the game anymore.
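A minimal sketch of surrounding the dangerous conversion with try and except. The names astr and istr and the -1 flag are assumptions for illustration. Naming ValueError in the except clause is also a choice made here; a bare except would catch the failure too, but naming the expected error avoids hiding unrelated problems.

```python
astr = "Hello Bob"
try:
    istr = int(astr)    # the dangerous line: this string is not made of digits
except ValueError:
    istr = -1           # a flag value telling us the conversion failed
print("First", istr)

astr = "123"
try:
    istr = int(astr)    # this time the conversion works
except ValueError:
    istr = -1           # skipped, because nothing went wrong
print("Second", istr)
```

In both cases the program keeps running after the try block instead of stopping with a traceback.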
You're not able to do anything. So especially in these situations where we can anticipate that an error might happen in the normal course of your program's execution, it might be something that you want to compensate for, and that's what the try and except does. So here's a bit of code for the try and except. We just have two little bits of straight-line code. We put a string in here that's 'Hello Bob', and then we're going to convert it to an integer. This is the dangerous code; in this case, with 'Hello Bob', it's going to do a traceback. And so we say try, and then we indent the dangerous code, and then we add this little except bit. If it works, the except is ignored; if this blows up, it runs the except. So in this code, it's going to come in and try this, and this is going to blow up. But instead of giving a traceback, it's going to say, oh, I've got an available except, I'm going to run this except code and then I'm going to continue on. And so that prints out 'First -1', because we set this variable istr to negative one, like a little flag telling us that something went wrong. And then we keep on going, and now we have put in '123', the digits one, two, three, and now it's going to work. We still have it in a try block, but this one works. It does not blow up, and so it ignores the except block. So the except block is only triggered when something goes wrong in the code; it is ignored if nothing goes wrong. It's like you bought an insurance policy on this line of code, and when things go wrong your except block springs into action and does whatever it is that you want it to do in the case of an error. Okay, so that's a pretty useful thing. You've got to be a little bit careful that you don't overuse it, because if you put more than one line inside the try part and one of the lines blows up, it doesn't come back to the try block. And so in this one
here, we have kind of a simple, silly one where we set the string. We're worried about some stuff. Well, the print statement is never going to blow up, so it's a bad idea to put it in the try/except anyway. Then we do this conversion, and that's the dangerous part, and in this one it's going to blow up. So then it goes to the except block, runs the except block, and then continues. What it does not do is somehow go back and finish this, so these lines are gone. If you look at it like this: the try starts, 'Hello' prints, this blows up, it goes to the except, it runs the except, and it continues on. It never runs that code. So it's not like you took out insurance on the whole block. Any of those lines can blow up, but whichever line blows up, that is the last line that executes in that block. Okay, so in this particular example, the print statement would probably go out there and this print statement would come down here, and you would only put in your try block the single line of code that you think might blow up, because you kind of know print statements aren't going to blow up.
So this is a more common real-world example, where the user is going to type some data, and it's users that get us in trouble. Our program starts by asking the user to enter a number, and we know that this could be dangerous. So we're going to put the conversion from string to integer in a try block, and we're going to set negative one if that's a failure. And then if it's greater than zero we'll say 'Nice work', and otherwise we'll say 'Not a number'. So the first time we run this program, out comes 'Enter a number'. We type in 42, which is a string, and that 42 goes back into rawstr. It runs in here, this runs, it's fine, and that becomes the number 42. So we skip the except block, and ival is greater than zero.
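A sketch of the program being walked through here. To keep it runnable without a keyboard, rawstr is set directly instead of being read with input(); the variable name message is an addition for this sketch, and catching ValueError specifically is likewise an assumption.

```python
rawstr = '42'             # stand-in for: rawstr = input('Enter a number:')
try:
    ival = int(rawstr)    # only the risky conversion goes inside the try
except ValueError:
    ival = -1             # flag: the text could not be read as an integer

if ival > 0:
    message = 'Nice work'
else:
    message = 'Not a number'
print(message)
```

Changing rawstr to something non-numeric sends the program through the except and into the else branch. One consequence of using -1 as the flag: a valid entry such as '0' or '-5' also ends up in the else branch, because the test is only ival > 0.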
We print out 'Nice work' and we skip the else. Okay, so it says 'Nice work'. On the other hand, if we run it again, this time the input says 'Enter a number' and we're silly: we enter 42, but in words, forty, f-o-u-r-t-y. So that's a string, and that goes into rawstr, and then the execution continues. We run in here, and now this is going to blow up. Normally we would see a traceback right there. But we're not going to, because we put this calculation in a try and except block. It's going to immediately run the except block, set ival to negative one, and continue on with the program. See, you have not blown up at this point. And is ival greater than zero? Well, it's negative one, so we're going to hit the else clause and print out 'Not a number'. So we've done error detection. The user typed something that caused a line of our code to kind of blow up, but we put that line in a try and except block, and so we caught it and we dealt with that fact.
So in summary, we talked about if statements, we talked about else, and we talked about how important indentation is to mark where blocks begin and end, and about elif and try/except. Up next, we're going to talk about loops and iteration.
Hello and welcome to chapter four, functions.
This is one of our four basic patterns; we'll get to iterations next. Functions are the store-and-reuse pattern. One of the things in programming is that we never like to repeat ourselves. If we have four or five lines of code and we're going to do the same thing later, we don't like to put the same four lines of code in again. It even has to do with reliability: if you find something wrong with those four lines of code and you've got them in 12 different places in your program, then you've got to find all 12 places and fix them. So collect those in one place and then call them and reuse them, and that's the idea of store and reuse.
So this is how functions work inside of Python. The first thing we notice is there is a new keyword, def, which stands for "define function." The def is like an if statement, or the fors and whiles we'll see: they end in a colon, then they have an indented block, and then the indented block de-indents, and that's the end of the function. So there are two statements that make up this function. The key thing that you have to understand and get used to is that this def part is actually not running any code whatsoever. It's remembering the code, and that's what I call the store phase. The def creates a bit of code and records it, like a macro, although it's much more complex than a macro, and it names it whatever you chose. You gave it a name.
We named this one thing. So as a side effect of Python reading or parsing these three lines, it doesn't do anything, but it remembers that these two lines are what you would like to run when you invoke thing. So this is the definition of the function and this is the invoking of the function. This part doesn't do anything, so there's no output from that stuff right there. But then what happens is you invoke it, and this thing looks like it's part of Python, but in effect you've extended Python with your def statement. So when it sees thing, it goes up and runs your code, and out comes 'Hello' and 'Fun'. Then it comes back and goes to the next line, does the print, so that print comes out, and then it goes back up, and this is the reuse part. We define it once and we use it twice. It runs this code again and then goes to the next line and it's all done. So this little bit came out twice. Of course, this is really simple so that I can fit it on a page, but you get the idea: this might be 15 to 100 lines of code, and I don't want to type those over and over again. So I say, hey, store these under a name that I choose, and then when I invoke them, bring them back and run them again. Okay, so that's the basic idea.
We actually have been using functions from the beginning. print is a function, right? Every time we see print, p-r-i-n-t, parentheses, and then some stuff in here, we are calling the print function.
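The define-once, use-twice example can be sketched roughly like this. The function name thing and the 'Hello' and 'Fun' output come from the walkthrough; the text printed between the two calls ('Zip') is an assumption, since the transcript only says that a print happens there.

```python
def thing():
    # Nothing runs yet: Python only stores these two lines under the name thing.
    print('Hello')
    print('Fun')

thing()          # first invocation: prints Hello, then Fun
print('Zip')     # an ordinary print between the two calls (text assumed)
thing()          # reuse: the same stored code runs a second time
```

If the two calls to thing() are removed, the program prints only the middle line, because the def by itself produces no output.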
This syntax, with the two little parentheses, is the syntax for functions. So input's a function, type is a function, float's a function, int's a function. All of these are built-in functions that come with Python; we installed Python and these came along. And then there are other functions that we define and use, and that's what the def is for. In effect, we can create new reserved words of our own making that extend the Python language after we define the function. So a function is just this bit of reusable code that takes some arguments. We haven't seen any with arguments yet; there's a little set of parentheses, and we'll see how that works in a bit. We define it using the def keyword and then we invoke it. There's the defining phase, which actually doesn't run the code, it just remembers the code, and then there's the invoking phase. You define it once and then invoke it one or more times. Calling the function or invoking the function: we think of those two things as the same thing. Call and invoke are just the terms we use. Most people just say "call the function," but invoking is perhaps a more descriptive way to think about it.
So here's an example of a function that is built into Python. It's called the max function, and we can pass some parameters into the max function.
So we pass the 'Hello world' string. Now, like much of Python, max knows what kind of thing is being passed into it, and it knows that it's looking for the largest character, the lexicographically largest character. Inside the max code, it scans through and finds the largest character. Apparently lowercase letters are higher than uppercase letters, because we get back a 'w', and this is what's called the return value. So this is an assignment statement. It has to evaluate this right-hand side, and a function call is nothing more than something to evaluate, like x plus one. It runs the function code, passes in this argument, and then this residual value, which is called the return value (we'll look at this in more detail), becomes the result of this little bit of the expression. There's nothing else here, though we could have, you know, plus one or something, and then the 'w' is what's stored into big. Okay, so we print big, and big is a variable that has the letter 'w' inside of it. And then we ask what is the smallest, and that finds the blank, so we get a blank. So there's a min function and a max function. Both of these are built-in functions; they're always there for us.
So here is another look at the max function. We can think of this as invoking or calling this function as this right-hand side is being evaluated. We are passing this variable in, and there's some code in here, and it's going to do some stuff, yada yada yada, and then it's going to give us back a bit of stuff, and that's its return value, and then that goes up into big, right? So that's how this works. And this is actually built in (built in or burnt in, I guess; I can't draw), so you can think of this as: some time a long time ago, when Python was first being formed, somebody wrote some code, and it's got some stuff in it.
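As a small sketch of the two built-in calls just described: big is the variable name used in the walkthrough, while tiny is an assumed name for the result of min.

```python
big = max('Hello world')    # the function call is evaluated, then assigned
print(big)                  # w

tiny = min('Hello world')   # the smallest character is the space
print(repr(tiny))           # ' '
```

Printing with repr() is only there so the blank is visible on screen.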
It's got a little loop that reads through all the letters. It has to figure out if it's a string or a list, et cetera. But this is the store, except you didn't do the storing because it's already built in, and then this is the reuse. Store and reuse. So these things are built into Python. They're pre-built, as if before the first line of your code executes, way up here, someone put all this code into Python for you and created a thing called max for you.
Now, we've been using built-in functions already. We've got type conversions. We've got float, which takes an integer and returns a floating-point version of it. And again, this is kind of like an expression. It's like, I want to divide this by a hundred, but before I do that, I've got to convert it to a float, so it has to do these function calls as it's evaluating the expression. Sometimes, like here, it just prints out the return value if you just type a function call. The parameter can be a constant or it can be a variable, and as we'll see in a second, there can be many of these if you like. So you can either just run it or take the result of it. This passes an integer in, converts it to a float, and then puts the float into that variable, and type tells us what kind of thing that is. And you can use this inside of an expression. So it's like, what am I going to do first? Oh, I've got to do two times this thing. Oh, wait a sec, pause briefly to call out to some float code, pass a three into it, and then something comes back: the return value, the residual value, comes back, and then that participates.
In this case it's going to be 3.0, which participates in this two times 3.0, and so two times 3.0 becomes 6.0, et cetera. But you can see it's like, oh, wait a sec, I've got to figure out what this is: call the function, get the return value, and then continue processing this expression.
We've also done this with string conversions, partly because the input function always returns a string. So this string could be coming from input, but we'll just take '123'. We know that that's a string; it's not the number 123. And if we try to add one to it, we get a traceback: it can't concatenate a string and an integer. But we can convert that string to an integer. int can take a floating-point number or an integer or even a string, and it says, oh, I know what I'm supposed to do with a string: I'm supposed to interpret these as digits and multiply by 10 and figure out what the hundreds place is and all that stuff. There's a little bit of work to that, and it does it, and then it gives us back an integer. And we say, oh, what is that? That's now 123, but it is of type int, and now we can add one to it and get 124. And as before, in this example that we're reusing from a previous chapter, you don't want to (oops, sad face) try to convert something that doesn't have digits using int, because it'll say, I don't know what to do, and then your program quits, right? You don't want your program to stop with tracebacks, and you can of course deal with that with try and except, but that was a previous lecture. Okay, so up next we're going to talk about building our own functions, not just using the predefined ones.
So welcome back. We're going to continue and start talking about building our own functions. Again, we use the def keyword to define a function, and then later we're going to invoke it. And there's a bit to it.
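Looking back at the conversion functions from the first half of this lecture, a rough sketch of those calls might be the following. The specific numbers (99, 42) and the full arithmetic expression are assumptions chosen for illustration; only the two-times-float-of-three step and the '123' string come from the walkthrough.

```python
print(float(99) / 100)             # convert first, then divide

i = 42
f = float(i)                       # the return value is stored in f
print(type(i), type(f))

print(1 + 2 * float(3) / 4 - 5)    # float(3) is called mid-expression: 2 * 3.0 is 6.0

sval = '123'                       # input() would hand us a string like this
ival = int(sval)                   # int() turns the digits into a number
print(type(ival), ival + 1)
```

Trying sval + 1 instead would stop the program with a traceback, because a string and an integer can't be added together.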
We are defining the name of the function. In effect, we're extending Python and creating new predefined things that we can use, except it's our code. It starts with the def keyword, has some optional arguments, which we'll see in a bit (that's what the parentheses are for), and then the name, and function names follow the same rules as variable names. And then you have an indented block with whatever code you want to run, and then you de-indent, and that defines the function. The key thing here is that this is not calling. It's not invoking. It's not executing. It's remembering. It's storing.
So here is the output of a program that defines a function but then doesn't use it, so this is a sort of broken program. Here we go. We start: x equals five, print 'Hello'. You don't have to have all the defs at the beginning; the def runs whenever Python reaches it. So out comes 'Hello', and then we define a function, and this says, oh, you want to make a new thing here? So I'll make a new thing. It's kind of like a variable in a sense. And then it copies this stuff up there and says, later you're probably going to want to use this, so I'm going to remember it. So it doesn't do anything there; no output comes out. Then it says print 'Yo', and out comes 'Yo'. And then it adds two to x, so x is now seven, and then it prints x, and there's the seven. These print statements never ran. Why? Because we did not invoke them down here. We defined them but didn't invoke them.
So let's take a look at how you invoke a function, right? You define it and then you use it. Sometimes you define it once and use it once, but more commonly you define it once and use it more than one time. Again, the store-and-reuse pattern: the def is the store and the invoking is the reuse. So here's a slightly different version of that last program, and now it's going to actually invoke it. So x equals five, print 'Hello', def, def, so out comes
'Hello'. The def produces no output, right? But because there's a de-indent here, that is the entire blob of code that is part of print_lyrics. So it prints out 'Yo'. And now we're going to invoke; this is the call. We're going to call the function. So we're down to here, and now this suspends at this place. It's like: remember to come back here when we're done, go up, run this code, then come back and continue on. So it leaves a bread crumb for where it's supposed to come back to. And then it runs, and print_lyrics of course produces the two lines of output. (And, yeah, that probably shouldn't be there; it should be up there.) And then x equals x plus two, which makes it seven, and then it prints out seven. Okay, so this is the invoking: invoke or call the function. You defined it and then later you called it.
Now, in addition to just call and return, we can pass parameters in. An example of a parameter is in the max function, where we have to say: this is the thing
I want you to find the maximum of, the largest thing. And part of it is that, in the whole store-and-reuse pattern, we have a few lines of code, but sometimes we want to do ever so slightly different things in the different invocations, and so we use the arguments to subtly adjust that. Finding the maximum is a general thing, but saying what thing to find the maximum of makes a function that's much more useful and reusable in a lot more situations. So arguments are the things we pass in, and for the functions we're going to build, we define them on the def statement. So we say def, then greet, the name of the function, and then this is the argument, the thing that's coming in. Now this lang variable in a sense only exists during the life of the function, and it represents sort of a placeholder. It's not a real variable in the same sense; it's a placeholder that refers to that first parameter that's sitting in there. So lang is our first parameter, whatever it is. We don't need to see this part down here right now. All we know is we're going to make a function and take a parameter, and lang is the placeholder that tells us what that parameter is. Okay, so within the function we're going to check to see if the language is Spanish. If it is, print 'Hola'; else if the language is French, print 'Bonjour'; otherwise print 'Hello'. We have a very highly simplified language translation system here. So the def of course does nothing except remember that and define the concept greet. Then that comes down and now we're going to call it, and that says, go look up the thing that I defined called greet. If you hadn't put this def in, greet would give you a traceback, but you extended Python and named it greet. So it suspends the code here and starts up here, but then lang is now an alias for 'en'. So now we can run: if that is 'es'... oh, else if... I'm getting it all wrong now. Right.
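A sketch of the greet function as described so far, in its printing form. The language codes 'en', 'es' and 'fr' and the parameter name lang come from the walkthrough; the exact layout of the slide is not in the transcript, so treat this as an approximation.

```python
def greet(lang):
    # lang is the placeholder for whatever the caller passes as the first argument
    if lang == 'es':
        print('Hola')
    elif lang == 'fr':
        print('Bonjour')
    else:
        print('Hello')

greet('en')    # lang stands for 'en' during this call
greet('es')    # same stored code, different argument
greet('fr')
```

Each call runs the same stored lines; only the value that lang refers to changes between invocations.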
So 'en' comes in as lang and we go through the code. It's not 'es', it's not 'fr', so it hits the else and prints 'Hello', and then it comes back to the next line. Then we call it again, and this time 'es' is lang, and so it runs this code and prints 'Hola'. And the next time it's called with this, and it prints 'Bonjour'. You get the idea. So this is a placeholder so that on the successive calls or invocations of the function we can get at whatever the programmer put in as that first parameter. We are saying in this definition: we are ready to receive a first parameter, please call us with a parameter, and then we will be able to do something slightly different for the different values. So this is a reusable function that prints hello in three different languages, and we tell it what language at the moment that we're actually invoking it.
So that's putting stuff into the function. Getting stuff back out is the concept of returning, the return statement. The return statement is an executable statement that does two basic things. The first thing it does is finish the function. Now, this is a one-line function, so that's kind of redundant, but when Python gets to the return statement, it doesn't continue on to the next line. It just returns; that is the end of the invocation of that particular function. You can say return without a value, and it will stop the execution of the function, kind of like a break does for a loop: get out, we're done, don't run that next line. But even more importantly, it allows you to specify what you want as the residual value in an expression. So we're doing a print, and then we're saying greet, and what's going to show up here is whatever this function gives in its return statement. And so that prints 'Hello'. We call it again, and it prints 'Hello' again. And so basically, that's what the return statement is there for. I call this the residual value.
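The two jobs of return can be sketched as follows. The one-line greet matches the description above; the names passed to print and the second function, first_only, are hypothetical additions that exist only to make the "stops right here" behavior visible.

```python
def greet():
    return 'Hello'            # 'Hello' becomes the value of the call expression

print(greet(), 'Glenn')       # the call is replaced by its return value
print(greet(), 'Sally')

def first_only():             # hypothetical helper for this sketch
    return 'first'
    print('never printed')    # unreachable: return already ended the call

result = first_only()
print(result)
```

The print inside first_only never runs, which is the "get out, we're done" half of what return does.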
It's like what shows up here when the function is all done, and it's the string 'Hello'. We call functions that return a value fruitful functions, because they produce something. But you don't have to; you can just say return, or you don't even have to have a return statement. It goes to the last line of the function and does a return automatically there.
So here's a little rewrite of our language program. We are going to create a greeting function. We take the language as the first parameter, and instead of just doing a print statement, which is what we did before, this is now more like a function, because it takes some input and produces some output as a return value rather than just printing. It's a little tacky for a function to print. And so here we return 'Hola', 'Bonjour', or 'Hello', whichever is the right one. So now we say print, greet 'en'. It runs the code once with lang as 'en', and the residual value is 'Hello', so it says 'Hello Glenn'. Similarly, when it runs this code it passes 'es' in as lang, runs through, and runs this statement. If there were more statements, it still wouldn't run them: as soon as this return runs, that says that this bit right here is now 'Hola'. So 'Hola Sally'. And the same with French: it goes in, runs again, out comes the return value, and then 'Bonjour Michael'. So you see how, as we're writing the function, we can control what the residual value is that we want to see in whatever expression is calling us. Sometimes we have returns and sometimes we don't.
So if you think of the max code that we talked about before, we can kind of see that somewhere inside that max code there's a return, and that's how it communicates the 'w' back to us. So we pass in as the argument
'Hello world'. It comes in as a parameter, inp, and somewhere in there it's going to loop over and over through inp, and at some point it's going to figure something out and tell us what it wants to send back to us with a return statement. And so the 'w' comes back and gets assigned into big.
You can have more than one parameter, and there's just an order, the first one and the second one: three and five, so three becomes a and five becomes b, and away we go. We just use this to add two numbers, and so three plus five is eight. So again, as many as you like, and the order matters. And if you tell it you want parameters and you don't give them, then that'll become a traceback and it will blow up. We can also talk about optional parameters later. You don't have to have return values; that means you simply don't call return with a value, and a return always implicitly happens at the last line of the function.
So that's kind of the basics of how functions operate. But I don't want you to get too excited about writing functions. Some programming classes are like, got to write a function, got to write a function. Functions, to be clear, are a very powerful mechanism, and as we write programs that are 150, 200 lines of code, a thousand lines of code, 10 thousand lines of code, the concept of a function is really important. We would go crazy if we didn't have functions. But if you're only writing 20 lines of code, forcing yourself to write a function is kind of pointless. So don't worry about maybe the lack of an urge to use this. We are calling lots of predefined functions, and we will for the next couple of lectures. There will be a time when you go, oh, I'm sick and tired of repeating myself.
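Pulling together the mechanics covered in this lecture, here is a sketch of the returning version of greet and of a two-parameter function. The function name addtwo and the local variable added are assumptions; the transcript only says that 3 becomes a, 5 becomes b, and the result is 8.

```python
def greet(lang):
    # Returns the greeting instead of printing it, so the caller decides what to do.
    if lang == 'es':
        return 'Hola'
    elif lang == 'fr':
        return 'Bonjour'
    else:
        return 'Hello'

print(greet('en'), 'Glenn')
print(greet('es'), 'Sally')
print(greet('fr'), 'Michael')

def addtwo(a, b):             # hypothetical name; arguments are matched by position
    added = a + b
    return added

x = addtwo(3, 5)              # 3 becomes a, 5 becomes b
print(x)
```

Calling addtwo(3) with only one argument would raise a TypeError traceback, which is the blow-up mentioned above for missing parameters.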
Oh yeah, time to write a function. So that's why we don't push functions prematurely. We just want you to know what they are and use them, and at some moment you'll be like, oh, I want to define one. But don't worry about it; it might take a while before you really want to define a function. So that kind of summarizes our lecture on functions, and up next we're going to do iterations.
Hello and welcome to chapter five, loops and iteration. Now we're going to work on the fourth of our basic patterns: sequential, conditional, store and reuse, and loops and iteration. This is the one where we teach the computer how to do things a lot. We can tell it to do something a million times, and that's where we get the doggedness of computers, the fact that they're so good at doing work for us: we can set them off on a task and they'll do it until it's done.
So here's a very simple loop. (Let's put the coffee over here.) The keyword that we're going to start using is while; we're also going to use for later on. The while loop functions very much like an if statement. The while starts it, and then this is just like an if statement: it's a question that leads to a true or a false answer. Then there's a colon, then there's an indented block, and then we use the de-indent to determine how long the loop is. This print is de-indented, so that indicates the end of the loop. At some level, what's going to happen here is that if this is true, it's going to run this code, and if it's false, it's going to skip the code, and that way it functions like an if. The place where it doesn't function like an if is after it has run the code once: it goes up and asks the question again. So you can think of it going back up to the top of the while loop and re-asking the question: okay, is this going to run again?
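A sketch of the simple while loop being described; the starting value 5 and the 'Blastoff!' text are taken from the walkthrough of this loop, and the rest is a plain reconstruction rather than the exact slide.

```python
n = 5
while n > 0:          # the question: asked before every trip through the block
    print(n)
    n = n - 1         # change the iteration variable so the loop can end
print('Blastoff!')    # de-indented, so this is the first line after the loop
print(n)              # the value n was left with when the question turned false
```

The indented block runs once for each time the question n > 0 comes back true.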
And then it's going to do that some number of times and then it's going to finish, and that's the loop, that's the iteration. We're going to construct, very carefully, a variable that we call the iteration variable, and that's n. It's a variable that's going to change, and it's our way of running the loop but not running the loop forever. So let's just run this. We come in and n is five. Is n greater than zero? Yes, it is, so we're going to run this code. We print out five, subtract one, and then go back up and ask the question: is n greater than zero? Since it's four, the answer is yes, and it runs again. It prints out four, subtracts again, checks, prints three, subtracts again, prints two, subtracts again, prints one, subtracts again. Now n is zero, and so it comes back up, and this question has now become false. So it's going to take the exit, come down, and run this line right here. Then it prints 'Blastoff!', and we can print out the residual value of n just to prove to ourselves that it ran until n was no longer greater than zero, and that zero was the final value for n. We carefully constructed this n: we set it to five, then we carefully subtracted one each time through the loop, and we're using that to control when to exit the loop. So you could think of this loop as running five times: true, true, true, true, true, and then finally false. This question was true for a while, and as long as it was true the loop ran, and when it turned false the loop stopped. This variable that we construct to control the loop is called the iteration variable because it controls how many times this loop is going to run over and over, otherwise known as iterate. So this is a badly constructed loop with an
iteration variable that we didn't handle very well. If we take a look at this, we start with n at five, and this is greater than zero, so it's true. So it runs, and then it runs again, and n is still greater than zero. You can pretty much see that because we're not changing n, this is going to be true, true, true, true forever. And so this is an infinite loop, and it's just going to run until your computer runs out of battery or you hit the button. This is the kind of thing where you often see your computer showing a spinning beach ball or some other indication that it's super busy. It's in some kind of really tight loop, and it's using up all of the processing resources of your computer. That's an infinite loop, and the problem is that we did nothing with the iteration variable.
Now here's a different loop, and this one demonstrates a different idea. In this case we start out with n as zero and it comes in here. Is n greater than zero? The answer is false, so it skips; it doesn't run these lines of code at all. So this loop doesn't run at all, because it comes in, asks the question, gets no, and skips right around it. Sometimes you write a while loop like this on purpose, not quite as simple as this one. But the idea is that this emphasizes that these loops are what we call zero-trip. They are not even guaranteed to run once; they may run zero times, and in this respect the while functions exactly like an if statement, right? The first time through the loop, if it's not true, it's just going to skip right by it. So there are a couple of ways of getting out of loops.
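The zero-trip case can be sketched like this. The printed text and the counter trips are assumptions added so the effect is visible; the infinite version is not shown as running code because it would never finish.

```python
n = 0
trips = 0                   # hypothetical counter, only here to show what happened
while n > 0:                # false on the very first check
    print('inside the loop')
    trips = trips + 1
print('after the loop')
print('loop body ran', trips, 'times')
```

Starting n at 5 with no line that changes n inside the block would give the infinite loop described above, since the question n > 0 would stay true forever.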
Here I'm constructing an infinite loop, because remember, the definition of an infinite loop is that this is going to stay true. Well, True is the constant true, so this is going to run forever. What it's going to do is prompt with a little arrow, let us type, and read whatever we type into the variable line. Then if the line is 'done', we're going to break. Now, break is an executable statement, and if you hit the break, it exits the innermost loop, out to the place beyond the end of the loop. So when this runs the first time and we say 'hello there', line is not 'done', so it prints it: it prints out 'hello there' and then goes up. Then we type again. We type 'finished', and that's not 'done', so it prints it; out comes that print statement. Then we type in 'done', and now this becomes true, and it comes out and runs the code beyond the end of the loop. The key is that it doesn't go back. Once you've done a break, that loop is done. So you look at the block that is the loop, here's the loop block, and the break goes to the line after the end of the loop block. You can think of this as sort of like a hyperspace jump. This could literally be hundreds of lines with if statements, doing all kinds of stuff in all kinds of ways, right? The point is that as soon as you hit a break statement, however much stuff is down here and however much stuff is up here, it exits to whatever the next line is beyond the end of the loop.
Continue is another loop control statement, but it works differently than break. Break says get out of this loop. Continue effectively says stop this iteration, we're done with this iteration. And so continue says go back up to the top of the loop.
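Here is a sketch of break and continue together. So that it runs without a keyboard, the typed lines come from a list instead of input(); the names typed, position and printed are assumptions for this sketch, as is the particular comment line used to trigger continue.

```python
typed = ['# a comment', 'hello there', 'finished', 'done', 'never read']
position = 0
printed = []

while True:                     # infinite on its face; break is the way out
    line = typed[position]      # stand-in for: line = input('> ')
    position = position + 1
    if line[0] == '#':
        continue                # end this iteration, go back to the top
    if line == 'done':
        break                   # leave the loop entirely
    print(line)
    printed.append(line)
print('Done!')                  # first line beyond the end of the loop
```

The last entry in typed is never read, because once break runs the loop does not come back for another trip.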
Yeah, go back up to the top of the loop. And so here we read a line, and if the first character, line sub zero, is a pound sign, we're going to skip it. This is a way for us to make little comments in our typing. Then if the line is done we get out, and otherwise we print it. And so that's why there is no printout here: it comes in, this is true, and the continue goes back up. But then it comes back and prints out the next one and does another thing, and so the loop continues, whereas the break ends the loop. And so again, it's the same kind of notion: whatever complexity you have, wherever you're at in this loop, you hit continue and it doesn't go any further. It goes back up and asks the question, and it might exit the loop in that particular case. But this one here is a while True. This is an infinite loop that I've constructed, except it's not really an infinite loop, because at some point the break gets us out of the loop. So it's an infinite loop with a break to escape it, and that's another common way to construct a loop. So these loops that we've been drawing so far, the ones that use while as their keyword, are what are called indefinite loops. And that's because they keep going until a break hits, or, I mean, as long as some condition remains true. All the ones we've done so far are easy to look at and know that they look pretty good and they're probably going to finish. But sometimes they're long and complex, and their exit or termination conditions are a little more complex. 
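For reference, the continue loop walked through at the start of this passage could be sketched as follows. As before, the function name, the read_line parameter, and the echoed list are assumptions added so the loop can run from scripted lines. Note that line[0] fails on an empty line, which this sketch does not guard against.

```python
def echo_skipping_comments(read_line=input):
    """Echo lines until 'done', skipping any line that starts with '#'."""
    echoed = []                   # hypothetical record of what was printed
    while True:
        line = read_line('> ')
        if line[0] == '#':        # first character is a pound sign
            continue              # abandon this iteration, go back to the top
        if line == 'done':
            break                 # leave the loop entirely
        print(line)
        echoed.append(line)
    print('Done!')
    return echoed


# Scripted demo: the comment line produces no output.
scripted = iter(['hello there', '# do not print this', 'print this!', 'done'])
demo_result = echo_skipping_comments(lambda prompt: next(scripted))
```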
It's not clear that they're really going to terminate. And so we can use while loops for a lot of things, but for most of our looping we're going to use what are called definite loops, and that's what we're going to talk about next. So definite loops use the for keyword, and the idea of a definite loop is it's going to loop through some set of things. It might be a set of lines in a file. It might be a set of characters in a string. It might be a set of strings in a list of strings. But whatever it is, it's going to run a finite number of times, depending on the thing that it's looping through. And we like this. It's an easier way to construct a loop, and we actually don't have to deal with the iteration variable: the for loop includes a mechanism to construct the iteration variable for us. So definite loops iterate through the members of a set. Here's a very simple for loop. You see the for keyword, and in is also a keyword, and the iteration variable is something we put right here. This i works like an assignment statement, and i is going to take on successive values. So i is going to be five the first time through the loop, then four the second time through the loop, then three, two, and one. So i is going to be assigned five different times to five different values, and the loop is going to run once with five, once with four, once with three, once with two, and once with one. And so for this block of code, we have contracted to say: execute it five times with these values of i. i is that iteration variable. i is the thing changing through each iteration of the loop. And so that's why this prints out five, four, three, two, one, and then when it's done, it finishes. So this is a much more direct syntax for looping five times and setting an iteration variable. You kind of combine it all into this one thing. 
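A minimal sketch of the simple for loop just described. The final message is a placeholder, since the transcript does not say what the slide prints after the loop.

```python
# i is assigned 5, then 4, then 3, then 2, then 1,
# and the indented block runs once for each value.
for i in [5, 4, 3, 2, 1]:
    print(i)
print('Finished')   # placeholder message, runs once after the loop
```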
So it's quite nice. And you don't have to be going through a list of numbers. There's all kinds of things that we can iterate through with for. And by the way, while I'm sitting here: I named my variable friends because that's a list of strings, and friend, which is the iteration variable. I'm using singular and plural because it helps you read it. Python doesn't understand singular and plural, so just because you say friends doesn't mean Python knows it's a list. Python does know it's a list, but it doesn't know that from the name of the variable I've chosen. That's your basic mnemonic variable warning. These are cool variable names, but I don't want you to get confused by them. So you can loop through a variable. We're going to take this list of three strings and stick it in friends, and friend is going to iterate through that. So the first time through friend is going to be Joseph, the second time through it's going to be Glenn, and the third time through it's going to be Sally. And so that just says run the indented code three times, and each time the variable friend takes on a successive value that's in the friends list. So it says happy birthday to Joseph, Glenn, and Sally, and then we come out of the loop and we print done. So if we try to draw a picture of what this is really doing, the for loop is actually doing a whole bunch of stuff that we would have to do with separate statements in the while loop. First it decides how many times to run the loop, so it's answering the done question: which way do we go? And it is also moving i ahead. It's managing the iteration variable. It's initializing it too. If you go back to the while loop, we had n equals five, while n greater than zero, and n equals n minus one. So we had like three lines to control the loop and manage the iteration variable. With the for loop, we don't have to do that. 
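Here is a sketch of the friends loop, followed by the countdown written both ways to show the three bookkeeping lines the for loop absorbs. The exact greeting text and capitalization of the names are assumptions based on how the transcript reads.

```python
friends = ['Joseph', 'Glenn', 'Sally']
for friend in friends:            # friend takes each value in friends in turn
    print('Happy birthday:', friend)
print('Done!')

# The countdown with while: three separate lines manage n.
n = 5                             # 1. initialize the iteration variable
while n > 0:                      # 2. decide whether we are done
    print(n)
    n = n - 1                     # 3. move the iteration variable along

# The same countdown with for: one line does all three jobs.
for i in [5, 4, 3, 2, 1]:
    print(i)
```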
And so that's all taken care of. So by using a for loop, it basically says: are we done? No, we have five things to work through. Well, set i to the first one and run it. We're not done, because we've got more, so set it to the second one, the third one, the fourth one, the fifth one, and now we're done. And that is all handled in a single line of code, and that includes the iteration variable and the set of things through which we are going to iterate. I really like the word in. Mathematically, it reminds me of set theory, where you say this is a member of this set, or the for each. Math isn't important here, but if you do know math, the vertical bar means such that, and there's the symbol for is a member of this set, that kind of stuff. I'll erase the math stuff so we don't over-math, but it's like: for each of the values in the set five, four, three, two, one, run this loop, setting the iteration variable i to the members of that set. So for those of us who are math oriented, in reminds me of a really nice concept in mathematics. Okay. Now you could think of this as sort of this looping structure, and this is pretty much how it actually runs inside the computer, right? 
It initializes i, it runs this thing five times, and then it's done. That's one way to think about it. You could also think about it in a somewhat more abstract way, and think of it as: all we're really doing is we have a contract with Python that says we're supposed to run this code five times, and i is supposed to be five, four, three, two, and one. So you could imagine this might be what's going on: the for loop sets i to five and runs our code, the for loop sets i to four and runs our code, the for loop sets i to three and runs our code, the for loop sets i to two and runs our code, and the for loop sets i to one and runs our code. All we know is our code ran five times, and by contract, each successive time we're getting a different value for i, and the value for i is taken from this set. So the looping picture is one way to think about it, and that is how it really works, but this is also, logically, the contract that Python is making for us. So up next we're going to talk about taking this notion of doing something to a lot of items, but accomplishing something with that, and I call these loop idioms. So now we're going to talk about loop idioms, and loop idioms are patterns that have to do with how we construct loops. 
We have the mechanics of fors and whiles, but ultimately we want to get something done. We want to solve a problem with a loop. And often, if we have a set of things, whether it's lines or strings or characters or numbers, we're looking for something like the largest or the smallest, or we want to add them up, or something like that. And we can't just say add them up. We have to say go through each one, do something to each one, and somehow achieve adding them up. And the pattern that we're going to follow is we're going to have this loop that's going to run once for each thing in some chunk of data. But we're going to set something at the beginning, then we're going to do something to each one, and at the end we're going to get the payoff. We're going to get the result. So if we're summing things, we're going to have a running total. So this will be like t equals zero, and then this will be t equals t plus the value of the thing. This is not the real total. It's the running total during the loop. But at the end it is the real total. And so we're going to look at what you do before the loop starts, what you do during the loop, and then what you get after the loop and how you can use that. So we're going to use this loop over and over again. It's just going to loop through a set of six numbers, right? We're going to do something before the loop and something after the loop, and then we're going to run the loop some number of times. And in this case thing is our iteration variable, because I'm using non-mnemonic variables now. So this is going to run with 9, 41, 12, 3, 74, and 15. It's going to run and print these things out. So it runs this loop six times, and away we go. 
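The bare skeleton described here, something before, something for each item, something after, might be sketched like this. The Before and After labels are assumptions about what the slide prints.

```python
print('Before')                          # set up: runs once, before the loop
for thing in [9, 41, 12, 3, 74, 15]:     # runs once per number, six times
    print(thing)
print('After')                           # payoff: runs once, after the loop
```

Every idiom that follows fills in these three slots with something more useful than a print.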
Now this loop does nothing except print stuff out. Of course, what I like to do first is always print things out, to make sure that my brain is functioning. So to understand how these loops work, I'm going to ask you to function as a program. I'm going to show you some numbers in succession, and I want you to mentally figure out what the largest number is. But more importantly, think about how your brain is solving this problem of what is the largest number, given that I'm only going to show them to you one at a time, each for a little while. Your brain has to do something. And imagine I was going to show you thousands of numbers. I'm not, but imagine I was. How would you organize yourself so that for like an hour and a half you could sit here as I showed you numbers and keep track of the largest number that you've seen of all the numbers? Okay, so here we go. Here's your first number, second number, third number, fourth number, fifth number, sixth and last number. What was the largest number? Hmm, what was it? Well, it wasn't too hard. It was 74. But that's not the question. How did your brain arrive at 74? So here are all the numbers. If I had shown you all the numbers at once and asked you what's the largest number, your eyes would have sort of roamed around and then gotten to 74, and you wouldn't do it in any particular order. Your eyes would just see the 74 and throw smaller numbers away, and you would move really quickly to what the answer is. Even if there were several hundred numbers on the screen, your mind would move fluidly wherever it felt like moving and then arrive at it. And probably what it would do is something like move like this, find this, and then check to make sure that it's okay. And then you'd say, okay, I've got 74, I'm done. But that's not how computers do it. That is not how computers do it. They do not move fluidly, but they are highly dedicated. 
They're going to do something very specific. So it was 74, but how would you construct a loop to achieve this? So let's take a look. You could create a variable called largest so far, and this is the largest value that you've seen in the list so far. And I haven't shown you any numbers yet, so we'll just set this to negative one to get us started. So now we see 3, and we're like, oh, that's better than negative one. It's our first number, so it's probably the largest we've seen so far, right? Great. 41: oh, that's bigger than the largest we've seen so far, so we'll keep it. 12 is not bigger than 41, so we're not going to keep it. Notice this keeping thing. 9 is not bigger than 41, so there's no point in keeping it. 74 is bigger than 41, so we'll keep it. Is this the largest number? We don't know. We don't know until we're done. 15: not better than 74. So now we're all done, and hurray, hurray, hurray, we have the largest number. We had this variable in which we kept the largest number that we'd seen up to this point, and then when we know that we're done, at the end, that becomes the largest. So if you look at all the numbers, keeping track of the largest so far, then at the end of all the numbers the largest so far and the largest are the same thing. And so that's how you get this idea that something you're doing during the loop is not really the answer, but by the time the loop is done, you will have the answer. And so here's a bit of code that does this with our numbers. So let's take a look. I have this variable called largest so far, and I set it to negative one before the loop. Remember, there's a before the loop, an after the loop, and the loop in the middle. Before, it's negative one. So now the num, and remember, underscores are okay, that's my iteration variable. Is nine greater than largest so far? Well, largest so far is negative one, so that's true. So this code's going to run, and we're going to remember the new number. 
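The code being traced here is on the slide rather than in the transcript, so the following is a sketch consistent with the walkthrough. The variable names follow what is said aloud (largest so far, the num). The printed labels are assumptions.

```python
largest_so_far = -1                       # starting guess, before any number is seen
print('Before', largest_so_far)
for the_num in [9, 41, 12, 3, 74, 15]:
    if the_num > largest_so_far:          # is the current number better than our guess?
        largest_so_far = the_num          # remember the new number
    print(largest_so_far, the_num)        # watch the guess ratchet upward
print('After', largest_so_far)            # the guess has now become the answer
```

The starting value of -1 only works because every number in this list is positive, a weakness the lecture returns to when it looks for the smallest value.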
So this is nine, and so nine ends up in largest so far, and then we print it out. So largest so far is nine after we saw the number nine. Then we do it again. So now 41 comes in, and is 41 greater than nine? The answer is yes, it is. So we're going to run this code, copy 41 into largest so far, and then print it out, and largest so far is 41 after we saw the number 41. Now we're going to run the loop again with 12. Okay, and you get the idea, I hope. Is 12 greater than 41, which is the largest we've seen so far? The answer is no, it is not, so we skip. So the largest so far stays 41 even though we saw 12. I mean, we're sort of ratcheting up, but we never ratchet back down. So we run it again with three and 41, and we skip this, and the largest so far is 41 even though we just saw three. And now we see 74. Is 74 greater than 41? See, we never are looking at all the numbers. We're only looking through a window at the current number. So is 74 greater than 41? The answer is yes, so we run this code and we capture the 74. We just saw 74, and it is the largest so far. And then we run it again with 15, but 74 is our largest so far, and so it skips. So 74 remains largest so far after 15. And now we're finished, because we just ran the last thing. The for loop takes care of everything and jumps to this print statement, which says afterwards largest so far is 74. But at this point it's also the largest, right? So largest so far became largest when our loop finished. So that gives you this notion of how we construct something at the beginning, some kind of thing that we're going to do over and over again, and then something at the end, and we put some print statements in just so we can watch it and see what's going on. So coming up next we're going to talk about some more loop patterns: counting, totaling, averaging, and finding the smallest number. So now we're going to look at some more patterns of the different things we 
can do at the top of the loop, in the middle of the loop, and at the bottom of the loop. And the first one we're going to do is counting. We're going to take a look at the number of things in our list. Now we could just inspect it and see six, but you will have for loops where you're reading through a file or scanning through some data, and so you need the notion of counting. You have to assume that you don't really know, you know, dot dot dot, that there's going to be a lot more than just six. But for now we're just going to do six, and we're going to count how many things we see in this loop. And the pattern is simple. You set a variable zork to zero at the beginning. We'd often call this variable count, to be mnemonic. And now we're going to run this loop six times, one, two, three, four, five, six, and each time through we're just going to add one to zork. So zork starts at zero and then it goes one, two, three, four, five, six, and we're going to print it out. So we see the 9 and zork is one, we see 41 and zork is two, and in the end zork is six when we see the 15. The for stops, and we print out afterwards, and six is then the ultimate count that we got. So that's very, very simple. The pattern is: set it to zero at the beginning, add one to it each time, and if you run that enough times, then this is how many times that happened. In a sense it's how many times this line ran, right? Sometimes you put this in an if statement, et cetera, et cetera. Okay. Now we can do the same thing to get a total. And the way the total works is you compute a running total of the items that you've seen so far, and at the end the running total in effect becomes the total. A better variable name for this would be like sum or total or something, but I'll use zork again. So you set zork to zero, and it starts out that the total we've seen so far is indeed zero. And then we're going to run this one, two, three, four, five, six times, and thing is going to be the 
iteration variable. It's going to take on the successive values, and each time through we're just going to take our running total and add to it the thing we've seen. So we see 9 and the running total is 9. We see 41 and the running total becomes 50. We see 12 and the running total becomes 62. We get a 3 and it becomes 65. We get 74 and the running total is 139. How many more are we going to see? We don't know. It could be a million, it could be one. Oh, it's only one. We get a 15 and our running total is 154. And what's true at any moment here is that the running total is right for what we've seen so far. Now when we're done, the for loop quits for us, and afterwards 154 is indeed the total. So there's a running total while we're in the loop, and after the end of the loop we have the actual total. So it's not very difficult to convert this to the average, because we've calculated the count and we've calculated the running total, and now we're going to get the average by simply dividing those. Okay, so now this time I've used mnemonic variables. Don't get confused by this. Mnemonic variables are just friendly names I chose so you can read the code more easily. I am not communicating to Python in any way by naming these count and sum, but count and sum are nice. Okay, so I set count to zero and sum to zero. 
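The counting, totaling, and averaging patterns fit together in one loop, sketched below. One deliberate difference from what is said in the lecture: the running total is named total rather than sum, because sum is also the name of a Python built-in function and reusing it would hide that function. The printed labels are assumptions.

```python
count = 0                                 # how many values we have seen so far
total = 0                                 # running total of those values
print('Before', count, total)
for value in [9, 41, 12, 3, 74, 15]:
    count = count + 1                     # counting idiom
    total = total + value                 # totaling idiom
    print(count, total, value)            # running count and running total
average = total / count                   # only meaningful once the loop is over
print('After', count, total, average)
```

If the list could ever be empty, count would still be zero at the division, so real code would need to check for that first.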
Oh, go back up. I set count to zero and sum to zero at the beginning, so the count is zero and the sum is zero. And then I'm going to run this loop six times, one, two, three, four, five, six, and each time value is the iteration variable. I count every time I run the loop: count equals count plus one, and sum equals sum plus value. So I have a running count and a running total, and they show up here: one, two, three, four, five, six, and then the running total. And then at some point we do the last one and the for loop jumps out. Six and 154 are the count and the running total, and then it computes the average, sum over count. Okay, so that's just again a pattern of something in the beginning, something in the middle, and something at the end. Another kind of thing we tend to do in loops is we look for things, we hunt for things. And so this is where we have an if statement inside of a loop. And of course I've created a silly, simple thing. In this code I am looking for large values, that is, values that are greater than 20. And again, don't think of this as just six numbers. I'm looking for all such values, and I'm going to print them out. So it says before, and it's going to run this with 9. Well, 9 is greater than 20? It's false, so it goes back up. 41: true, so it prints out 41, then goes back up. 12: false, goes back up. 3: false, goes back up. 74: true. So it runs this. 
So out comes that little print statement, and it goes back up. And then 15 is the last one, and that's false. It goes back up, the for says we're done, and then we print afterwards. And so this is just the notion of having an if statement inside of a for loop, where we're picking or choosing or selecting or looking for something in a large set of things that we're looping through. We can also say I want to know if a particular value is there. And so we're going to use a Boolean variable. We've talked about integer variables like 142, and floating point variables like 98.6, and string variables like hello world that have quotes around them. This is a fourth type of variable. It's called a Boolean variable, and it only has two values: it has True and False. As a matter of fact, the conditions in these if statements produce Boolean values. Value equal equal three: that is returning a True or a False based on the value of value. There's a bit of mnemonic confusion there, right? So I'm going to make a variable called found, and that's a decent name for a variable, so don't get hung up on that. Found is going to indicate to me whether or not I found a 3 in my list. And before the loop starts I'm going to say False, because we haven't found anything yet. So found equals False, and so before the loop starts found is False. And now we're going to run this loop a bunch of times. 9: is that true? No, skip. 41: is that true? Skip. 12: skip. All right, so 9, 41, 12, and found has remained False because we haven't done anything to it. But now in comes a 3, and this becomes true, so it runs this code. So found becomes True, and then we print it, and you'll notice that when we see a 3 we get True. And then it runs again. We get 74, and the comparison is false, so found stays True. 15. 
That comparison is false too. Run, run, run, quit, and the residual afterwards is True. And in fact, if you didn't print any of this out, all you'd know is that afterwards we looped through all those things and we know that there was a 3 in there. That's what we're doing. We searched all of them, checking for 3s, until we found a 3. And you can see that found remains False until it flips to True, but then there's nothing to set it back to False. There's nothing in this loop that's going to set it back to False. So once it catches the 3, it remains True for the rest of the loop, and then the loop just finds its way out. Now if you want to think about it for a moment, ask yourself: how might we make this loop more efficient by putting a statement right in here? Think about it. Once you've found it and it's True, there is no reason to keep on going. So what would you put there to make this loop, which looks for 3s just to tell you whether or not there was at least one 3 in there, more efficient? Just think about that. Okay, so now let's look back at the largest value loop that we started out with. Let's give it a rough look here. Largest so far is kind of like a running total, but it's our hypothesis of the best large number. And we have this if statement that says: if the number we see right now is greater than the largest so far, then capture it. Take whatever number we saw and capture it. So when we see 9, it's better, we capture it. We see a 41, it's better, we capture it. We don't capture this, we don't capture this, we capture the 74, and we don't capture the 15. And that's how we do it. So you could think of this as better: when the number we're looking at is greater than our working hypothesis of the largest, we grab it because it's better. 
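Before going further with the largest value, here is a sketch of the two search loops from a moment ago: filtering for values over 20, and the found flag for a 3. The printed labels are assumptions, and the efficiency question is deliberately left unanswered in the code so the exercise still stands.

```python
# Filtering: an if statement inside a for loop picks out the large values.
print('Before')
for value in [9, 41, 12, 3, 74, 15]:
    if value > 20:
        print('Large number', value)
print('After')

# Searching: a Boolean flag records whether a 3 was ever seen.
found = False                             # nothing found before the loop starts
print('Before', found)
for value in [9, 41, 12, 3, 74, 15]:
    if value == 3:                        # the comparison itself yields True or False
        found = True                      # nothing ever sets this back to False
    print(found, value)
print('After', found)
```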
So this line right here is the grab line. Grab it. Okay, so then the question is: how would you modify this code to teach it to find the smallest value in this list of numbers? Think of it as you have a starting number, and you have a sort of what's better in this grabbing notion. How could you do that? Take a look. Okay, so let's do a couple of things. If you look at this if statement, that's the better test. Well, it's better now if the number is less than, and then we should probably change this to be smallest so far, everywhere it appears. As a matter of fact, that's what this is. We've changed the words largest so far to smallest so far, and we've changed the greater than to a less than. Is that going to fix it? We'll give you a second to look at it. Pause if you need. Is that going to find our smallest number? The answer is, of course, no, it's not. So if we run this code, we set the smallest so far to negative one, and it starts out negative one. We run it, and it's 9. Is 9 less than negative one? No, it's not. So after we see a 9, the smallest so far is negative one. Now we're going to run 41. Is 41 less than negative one? No, it is not. So smallest so far is still negative one. As a matter of fact, it isn't the smallest so far anymore. Just because we named it smallest so far doesn't mean it is the smallest so far. It didn't work out so well. And so you see that none of these do anything, because they're never less than negative one, and we claim that afterwards the smallest we've seen so far is negative one. And that is because, of course, negative one is smaller than any of the numbers that we saw. So how could we fix this? 
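The broken version just traced can be sketched as below. It is the largest loop with the name changed and the comparison flipped, and it is shown here precisely because it gives the wrong answer. The printed labels are assumptions.

```python
smallest_so_far = -1                      # the bug: -1 is already below every value
print('Before', smallest_so_far)
for the_num in [9, 41, 12, 3, 74, 15]:
    if the_num < smallest_so_far:         # never true for these numbers
        smallest_so_far = the_num
    print(smallest_so_far, the_num)
print('After', smallest_so_far)           # reports -1, which was never in the list
```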
Well, if we started the smallest so far with some arbitrary big number, then it'd be better. So if we made this be like 100, that'd be good, because the first time through the 9 would be less than 100, so we would capture the 9, and then the rest of the loop would work just fine. But then what if we didn't know how big these numbers were? As a matter of fact, the largest so far wouldn't have worked if all the numbers were negative. Think about that. We just assumed they were positive, and so we kind of wrote lazy code that assumed all numbers are positive. That might not be a good assumption, depending on the numbers that you're dealing with, right? So maybe a hundred's a good number to start with, or maybe a thousand, or 10,000, or some number with lots of zeros in it. Well, how big should we make this? And the answer is we're solving this problem the wrong way. The thing we really want to do is just accept the fact that if we're looking for the smallest number so far, the right hypothesis is the first number. If we just knew what that first number was, the 9, then because it's the first number, we know that it's both the largest so far and the smallest so far as soon as you see it. But we don't know, here before the loop starts, what that first number is. I mean, you can look at it, but assume this is just data that's coming from somewhere else and we don't know it until we start reading it. So we have to construct a loop that deals with the fact that we want to capture the first value as our hypothesis for smallest so far. So how do we do that? Let's take a look. So what we do is we use yet another type. 
So we have integer, floating point, string, Boolean, and now we have a thing called the None type. None type is a special marker in that it only has one value. Boolean has True and False, and floating point and integer each have an enormous number of possible values, but None type has one value: None. None, with a capital N, is a constant. The difference is that we can check to see if we have stored None. None is often used to indicate emptiness, not non-existence, because smallest doesn't exist until we assign it, but we're going to assign it a mark, a flag, a marker, some way to say, oh, this is not even a number, it's nothing. And so you can do this. That makes a variable called smallest, and in it puts None. It sticks it right in. It's not the string None. It's a special type. Okay, so that actually captures the notion that before the loop starts, the smallest number that we've seen so far is None. We haven't seen any numbers. Okay, so then we come in, and we have an if statement, and we have a new operator called is. Is is stronger than the double equal sign. And so if smallest is None, that becomes true and it runs this case. And what it does is copy this first value, which is 9, into smallest. So we see a 9 and the smallest so far is 9, which is the first value. And again, we're assuming we don't know what the first value is before the loop starts, so we use the first iteration through the loop as the moment where we capture that. Okay, so smallest is the value, and then we print it, and we go back up, and now it runs again with 41. Smallest is 9 now, and there's only one thing that's None, so smallest is not None. So this is false, and it skips over here. Then it asks the question: is the value we're looking at, 41, less than smallest? Well, smallest is 9 in this case and this is 41. So that's false. 
So it skips that and goes on. So we see 41 and we don't take it. And then you can see that this first test will never become true again. It's false for the rest of the iterations of the loop. So it's just going to run down here and ask this question. And at some point we see a 3 and we run this code, and we capture it. We see 74, we don't capture it. We see 15, we don't capture it. So then the for loop skips out at the end, and we have the smallest. And actually this would be a good technique for the largest as well, because it really is just a technique to put a marker in this variable so that we snag that first number, or first whatever, as we read and parse through them. So the is and is not operators are very useful in Python. You can think of them as like the double equal sign. They're asking a question, and blank is blank returns a True or False. But is is stronger than double equals. Double equals asks whether two things are equal in value. So just as an example, if I were to ask whether zero is equal equal to zero point zero, it would say yeah, that's true. But then if I say zero is zero point zero, that would be false. That's because these two are the same value-wise, but they are not the same type-wise. So is is stronger than equals, meaning that the two sides have to be the very same object, which rules out anything that differs in type, and no conversion is done. And so that's just very strong. Don't overuse is. If you're dealing with numbers or even strings, use double equals, don't use is, because sometimes it gets a little confusing. So use is sparingly. I tend to only use is on Booleans and on None. I don't use is on integers, I don't use is on floats, and I don't use is on strings, just None or True and False. And is not is also an operator. 
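Putting the None marker and the is operator together, the smallest-value loop might be sketched like this. The printed labels are assumptions. The last few lines show the zero versus zero point zero comparison using two named variables, which are hypothetical names used so the comparison is between stored values rather than literals.

```python
smallest = None                           # marker: no numbers seen yet
print('Before', smallest)
for value in [9, 41, 12, 3, 74, 15]:
    if smallest is None:                  # true only on the first trip through
        smallest = value                  # the first value is our first hypothesis
    elif value < smallest:                # afterwards, the usual "is it better?" test
        smallest = value
    print(smallest, value)
print('After', smallest)

# == compares values, with conversion between int and float.
# is asks whether both sides are the very same object.
int_zero = 0
float_zero = 0.0
print(int_zero == float_zero)             # equal in value
print(int_zero is float_zero)             # not the same object
```

The same marker technique works for the largest value, which removes the hidden assumption that all the numbers are positive.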
So you just say, like, blah blah blah is not None, or blah blah is not False. Okay, so we've been looping around and doing loops and loops of loops. We looked at the indefinite loops, the while loops, that run for a while, and the definite loops. We looked at break and continue as a way to either escape completely from the loop or go back up and discard the current iteration of the loop. We looked at None. We looked at Boolean variables. With for loops, the definite loops, you've got some kind of a set or a list or some kind of sequence that you're looping through. And then there's the concept of loop idioms, where you do something at the top, something to each item, and then you get a benefit at the bottom. And so that gets us through iterations. Hello and welcome to chapter six. In this chapter we're going to talk about strings, and chapter seven is the payoff chapter. So up to this point we're still learning basic building blocks, and we're going to write a real program in chapter seven. So just learn this, and the payoff's in chapter seven. So we've actually been using strings from the very first lecture, because if you print hello world, well, that's a string. And so this slide here is all review. We use plus to concatenate strings. We use print to print them out. Print's just a function that takes as a parameter something: strings, integers, et cetera. We can put digits in strings, but we can't add numbers to them. By now you've figured this out, but you can use things like int to convert the strings to integers and then print things out. 
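The review slide itself is not in the transcript, so this is only a sketch of the points just listed: plus concatenates strings, a string of digits cannot have a number added to it, and int converts it first. All variable names and values here are illustrative assumptions.

```python
first = 'Hello'
second = 'there'
greeting = first + ' ' + second           # plus concatenates strings
print(greeting)

digits = '123'                            # digits inside quotes are still a string
try:
    digits + 1                            # string plus integer is not allowed
except TypeError as error:
    print('Traceback would say:', error)

x = int(digits) + 1                       # convert to an integer, then add
print(x)
```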
So we've been doing this for a while; we've been talking about strings all along. Today what we're going to do is get into strings in more detail. We read input data with the input function, and input returns us a string. If we want to input a number, we have to run some kind of conversion, like int, on the data that we read from input. We've been doing all these things in programs so far.

But if we look in a little more detail inside strings, we can index each character within a string. Each character has a separate position, a separate index, and the positions start at zero. The best way I can explain this so that you remember it is the elevators we used in one of our examples a long time ago: elevators in Europe start at zero, and so strings start at zero as well. It turns out that in the old days there was some efficiency in lists of things starting with zero. These days efficiency isn't the issue, but there's a certain elegance to starting at zero. Intuitively you might think the first character in the string would most sensibly be sub one, but it's not; it's sub zero. Just remember that.

So we have this operator called the index operator, and it's square brackets. fruit is a variable that contains the string banana, and fruit sub one is the character that's in position one. That is actually the second character. I'll keep reminding you until I get tired of reminding you.
So that assigns the letter a into the variable letter. Of course, that's either a well-chosen variable name or a badly chosen variable name. The thing that goes inside the brackets can either be a constant or an expression. So this is x equals three, and then fruit sub x minus one; well, that means two, which is position two, which is an n, and so that gives us an n. So the index is an operator, and you can add this bracket syntax to the end of a string variable.

You can't index beyond the length of the string. If I say zot sub five, well, there are only three characters, which means positions zero, one, two. Sub five doesn't work, and of course we get a happy little traceback. So you have to be careful when you're starting to pull stuff out of strings. Some operations allow it and some of them don't, and you'll get used to that.

We can ask how long a string is. We use the len function: we pass the string variable into len as a parameter, and len gives us back the length of the string, not the last position. So the positions are zero through length minus one. len is just another function. We've been doing functions for a while now: you pass in a parameter, len does some work, and out comes six, and that goes back into x because the function has a residual value. It just happens to be a built-in function. Somewhere deep inside Python there is code that takes this; somebody wrote a loop or looked something up, and then returned a value and sent back a six to go into our x variable. Like I said, we've been using this for a while.

Another thing we tend to do is to loop through strings, look at strings, and dig data out of strings. Python is excellent for these kinds of lookups. And so we can write a simple loop.
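The indexing and len behavior described above can be sketched like this. The string abc stands in for the three-character zot string, whose exact contents the lecture doesn't spell out.

```python
fruit = 'banana'
letter = fruit[1]         # position 1 is the second character
x = 3
other = fruit[x - 1]      # an expression inside the brackets: position 2

zot = 'abc'               # assumed three-character string
try:
    zot[5]                # only positions 0, 1 and 2 exist
except IndexError as err:
    print('Index problem:', err)

length = len(fruit)       # the length, so valid positions are 0 through length - 1
print(letter, other, length)
```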
We can write a while loop that uses an iteration variable like index. Given that we know that these positions are zero through five, we can set index to zero and then write a while loop: while the iteration variable is less than the length of fruit. Remember, the length is six, so index is going to be zero through five, and zero through five are the very values we want to generate. Then we can look up one at a time and pull out fruit sub index: fruit sub zero, fruit sub one, two, three, four, five. We print out the position and the letter, and then add one to index. This will run six times, zero through five, and produce this output right here. So that's one way of looping through strings. It is a basic indefinite loop, but we carefully construct an iteration variable and work our way through the data.

The other way is to use a definite loop, a for loop. When we are able to use either a while loop or a for loop, all else being equal, we generally prefer a for loop. Here we have the for keyword and fruit, and in between is an in: for letter in fruit. That just says letter is our iteration variable, and it's going to take on the successive values of each of the characters. So this loop is going to run six times, and letter is going to be b, a, n, a, n, a. I'm always terrified when I make these slides that I'm going to misspell banana, because somehow I always think that there are two n's together somewhere. I don't know; it's not one of my favorite words to spell. I actually didn't choose banana as the constant. The authors I borrowed the textbook from, Allen Downey and Jeff Elkner, used banana, and so I'm still using banana. So some of the jokes in my book aren't my jokes; they are the jokes of Jeff and Allen. So here are just two equivalent loops: you can have the while loop or the for loop.
They both do the same thing: they both just print the letters out, one each time through, and each of these loops runs six times. But you can see how the definite loop, the for loop, is a prettier loop. Unless you truly somehow need to know the position number as you're going through the loop, if all you're doing is touching each of the characters of the string in order, you simply write a for loop, because it's more elegant. The less code you write, the less chance there is for you to make a mistake. These are equivalent, but this is two lines of a loop and this is four lines of a loop. That's twice as many places where you could make a mistake, because you might misspell index or something. Why even make an iteration variable if you don't need one?

And so we can do things that hark back to our iterations and loops chapter. Anything you can do there you can do here: look for the largest letter, look for the smallest letter, search to see if a letter exists, or, say, count the number of a's in the word banana. And that's what this is doing. We have a counter. So again, we do something at the top of the loop, something in the middle of the loop, and then we print it out at the bottom. We start our counter at zero, we loop through all the letters, and if the letter is a, then count equals count plus one. This is a pattern in a loop where we're noticing something, but instead of what we did earlier, where we said found equals True, we're going to count them this time. If there is one we'll get one, if there are none we'll get zero, and so on for however many there are. Here there should be three, because the counting line is going to run three times: there are three a's in banana. So this is a conditional within a counting loop. We've seen counters.
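As a sketch of the two equivalent loops and the counting pattern just described:

```python
fruit = 'banana'

# Indefinite loop: we build and advance the index ourselves.
index = 0
while index < len(fruit):
    print(index, fruit[index])
    index = index + 1

# Definite loop: for hands us each character in turn.
for letter in fruit:
    print(letter)

# Counting pattern: start at zero, test each letter, report at the bottom.
count = 0
for letter in fruit:
    if letter == 'a':
        count = count + 1
print(count)
```

The while version is only needed when the position number itself matters inside the loop body.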
We've seen conditionals in loops in prior chapters. And again, I love the in keyword in Python. It reminds me of set notation in algebra. If you're a math whiz, fine; if you're not, don't worry about it. Or maybe you will be a math whiz someday and you'll say, well, this set notation reminds me a lot of the in keyword in Python. So again, it's for, then the iteration variable letter (don't get stuck on the name letter, I just happen to be using it here), then in banana. That says: for each character in the string banana, run this loop once, changing the variable letter to be the particular character that we're pointing at. So for is taking care of a lot for us. This is a really smart loop. The for loop is both deciding how many times to run the loop, in this case six, and advancing the letter. So: advance, print, decide whether you're done; advance, print, decide whether you're done; and so on, six times, until: I am now done, because we're done with that particular string. You can think of the for as magically doing all of this for you, both deciding how long to run the loop and whether you're done or not, and moving down through all the successive letters in the string. So next we'll talk a little bit about additional things that we can do with strings.

So now we're going to dig into strings a bit. We've already looked at how you can pull out a single character in a string. Now we're going to look at what we call slicing, and that is pulling chunks of a string out. Again, we're going to use the square bracket operator. The way I say it is s sub zero through four.
That's how I read this: s sub zero through four. I read the colon as "through" and I read the brackets as "sub". So s sub zero through four says start at position zero and then go up to but not including four. So we don't include four. That's probably the hardest part of this: up to but not including. This seems counterintuitive, kind of like starting at zero seems counterintuitive. But after a while you get used to it, and there'll be situations where you're writing code and you say, oh, that's why that works better. For now just remember: up to but not including. We'll come back to when that is useful for us. Six through seven: well, that ends up being starting at six, up to but not including seven. So that's why we only get the P out.

Now one thing that Python is pretty nice about is that it's not going to give you a traceback here. We might expect one for six through 20, since there is no position 20, but Python says, ah, that's okay, we'll start at six and just let you stop at the end. No traceback. It's almost disappointing sometimes when Python doesn't trace back when you think it would. Python is so strict about everything else that I would have expected a traceback in that situation. But hey, if you're allowed, you're allowed, and so there we go.

Now you can omit the first or the last number. If you omit the first, it assumes the beginning of the string; if you omit the second, it assumes the end of the string. You can even omit both. And why you would do this?
I don't know, but that's from beginning to end, so it's the whole string. Eight through the end is thon, and up to but not including two is Mo. So that's pretty simple. Once you've got the rest of slicing and the rest of string indexing, the notion of omitting the first or the second number of the colon expression is, I think, actually pretty intuitive, pretty nice.

We've already been concatenating strings together. We overload the plus operator, and there is no space added. Remember that when you do print x comma y, the comma does turn into a space. But that's not what's happening here; there is no automatic space being added. So we have Hello and There, and it just comes out as HelloThere with no space. If we want to put spaces into strings, we just have to concatenate the space explicitly. You might think it would be more convenient for concatenation to add a space, but then you have to think: what if I want to concatenate things and not put the space in? Then I'd need a different operator. So that's kind of why it works that way.

We can use in differently, as a logical operator. We've been using in as an iteration structure in for loops, but we can also use it as a logical operator in if statements. It's kind of like double equals or not equals or less than or something like that, and it returns True or False. Is n in fruit? That's a question, and the answer is True. Is m in fruit? The answer is False. Is nan in fruit? It doesn't have to be a single character; it can be more than one character. And the answer is True. Then you can say something like if a in fruit, and this is a logical expression that returns True or False. And yes, we found it. So that becomes True in this particular case.
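A short sketch of slicing, concatenation, and in as a logical operator. The sample string Monty Python is an assumption: the lecture doesn't name the string on the slide, but this one is consistent with the results quoted (P, thon, Mo).

```python
s = 'Monty Python'        # assumed sample string
first = s[0:4]            # positions 0, 1, 2, 3: up to but not including 4
single = s[6:7]           # just position 6
lenient = s[6:20]         # runs to the end; no traceback for going past it
start = s[:2]             # first number omitted: from the beginning
rest = s[8:]              # second number omitted: through the end
whole = s[:]              # both omitted: the whole string

joined = 'Hello' + 'There'           # no space is added automatically
spaced = 'Hello' + ' ' + 'There'     # the space has to be concatenated explicitly

fruit = 'banana'
has_n = 'n' in fruit      # True
has_m = 'm' in fruit      # False
has_nan = 'nan' in fruit  # True: more than one character is fine
if 'a' in fruit:
    print('Found it!')
```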
So it runs the little indented bit. So in is a logical operator in this particular situation; in a for loop, in means something different. And we'll use in for other things as a logical operator coming up in a bit.

You can compare strings, and this has to do with the character set of your computer, the character set that Python is using. In general it is lexicographically less than and lexicographically greater than. Uppercase and lowercase are a little weird. I think when we used the max function earlier, the way my computer was set up, uppercase was less than lowercase. In general uppercase is less than lowercase, but it's a bad idea to assume things about case. There is, though, a deterministic way to sort strings. You can test whether something is equal to, less than, or greater than, and all those operations work. With less than and greater than you have to be aware of uppercase and lowercase, and of things like whether punctuation sorts less than or greater than letters. That's hard to predict; it depends on the character set and is something you just play with and figure out. If you're sorting by first name and last name, things work as long as the case is the same. If you were sorting Chuck and Glenn, both with uppercase first letters, they'd sort right, and both in lowercase would sort right. But if you had lowercase chuck and uppercase Glenn, then that would sort weird; as a matter of fact, the G would come before the c. So case can mess this up. But in general, other than case and special characters and other things, it technically works.
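A small sketch of the comparisons just described. In Python 3, strings compare character by character using Unicode code points, which is why every uppercase ASCII letter sorts before every lowercase one. The key=str.lower idiom at the end is not from the lecture; it is a common way to sort while ignoring case.

```python
both_upper = 'Chuck' < 'Glenn'     # same case: alphabetical order holds
both_lower = 'chuck' < 'glenn'
mixed = 'chuck' < 'Glenn'          # False: 'G' comes before 'c'

names = ['chuck', 'Glenn']
by_code_point = sorted(names)                # uppercase first
ignoring_case = sorted(names, key=str.lower) # compare lowercase copies instead

print(both_upper, both_lower, mixed)
print(by_code_point, ignoring_case)
```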
It's just hard to predict.

A lot of what we do is use the string library. Strings are objects, and we'll talk later about what that really means. Objects have these things we call methods. So a string object has some built-in capabilities, and here is one of them. greet is a string object; if we asked for its type, we'd see that it was an str. greet dot lower says: hey, dear string, make a lowercase version of yourself and give that back to me. It's like calling a function lower and passing greet into it. Now, it doesn't actually change greet; it gives me a lowercase copy. So here I have Hello Bob with the H and the B uppercase, and what I get back in zap is hello bob, all lowercase. And note that greet is unchanged, so Hello Bob is still there. You can even call these methods on constants. So this is a string object, quote Hi There quote, dot lower, and that says: call lower on this bit of string and give me back a lowercase version of it. And it prints out as the residual return value.

A method call is a special form of a function call. It's a function call where you say the thing, dot, the function name, rather than the function name with the thing passed in as a parameter. len, for example, is not object-oriented: len of x. Object-oriented would be x dot something parentheses. So constants are objects as well, and taking the lower gives us back lowercase hi there. That's just one of the things that you can do in the string library. These methods are built into string variables and constants. They're always there as soon as you make a string; they're part of it. That's why, when you ask for the type, it says class str. We'll get to object-oriented. Don't worry.
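The method call described above, sketched with the same variable names the lecture uses (greet and zap):

```python
greet = 'Hello Bob'
zap = greet.lower()          # ask the string for a lowercase copy of itself
print(zap)
print(greet)                 # the original is unchanged

hi = 'Hi There'.lower()      # methods work on string constants too
print(hi)

print(type(greet))           # reports the str class
print(len(greet))            # function style: the string is passed in as a parameter
```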
We'll get to object-oriented. Okay, so you can use the type function. This used to say type str, but now it says class str; the word class is an object-oriented concept. But it is a string. And you can use dir. Of course there's extra stuff up here, but this is showing all the different methods, or capabilities, the things we can do to strings. So x dot something parentheses: what can we do there? This is all of the things that are built in and come with strings when we make them.

Python, of course, has great documentation online for all of these string methods: what they do, how they work, and why they work the way they do. Here's some of that Python documentation. We'll look at a few of these, but don't hesitate to search for "python string uppercase" and then say, oh yeah, that is upper. Here are a few of the ones I use a lot, and we'll look at each one of them.

The find operation says: find me a substring within a string. So find me the first na and give me back the position. That gives me back two. Then I can say go find a z in there. Well, there's no z, and so it returns negative one. So that's what find does. We do a lot of looking in strings, and we're going to use this kind of thing a lot. Converting things to upper or lower case.
There is an upper method and a lower method. So greet dot upper means that nnn is the uppercase HELLO BOB, and greet dot lower means that www is the lowercase hello bob. And greet is unchanged; greet is still Hello Bob with upper and lower case, because each of these methods basically says: I'm going to give you back an uppercase copy or a lowercase copy of the original thing without changing the original thing.

Search and replace is super useful, and it's pretty clean. Here we have a string and we use the replace method. We're passing in the old and the new: replace all Bobs with Janes. That takes this Hello Bob and turns it into Hello Jane. Again, greet is unchanged. And it replaces more than one occurrence. This one says go find all the o's and replace them with X's. It goes and finds two of them, and out come two X's. So replace does not just replace the first one; it replaces all of them.

Whitespace, as we'll see, is a big deal. Whitespace is not just blanks, although those are the most common; it's also non-printing characters like tabs and newlines and other kinds of things. We have a number of different ways to strip whitespace. Here we've got some spaces at the beginning and spaces at the end. We do an lstrip, and that throws away the spaces at the beginning; that's the left side, so it's a left strip. If there's nothing there, it doesn't do any harm. rstrip means throw away all the blanks on the right end, and strip takes both sides, so that pulls out all the spaces on both sides. This will be useful because sometimes when you're tearing stuff apart you find yourself getting extra spaces, sometimes at the beginning, sometimes at the end, and it can be a tab or a newline. Whitespace is sort of space that is not visible.
That's what whitespace is. It's like a piece of paper: an x, well, that's not whitespace, but right here, where nothing is printed, that's whitespace. It's any character that doesn't cause anything to be printed, if that makes any sense, and there are characters like that. There are even bell characters, but we don't use them very much.

We can ask very conveniently: does this line start with a particular string? This is a question that's going to return True or False. Does this line start with Please? The answer is True; it does start with Please. Does this line start with a lowercase p? No, it does not. Again, you use this in the context of an if: if something, colon, then some block of code.

We can combine these things to tear stuff apart. Let's assume that what we want to do is take a From line. This is from an email format, from a mailbox. It's got the word From, a space, the person's email name, then an at sign and the school they're from, then a space, and then the rest of the stuff, like when this mail was sent. This is a real mail message from this guy Stephen from the University of Cape Town in South Africa. It's really Stephen, and this really is the first line of a file that you'll get to know pretty well by the end of this course. Hi, Stephen. We like you. You are the example in my class and have been for a long time. People who know Stephen have actually taken this class and said: Stephen, I saw your picture in the class. So if you're ever in Cape Town at the University of Cape Town, say hi to Stephen and tell him that you saw him in the class. But okay, that's neither here nor there. What I really want to do is extract his school from this email line. Eventually the data will come from files, but this is still Chapter 6.
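Before the parsing example continues, here is a sketch of the string methods covered so far: find, upper, lower, replace, the strip family, and startswith. The padded string and the sentence starting with Please are assumed sample values in the spirit of the slides.

```python
fruit = 'banana'
pos = fruit.find('na')            # position of the first 'na'
missing = fruit.find('z')         # -1 means not found

greet = 'Hello Bob'
nnn = greet.upper()
www = greet.lower()
jane = greet.replace('Bob', 'Jane')
xs = greet.replace('o', 'X')      # every 'o' is replaced, not just the first

padded = '   Hello Bob  '         # assumed sample with spaces on both ends
left = padded.lstrip()            # strips the beginning only
right = padded.rstrip()           # strips the end only
both = padded.strip()             # strips both sides

line = 'Please have a nice day'   # assumed sample sentence
starts_upper = line.startswith('Please')
starts_lower = line.startswith('p')

print(pos, missing, nnn, www, jane, xs, both, starts_upper, starts_lower)
```

None of these calls changes the original string; each returns a new value.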
So this is the data we're going to search through. We can say: hey, let's go find the at sign. find searches from the beginning, position zero, until it reaches the at sign. So data dot find at sign gives me back where that is, and that's position 21. Then we're going to look for the next space after the at sign. We tell find to start here, at the at sign, and look forward until it finds a space. So data dot find, look for a space, starting at the position of the at sign. That'll be position 31, so 31 is what we get for the space position. So now we have two variables: the position of the at sign and the position of the space after the at sign. What we really want is this bit right here. We have to go one beyond the at sign, and we don't want the space. So we use slicing here: data sub at-position plus one, up to but not including the space position. Oh, smiley face, because we didn't have to say space minus one: the slice is up to but not including, so the character at the space position is not included. That's already a little benefit of up to but not including. And so when we print this variable out, host, we get exactly the school that Stephen works at, and probably went to; as a matter of fact, I don't know if he went there or not.
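The extraction steps above can be sketched as follows. The lecture doesn't read the line aloud, so the From line here is an assumption: it is the first line of the sample mailbox used in the course textbook, and it reproduces the positions 21 and 31 quoted above. The variable names atpos and sppos are also illustrative.

```python
# Assumed sample line; it matches the positions quoted in the lecture.
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

atpos = data.find('@')            # where the at sign is
sppos = data.find(' ', atpos)     # first space at or after the at sign
host = data[atpos + 1:sppos]      # one past the at sign, up to but not including the space

print(atpos, sppos, host)
```

Because the slice stops before its second index, there is no need to write sppos - 1.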
So this is just a note about non-Latin character sets. All programming languages from the 1960s on tended to work in what we call the Latin character set. The United States, England, Europe and lots of places use this ABC character set and its special characters. But it's really common to want to use different characters. If you're going from Python 2 to Python 3, this matters, and we'll talk about it a little later when it matters more. Luckily we're in Python 3, and one of the big things about Python 3 is that all the internal strings are Unicode. In Python 2 there was some confusion as you went between the kinds of strings. This is just a little bit of code. I'm putting some Asian characters, this is Korean actually, into x, and I ask what kind of a thing this is, and it is a string. Then there's this Unicode constant, which comes from Python 2. In Python 3, if it's a Unicode constant, it's still a string. In Python 2, if you put international characters into x, then it was a string, but there was also a separate kind of constant called a Unicode constant, and it was a different type. There were ways that you had to mess with these Unicode variables as you did things like read them from files and put them back into files. So it was much more difficult in Python 2. But we're working in Python 3, and Python 3 natively understands non-Latin character sets: international character sets, Asian character sets, Spanish and French character sets. This is one of the real benefits of using Python 3. As we start exchanging data with the outside world, this will come into play, and I'll have to show you how to use it. There were weird things that you had to do in Python 2; it just makes a lot more sense in Python 3. Okay. So we've talked about strings.
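A minimal Python 3 sketch of the point about Unicode. The Korean text is arbitrary and chosen only for illustration; it is not necessarily the text on the slide.

```python
x = '한국어'          # arbitrary Korean text, used only as an example
print(type(x))       # an ordinary str in Python 3

y = u'한국어'         # the u prefix carried over from Python 2 is still accepted
print(type(y))       # also str: there is no separate unicode type any more

print(x == y)
```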
We learned about strings and converting them. We've done a whole bunch of stuff, and again, we're not yet doing anything super useful. We're learning how to slice and dice, even though we're not making the meal yet. Up next we're going to talk about files. We're going to read some data, and in the next chapter we're going to slice and dice and use all the things that we've learned up to this point. So see you in a bit.

Hello and welcome to Chapter 7. This is the chapter where it all really starts to pay off. We have been learning bits and pieces, writing two, three, four lines of code to learn the basic building blocks of Python, learn some of the syntax, and define lots of terms. But now we're actually going to start doing something. If you look at what we've been doing so far, we have been living inside this little computer. Python says, what next? You give it a command and it does something, you give it another and it does something else, and you do this three or four times. Unless you write a loop, and then it goes 10 or 20 times, and that's it. Then maybe we write something that reads from our keyboard and gives us something back, and we print a few things out. So we've pretty much been using the keyboard, the screen, the CPU and the memory. That's where we've been living. And while it's important to talk to the keyboard and the screen, the real world is things like databases that live out here, files that live on our systems, and connecting to the network and reading data from the network.
That's what we're starting to do right now: we're starting to be able to work outside our code and create things that are permanent. Initially we're going to work on files. We'll later talk to databases and the network and other stuff, but for now we are talking about files. So we're stepping out a little bit, reading things that are permanent and creating things that are permanent.

The kinds of files that we're going to talk about mostly are text files. You can think of these as a sequence of lines in a file that are easily read by Python. You've been making text files all along: your hello.py file is a text file too. You use a text editor to create that file, you put your Python commands in it, and you run it. So a file can be thought of as a bunch of lines: one, two, three, four, five, six, seven, maybe a blank line here; that's possible. But the reality is that these are just lines, and we have a special character called the newline that we'll talk about in a second.

To read a file, you have to call the open function, and open returns what we call a file handle. open doesn't actually read the file; open makes it possible for you to read the file. As for the parameters to open: one parameter is required, which is the name of the file. Another parameter is optional: whether to read it or write it. If we're reading the file, it doesn't harm it; you can read it over and over. If you write it and there's already data in that file, it truncates it and then writes. We're not really going to write files; we're mostly going to read them. So you pass open a file name, it gives you back this file handle, and you have a variable in which you store it. I often call it fhand to be mnemonic. You'll see my code.
I use fhand all the time to indicate that that is a file handle. If we run this in interactive mode, we open mbox.txt; open is a function built into Python, and it gives us back a handle. It does not give us the data, and you can see this when we print out the file handle using print. It doesn't print the lines that are in the file. The lines in the file are sort of out there; there could be 10 million lines in the file for all we know. The handle is like a little opening to the outside of your program. You talk to the file by opening it; then you can read stuff, or if you're writing the file you can write stuff; and then you close the file to shut the handle down. The handle is a thing that allows you to get to the file. It is not the file itself, and it's not the data in the file. It's just a wrapper that allows you to get at it. If you print it out, it says: that's the file we opened, we're reading it, and the encoding, which has to do with the different kinds of character sets that we talked about at the end of the last lecture, the Unicode character set and so on. UTF-8 is a great encoding; it's probably the most typical one that you will run into. You can have files in different character sets, but most of them are UTF-8.

Of course, this is Python: if you make a mistake and name a file that doesn't exist, we get a traceback and it blows up. We'll show you in a second how to deal with that.

Now, the newline character is an important part of file reading. In files and strings we can put the newline character in with this backslash n. The backslash n is the character that indicates that we're supposed to go to a new line. And so what is this? Well, that's a backslash n.
That's a backslash n. If we display it this way, we see that the backslash n is in there. This is how we type it: we actually type backslash n to Python to indicate that we want a newline there. But if we use print, it actually interprets the backslash n, and so the backslash n causes this movement to the beginning of the next line. Now, print also adds another backslash n at the end. So the backslash n that we put in by typing it into this string is that one, and then print always puts a backslash n at the end. There's actually a way to override that behavior by passing something extra to print, which we'll talk about later.

It's important to note that the backslash n is one character. Even though this X backslash n Y prints the X, goes to a new line, prints the Y, and then print adds another newline to go down to here, if you ask for the length of this string, it's only three. The X is a character, the backslash n is a character, and the Y is a character, so it's a three-character string. The backslash n is a character like all the rest of the characters; it's just that we encode it by typing backslash n. It's called an escape, where the backslash is the escape character. Backslash n is a way to say newline, because we can't see it.
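A tiny sketch of the newline character as a single character, using the X, newline, Y string from the lecture:

```python
stuff = 'X\nY'
print(repr(stuff))       # shows the escape as typed
print(stuff)             # print interprets it: X and Y land on separate lines
length = len(stuff)      # X, the newline, and Y
print(length)
```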
It's a way for us to encode in a string this non-printable character, this invisible character. It's part of whitespace. So as we're reading through the file, we can think of it as a sequence of lines, and we can read these a line at a time. We can also read them a character at a time if we want, but it's more common to say read this line, read the next line, read the line after that, and so on. You can think about it as lines, and we will in most of the programs that we write. But realize that when we see the file like this, where each line comes back to the beginning, there's a character in the file at each of these points that says go back to the beginning. It's like hitting the enter key on your computer, and that is a newline. So in order for your text editor and Python and everybody to know where the lines end, you put newlines in the file, and that's another character. This line here looks like an empty line, but really it has a single character, and that character is a newline. And it turns out that in a bit we're going to need to keep track of the fact that every line is ended by a newline.
So up next I'm going to talk a little bit about how to read files in Python. We're going to find that there are a number of different ways to read through a file, but the most common way is to treat it as a sequence of lines, and we're going to use the definite loop, the for loop, to do this. What happens here is that open opens the file and gives us back the handle. That handle, xfile, is just the variable name I chose. That's not the data, but it is a sequence: the file handle represents to Python a sequence that we can walk through to get all the lines. And it's the simplest,
most beautiful, elegant way to read all the lines in a file. We use the for loop and we have an iteration variable. When we loop over the file, cheese is going to be the first line, then the second line, then the third line, then the fourth line. So it's like going through a string, but you're going through a file now and you're getting it line by line. I just picked a variable named cheese so you wouldn't get confused. Later I'll call this line, but Python doesn't know anything special from naming that variable line. It's the for and the in that matter. I read this as: for each line in the file handle xfile, run this loop one time and print the line out. So it's actually really quite simple. In other languages like C or C++, you have to write while loops with end-of-file conditions and all kinds of things that make this very difficult. This is one of the prettiest things that Python has. Okay, so let's talk about what we might do, and we're going kind of back to iterations now. What if we wanted to count the number of lines in a file? Well, this is a basic loop counting pattern. We open the file, and then, like in all these loops, we do something to prime the loop to get it started: set a variable count to zero. I'm going to use the variable line to go through each of the lines in the file: for line in fhand. This is going to run the loop once for each line in the file, and the variable line is going to change, but all I'm going to do is count equals count plus one. That's just the counter pattern from before: every time we see a line, we add one to the counter. We're not printing the line.
We're not even looking at its data at this point. And then when the loop is done, however many times it has to run, out it comes, and we print "Line Count:" and count. So if we open mbox.txt, this is going to do all this work and then print this line out and say the line count is 132,045. So this is a little five-line program that shows you how to count the lines in a text file using Python. Again, simple and elegant, and not too much syntax for you to have to learn. Now, it's also possible to read the file as a series of characters all in one go: read the whole file. You've got to be careful depending on the size of the file, because this is going to lead to a string variable with a lot of data in it. If it's 100,000 characters, that's actually kind of a small thing, but if it was 10 million lines, that would probably not be good. You'd want to read it one line at a time, process each line, and then do something. But mbox-short.txt is a small little file. So we open it, and we get back a file handle object, and we call the read method, and that says: go through and read all the text and give it back in one big blob, one big string, and I'll put it in inp. So that's where you have a line, a newline, a line, a newline, a line, a newline. So it's not really lines; it's just a sequence of characters with newlines in there to punctuate them. You can split that into separate lines if you want; we'll see how to do that later. Now, I picked a file that was short, so this inp variable has a string in it, and I can pass that string into the len function. It says 94,626 characters.
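The two approaches above, counting lines with a for loop and reading everything with read(), can be sketched like this. The three-line sample file is invented for illustration; the lecture uses mbox.txt and mbox-short.txt.

```python
import os
import tempfile

# Hypothetical sample data standing in for the course's mailbox files.
sample = "From a@example.com\nSubject: one\nFrom b@example.com\n"
path = os.path.join(tempfile.mkdtemp(), "mbox-demo.txt")
with open(path, "w") as out:
    out.write(sample)

# Pattern 1: treat the handle as a sequence of lines and count them.
fhand = open(path)
count = 0
for line in fhand:
    count = count + 1
fhand.close()
print("Line Count:", count)

# Pattern 2: read the whole file into one string (fine for small files).
fhand = open(path)
inp = fhand.read()
fhand.close()
print(len(inp))    # number of characters, newlines included
print(inp[:20])    # the first 20 characters: positions 0 up to, not including, 20
```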
That's kind of a small little file, and perfectly okay to read all in one go. And so now I say just print the first 20 characters, that's from the beginning up to but not including 20, and it shows that the first 20 characters of that little file are a From line, because this is a mailbox file. Now let's say we're going to do some searching, like the loops we did where you're looking for something. We're going to search for lines that have a prefix of From, and we're going to print those lines out. There are lots of lines in this file: line, line, line, line, From, line, line, line, line, From, on and on and on, and we only want to show the ones that match. So we are going to write an open statement, and then we're going to loop through and ask the question: if the line starts with From, print it. So sometimes it's going to skip, skip, skip, skip, and then it's going to print; skip, skip, skip, skip, skip, and then print; and so on until the file is finished. That's the basic idea. This is like a search criterion: we're looking for lines that have the string From as their prefix. Now when we look at the output of this, it's kind of weird. We see these little blank lines that show up between the lines we printed. What's going on here? Let's take a quick look. The problem is newlines. I mentioned that the file has newlines in it. When you do the for loop, it doesn't throw the newlines away as you might expect. It would be kind of nice if it did, but it doesn't. When you read, it reads that first line up to and including the newline and gives you that back as the variable. So that is the first new line.
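A sketch of the search loop and why its output comes out double-spaced. An in-memory io.StringIO object is used here as a stand-in for a real file handle (an assumption made so the example needs no file on disk); it can be looped over line by line the same way.

```python
import io

# Stand-in for open('mbox-short.txt'); the contents are invented.
fhand = io.StringIO(
    "From a@example.com\nSubject: hi\nFrom b@example.com\nSubject: yo\n"
)

matches = []
for line in fhand:
    if line.startswith("From"):
        matches.append(line)
        print(line)   # the line's own newline plus print's newline

# repr makes the trailing newline visible on each matching line.
for m in matches:
    print(repr(m))
```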
So that means it's going to go down, and then the print statement actually adds another newline. The second line of the file has a newline at the end of it too, and print adds another newline after that. So if we take a look at the code, this variable line has a newline in it, and then the print adds a separate newline. That's how we get two newlines: the print statement's newline and the newline from the file. Here's how we fix it, and you're going to write this code a lot, because when you're reading text files you end up with a newline and often you don't want it. Thankfully, as we saw in the previous chapter, there is a nice little string method in Python called strip that allows you to throw away whitespace. To review, whitespace is anything that doesn't print, and the newline is a non-printing character, so rstrip gets rid of it. rstrip strips whitespace from the right end of the string. So as we loop through all the lines in the file, we say line equals line dot rstrip, and then this variable no longer has the newline at the end of it. We have our little if statement, and when we print, the data has no newline in it, so the print only goes down one line and now we have single-spaced output. You're going to be doing that a lot. It's really common to read through a file and just strip the newline or any trailing space off the end of each line. Now there's a couple of ways to do a loop like this. Let's think of this as a file with lots of different lines in it. And we want to
ignore all the lines except some, say, good lines, and we want to do something with those good lines, the lines we're looking for. This is like searching for a needle in a haystack. If you look at this code at a high level, we're going to loop through everything, and then we're picking out which lines are the good lines down here. Now, often we have a bunch more code that we want to run, and we're not just printing the lines. So sometimes you structure the loop a little bit differently. This is going to do the exact same thing; it's just a different way of thinking about the loop. The top part is the same, we're stripping the line, and everything else is the same here except we add this "if not": if the line does not start with From, continue. So basically we have a skipping pattern. The lines we're not interested in, we skip. We come down, we skip a lot of lines, and then we find a line that's good and we fall through. So this is the good code, and all the other good code that we want to run on that line shows up down here. These are just two ways to do the exact same thing. Another way to select the lines that we're interested in is to use the in operator. We talked before about the in operator and how that works. We're going to use the continue skipping method again. We read all the lines, and, in these first few lines, if uct.ac.za is not in the line, skip it. So this is going to print out all the lines that have the string uct.ac.za in them, and you see this is the output of the program. Sometimes you'll have programs that want to read different files. Often I give assignments where I say show me how this program runs on the short file and then show me
again how it runs on the long file, just like this. The way we do that is to input the file name. Instead of making the file name a constant in the open call, we make the file name an input. So we run an input statement, which gives us a prompt, and then we type mbox.txt, and that shows up in this variable fname. It's a string, of course, and we pass that into open, and then we do the count operation. So if we enter mbox.txt it counts 1797 subject lines in mbox.txt, and if we give it mbox-short.txt it says there are 27 subject lines in mbox-short.txt. Again, this is another one of those ifs: it's just counting, but only counting lines that match a particular pattern. Okay, so now the user can also type bad file names, and we need to be able to deal with that as well. So we make a small change to the code. The dangerous code is this line right here, the open. It's going to trace back if that file doesn't exist. So what do we do? The rest of this program is exactly the same. The only thing that's different is this line: we took out insurance on it. We know that it might blow up, and so we put it in a try and except block. Here's how the code runs. The input runs and we type in a good file name. It comes in here, the open works, and so it skips the except, runs the rest of the code, and prints out the count. So that's the good pattern. The bad pattern is here: we type in a bad file name, na na boo boo. It comes into the try, and this line blows up. So it jumps down into the except code, which prints out file cannot be opened.
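Putting the pieces above together, here is one possible sketch of the subject-line counter with rstrip, the continue skipping pattern, and a guarded open. Several things are assumptions for the sake of a runnable example: the logic is wrapped in a function named count_subject_lines, the file name is passed in rather than read with input(), the handler catches OSError and returns None instead of stopping the program, and the demo file is invented.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Count lines starting with 'Subject:'; return None if the file can't be opened."""
    try:
        fhand = open(fname)          # the "dangerous" line we insure with try
    except OSError:
        print("File cannot be opened:", fname)
        return None                  # stand-in for stopping the program here
    count = 0
    for line in fhand:
        line = line.rstrip()         # drop the trailing newline
        if not line.startswith("Subject:"):
            continue                 # skip the lines we don't care about
        count = count + 1
    fhand.close()
    return count

# Hypothetical demo file; in the lecture the name comes from input().
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "mbox-demo.txt")
with open(demo_path, "w") as out:
    out.write("From a@example.com\nSubject: one\n\nFrom b@example.com\nSubject: two\n")

good_count = count_subject_lines(demo_path)
bad_count = count_subject_lines(os.path.join(demo_dir, "na na boo boo"))
print("There were", good_count, "subject lines in the demo file")
```

In an interactive version you would set the file name with something like `fname = input('Enter the file name: ')` before calling the function.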
So it prints this out. Now, this quit is really important, because if we don't put this quit in here, the program is going to continue down here, and that's going to blow up because the file handle is not defined properly at this point. So we have this quit. quit is a special function that never returns. It's a way to terminate the entire Python program silently with no traceback. We put in our own error message so we look professional, we say we could not open this file, and then we stop. If you don't, it's going to come down here and trace back right there. So quit is useful when you want to stop executing because you've detected some kind of an error. So that's a quick zoom through opening files, reading through files, and doing some patterns. Most of the rest of the programs in this course are going to say open, for, rstrip, look for something, and then do something interesting. That's the loop we're going to write over and over and over again. And now we see how this looping and if and iteration and variables are starting to come together, and you can actually write a program that does something useful. But before we get to too many more programs, we've got to switch gears and talk up next about data structures, and that is the shape of data, and how we can use more intricate and complex variables to help solve our problems.
Hello, and welcome to chapter eight.
We're going to talk about lists in this chapter. Up to now we've been talking about algorithms. Algorithms are the concept in computer science of using the programming language to express the steps that you want the computer to go through to solve a problem. Read some data, convert it to a floating-point number, check to see if it's greater than 40, do one thing if it's greater than 40, do another thing if it's not, then print out the result. Or open a file, read everything, if a line starts with something do something, if not skip it, and then add all the things up. Those are a series of steps, and hopefully by now you're getting to the point where you have a good understanding of steps. But there's a whole other side of computer programming, and we call it data structures. Data structures are not the steps, but instead clever ways that you lay out the data and make sure that the data does what you want it to do. That's what we're going to start talking about now. Lists are the first and simplest data structure. Strings are kind of like data structures, but lists are probably the first real data structure that we're going to think about, design, and make use of effectively. But before we talk about what a collection is, we should talk about what is not a collection. We're familiar with what a variable is. We know that a variable is a little piece of memory that's got a label on it. An assignment statement sticks a 2 into x, and then the 2 is in this little spot. Then it goes to the next line and 4 goes into x, and so the 2 goes away and the 4 is there. A key thing is you can't have more than one value in a variable at any given moment. So when we move to collections: collections are more like suitcases.
We can put lots of things in them. We have ways of organizing them, and as we go through lists and dictionaries and tuples, we'll see how there are different ways to organize them. As a matter of fact, we've been talking about lists for a while. Every time we used one of these square-bracket syntaxes in earlier programs, we were working with lists. So this is technically a three-item list with three strings separated by commas: Joseph is one string, Glenn is another string, and Sally is another one. The list is basically a list constant, and it's being assigned into a variable. So this friends variable has three things in it, and that's different from what we've been talking about before. These structures with square brackets are those lists, and the print around them is just print with its parentheses. 1, 24, 76 is a three-item integer list. Red, yellow, and blue is a three-item string list. But it doesn't all have to be integers or strings. Python can handle different kinds of data in different positions in the list. So red, 24, 98.6 is a three-item list with a string, an integer, and a floating-point number. And while we're not going to use this too much for now, this outer list is a three-item list and the second item is another list. This is alluding toward what we'll do when we start talking about data structures: we have a structure and then we have another structure inside of it, and sometimes this can get quite complex. Usually we do this for a reason; this one here has no reason other than to show you that a list can be made up of lots of things, including other lists. And of course there is also the notion of the empty list. Like I said, I've had to tell you a bit about lists all along. We use them in for loops. We can put lots of things here.
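The kinds of list constants just described can be written out as a short sketch. The variable names other than friends, and the values inside the nested list, are illustrative assumptions.

```python
friends = ["Joseph", "Glenn", "Sally"]   # three strings
numbers = [1, 24, 76]                    # three integers
colors = ["red", "yellow", "blue"]       # three strings
mixed = ["red", 24, 98.6]                # a string, an integer, a float
nested = [1, [5, 6], 7]                  # the second item is itself a list
empty = []                               # the empty list

print(friends)
print(mixed)
print(nested)
print(empty)
```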
We can put a file handle here and go through the file. We can put a string there and go through the characters in the string. And with a list, the iteration variable goes through the successive elements of the list. That's why this prints out 5, 4, 3, 2, 1, and then the loop is done and it prints out Blastoff. So we've actually been iterating through lists with for statements all along. When you just need to iterate through a list and go through every item in order, the for is a great way to do that. So friend is our iteration variable and friends is our list variable. That says friend is going to successively take on the values Joseph, Glenn, and Sally, and print out Happy New Year for Joseph, Glenn, and Sally. It runs three times, once for each of the values, and the iteration variable advances. Now, I do want to make it really clear that the choice of friends and friend, singular and plural, is arbitrary and capricious. It happens to be convenient and intuitive that the iteration variable is singular and the list variable is plural, but Python has no idea about singular and plural. As a matter of fact, Python wouldn't care. It would be totally equivalent to have the list variable be z and the iteration variable be x; x will take on the successive values of these three things. Now, am I being nice to you by calling this list friends and this iteration variable friend? I am, but I also don't want it to confuse you if you're just a beginning developer. Just like with strings, we can look within a list. When you put more than one thing in a data structure, you need to be able to get them out. So lists have positions; they maintain order. The first thing in the list is in the sub zero position, then sub one, sub two. Just like strings, they're zero-based, just like european elevators.
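A short sketch of the loop over friends described above, including the point that Python does not care about singular and plural names. The greetings lists are added only so the two loops can be compared.

```python
friends = ["Joseph", "Glenn", "Sally"]

greetings = []
for friend in friends:                  # friend takes each value in order
    greeting = "Happy New Year: " + friend
    greetings.append(greeting)
    print(greeting)

# Same loop with arbitrary names: Python sees no difference.
z = ["Joseph", "Glenn", "Sally"]
greetings_xz = []
for x in z:
    greetings_xz.append("Happy New Year: " + x)

# Positions are zero-based, so the first item is at position 0.
print(friends[0])
```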
They're zero-based. So if we take a look and we say friends sub one, that's how I read the little square brackets, friends sub one means Glenn, because Joseph is the zero, Glenn is the one, and Sally is the sub two. And so that's what prints Glenn out in this particular example. Now, lists are mutable. Mutable is another word for changeable. It means that if a list has three things, you can change the one right in the middle if you want. To take a look at what's not mutable: strings are not mutable. So if I assign Banana into fruit, well, fruit sub zero is a capital letter B. Could we imagine for the moment that we could change fruit sub zero to a lowercase b? Well, this syntax is how you would do it if you could do it, but it turns out that strings are not mutable, meaning they're not changeable once you create them. That's why when we do things like lowercase or uppercase, we take the fruit and we say give me a lowercase copy of that, and then we take the return value and store it in x, and that's how x becomes a lowercase banana. But fruit is still the original one.
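To make the immutability point concrete, here is a small sketch. The try/except wrapper is an addition so the example keeps running after the failed assignment; in the lecture the assignment simply produces a traceback.

```python
friends = ["Joseph", "Glenn", "Sally"]
print(friends[1])           # the sub one position

fruit = "Banana"
try:
    fruit[0] = "b"          # strings do not allow item assignment
    changed = True
except TypeError as err:
    changed = False
    print("Cannot change a string in place:", err)

x = fruit.lower()           # lower() returns a new lowercase copy
print(x)
print(fruit)                # the original string is untouched
```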
So fruit has not changed. Compare and contrast that with a list. Here we have a five-item list that starts 2, 14, 26, 41, and we're going to change the sub two position. The sub two is zero, one, two, so that's that one right there, and we're going to assign a 28 into it. That 28 is going to wipe out what was there. So we can do item assignment in lists by putting a bracket syntax on the left-hand side to say: don't just put it in the variable, put it in this position within the variable. When you print that out, the 26 has become 28 and everything else is unchanged. The whole list is still there; there could be a thousand items in the list and you're changing just the one. We have a function called len, and we've been using it all along to take a look at how long strings are. It counts the number of characters in the string, so that's a nine-character string. If we have items in a list, len tells us how many items there are. It's not how many characters there are, it's the number of things, and each thing doesn't have to be a number; it could be a number or a string or even another list. len is the way to say, hey, how many things are in there? There's also a function that gives us a sequence of numbers, and we use it, as we'll see in a second, to construct specialized loops to go through lists. So let's take a look at this range function just for a minute. range takes as its parameter the number of numbers that you want returned. So say I'd like four items, the numbers zero up to but not including four. It turns out that this is really useful for constructing counted for loops, loops that go to the zero, to the one, to the two, as compared to the loops that just go through each item. And so it's a common thing to say: okay, we know how many things are in this list, there are three friends, and if I combine range and len, so I take len of friends, which is three, and then
I take range of three, I get zero, one, and two. And the interesting thing is that zero corresponds to the first one, one corresponds to the second one, and two corresponds to the third one. We'll use this to construct loops, especially when we need to go through a list and remember what position we're at. So here's an example of two different loops. This is a for loop that's just going to go through whatever's in this list. friend is going to take on the successive values, and so it's going to print out these three things just as you would expect. If you don't need to know your position in the list while you're going through the loop, that's okay. But sometimes you want a little more sophisticated loop, where you know the position. So instead of looping through the list itself, we loop over range of len of friends, which gives us zero, one, two, and then i takes on the successive values zero, one, and then two. So this loop is going to run three times, and i is zero the first time. We look up the value at the sub zero position, so we get Joseph the first time, and it prints out Happy New Year Joseph. Then i becomes one, and that gives us Glenn, and that prints out, and away you go. If you look at these two loops, they really do the exact same thing. The only difference is that here we allowed the for to find its way through with the iteration variable, and here we created our own i variable that went through the positions. And the positions are dense. There's no gaps in here.
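The item assignment, len, and range ideas from this stretch can be sketched together. The name lotto and the fifth value in that list are assumptions (the lecture only reads out the first four), and "Hello Bob" is just one example of a nine-character string.

```python
# Item assignment: lists are mutable.
lotto = [2, 14, 26, 41, 63]    # fifth value is an assumed placeholder
lotto[2] = 28                  # replace the item at position 2
print(lotto)

# len counts characters in a string and items in a list.
greet = "Hello Bob"            # assumed nine-character example
print(len(greet))
friends = ["Joseph", "Glenn", "Sally"]
print(len(friends))

# range(len(friends)) yields the positions 0, 1, 2.
positions = list(range(len(friends)))
print(positions)

# Loop 1: let the for statement walk the values directly.
by_value = []
for friend in friends:
    by_value.append("Happy New Year: " + friend)

# Loop 2: walk the positions and look each value up.
by_index = []
for i in range(len(friends)):
    friend = friends[i]
    by_index.append("Happy New Year: " + friend)

print(by_value == by_index)
```

In Python 3, range produces a range object rather than an actual list, which is why the sketch wraps it in list() before printing.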
So it's zero through two that it goes through, and these two loops are equivalent. There'll be times when you'll want to use one or the other. I tend to prefer the first one because it's prettier, as long as it works for me. So that gets us started with lists. We'll be back in just a bit.
Okay, so we've taken a look at looping through lists, and now we're going to take a look at some of the operations that you can do with lists. Python has, as we'll soon learn, an object-oriented approach to its operators. The plus can add strings and it can add numbers, including floating-point numbers and integers. The plus works similarly with lists. The plus looks to its left and looks to its right and says, what am I adding? In the case where I'm adding the list 1, 2, 3 and the list 4, 5, 6, it concatenates them together, just like it does with strings, and so we get 1, 2, 3, 4, 5, 6. It just concatenates this list to another list, and it doesn't change a or b. Just like in any kind of assignment statement, calculations on the right side don't change the variables; they produce a new value, which is then assigned into c. You can also use list slicing, and it's easy to remember if you remember how strings work, because lists work exactly the same way. Of course, it's a little tricky. The first number is the starting position, and positions start at zero. So there's the zero position, and the one position is right there: start at one. Right, but go up to but not including three.
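A minimal sketch of list concatenation as described above, using the same a, b, and c names.

```python
a = [1, 2, 3]
b = [4, 5, 6]

c = a + b      # builds a new list; a and b are left alone
print(c)
print(a)
print(b)
```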
Zero, one, two, three: so this goes up to but not including three, and that's why we get 41, 12 out of that. Up to but not including; I'll just say that over and over again. You can leave the first part out and say up to but not including four. That starts at the beginning and goes up to but not including four, and that's how we get that piece right there. We can also say start at position three, zero, one, two, three, and go to the end. The fact that the number 3 is the value at that position is irrelevant. Three to the end is those three numbers. And then you can do the whole list with slicing as well. Again, these are pretty much the exact same examples I used when I was doing strings. There are a number of different methods, and you can look up all the documentation on list. I often just use the dir function to remind myself of them. Append we'll look at. Count looks for certain values in the list. Extend adds things to the end of the list. Index looks things up in the list. Insert allows the list to be expanded in the middle. Pop pulls things off the end. Remove removes an item from the middle. Reverse flips the order of the items, and sort puts them in sorted order based on the values. So let's look at a couple of these. If we build a list from scratch, we have a way to ask for an empty list. There are a couple of different ways to ask for one. We could use just two square brackets next to each other, but this is a form we call the constructor form, where we say, hey Python, make a list. In this case the word list is like a reserved word to Python; it's really a built-in class. Saying list with parentheses says make me an empty list, and then we assign that list into stuff. So stuff is now a list object.
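The slicing cases above can be sketched as follows. The list name t and the values other than 41, 12, and 3 are assumptions chosen to match the results read out in the lecture. The last lines show the dir trick for recalling the method names.

```python
t = [9, 41, 12, 3, 74, 15]   # 9, 74, and 15 are assumed filler values

first = t[1:3]     # start at position 1, up to but not including 3
upto4 = t[:4]      # from the beginning, up to but not including 4
from3 = t[3:]      # from position 3 to the end
whole = t[:]       # a copy of the whole list

print(first)
print(upto4)
print(from3)
print(whole)

# dir(list) lists the available methods; filter out the underscore names.
methods = [name for name in dir(list) if not name.startswith("_")]
print(methods)
```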
It's of type list, but it has nothing in it. And then we can call the append method, stuff dot append, and stick book in. The list knows how long it is, where the end is, and how to add something to it. Then we add a 99 to it and print it out, and we've got book and 99, reminding ourselves that while lists often have the same types of values in the various positions, it doesn't always have to be that way. Then we say stuff dot append cookie. You can keep on going, and we end up with three things, including the cookie. We have an in operator that works pretty much like the in operator for a string. Is 9 in my list? That's pretty simple, and the answer of course is yes, 9 is in my list. Is 15 in my list? Looking through: no, 15 is not in my list. And then there's the not in operator; think of that as kind of like one operator. Is 20 not in the list? Since it's not there, the answer is True. So it's kind of like startswith or in for strings, the same kind of stuff. Lists are in order and they're sortable, and this is something that we take good advantage of. A lot of what computers want to do is sort stuff: look all these things up, append them, and then get them sorted. So there is this sort method inside of list. Here we put three values in positions zero, one, and two: Joseph, Glenn, and Sally. Then we tell the list to sort itself, and we print it out. Now, this actually sorts the list in place, which is different from upper and lower, because if you remember, strings are not mutable, but lists are mutable. So you say, hey, just sort yourself, and it sorts itself, and then it's in alphabetic order: Glenn, Joseph, and Sally. I happen to be clever.
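A sketch of building a list from scratch, the in and not in operators, and sorting in place. The list named some and the values in it other than 9 are assumptions made so the membership tests have something to run against.

```python
# Build a list from scratch with the constructor form and append.
stuff = list()
stuff.append("book")
stuff.append(99)
stuff.append("cookie")
print(stuff)

# Membership tests with in and not in.
some = [1, 9, 21, 10, 16]      # assumed sample values
has_9 = 9 in some
has_15 = 15 in some
lacks_20 = 20 not in some
print(has_9, has_15, lacks_20)

# sort() rearranges the list itself rather than returning a copy.
friends = ["Joseph", "Glenn", "Sally"]
friends.sort()
print(friends)
print(friends[1])
```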
I only put strings in there, and I put my uppercase and lowercase in a very consistent pattern. But the list has changed, and if I look at friends sub one, that is the second item, which is Joseph, and that prints out right down there. There's a whole bunch of built-in functions to help manipulate lists. What I was showing before were methods; sort is a method that's part of list. But there are other functions that take a list as their argument. We already talked about the len function, which tells you how many items there are. There is, pretty obviously, max, which says go through and find the largest. min goes through and finds the smallest. sum goes through and adds them all up. And we can compute the average by taking the sum of all of them and dividing it by the length. You might think to yourself, oh wow, I wish we'd known this a few chapters back when we were having to write all those loops to do max, min, sum, largest, smallest, and so on. You can think in your mind that inside each one of these functions is a loop that does pretty much what you did in those chapters. Part of the reason we did that back then, even though these things existed, was that they're easy loops to understand. So basically this gives us two different ways of computing things like the maximum and minimum. It's not necessarily all that much easier using these, because you can either do it the old way, or you can make a list and then use these functions. So let's take a look, and I'll just say that these two bits of code are doing the exact same thing. They implement a program that repeatedly asks for numbers until we type the word done, and then computes the average and tells us what it is. Using the stuff from the loop chapter, we start with a total variable and a count variable, set them to zero, and then we read a number and check for done to break out. But then we convert it to a
floating point value. And then we say total equals total plus value and count equals count plus one. And so this is going to run over and over and over again, however many times we're going to do this, and then it's going to pop out. When it's done, the running total will have become the overall total; we divide it by count and print the average out. Okay, so that's kind of how we would have done this before we knew how to do this with lists. Now let's take a look at the other one. In the other one we say let's make an empty list. Remember, this is that constructor syntax that says to Python: make me an empty list and assign it into the variable numlist. It is a list, but it has nothing in it. Now we're going to write another loop. This part here is the same, these three lines: read the number, if it's done quit, and convert it to a value. But instead of doing the actual calculation right now, what we're going to do is just append it to the list. So the list will start out empty. Then the three will be in the list, then the nine will be in the list, then the five will be in the list. So we're appending each time through the loop.
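A minimal sketch of the two averaging programs being compared. As an assumption for this sketch, the typed input is simulated with a list called `typed` so the code runs without a keyboard; the lecture version reads each value with input() inside a loop.

```python
# Simulated keyboard input (assumption: stands in for repeated input() calls).
typed = ['3', '9', '5', 'done']

# Version 1: keep a running total and a count.
total = 0
count = 0
for inp in typed:
    if inp == 'done':
        break
    value = float(inp)
    total = total + value
    count = count + 1
average_loop = total / count
print('Average:', average_loop)

# Version 2: append each value to a list, then use sum() and len().
numlist = list()
for inp in typed:
    if inp == 'done':
        break
    value = float(inp)
    numlist.append(value)
average_list = sum(numlist) / len(numlist)
print('Average:', average_list)
```

Both versions print the same average; the second one simply holds every value in memory until the end.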
We're appending into the list, so we're just growing the list every time we read a value, instead of actually computing something with the value that we've got. In either case we get a value, and in one case we append it to the list. And then finally the break happens, and then we just say, hey Python, sum up everything in the list, add these three numbers together, and then divide it by the length of the list, and you'll have the average. And so these two things give us exactly the same output. Now there is one difference. If there were like one million or one billion numbers, in the list version they all have to be stored in memory simultaneously, whereas in the first version it's doing the calculation over the billion numbers as it goes and not using up so much memory. So there is a difference in memory: the list version uses more memory. But for most of the things that you're going to be doing, it doesn't really matter by the time it's all said and done, and so for you the difference between these is not all that significant. But it's important to understand that they're just two techniques to accomplish the same thing. So now we're going to wrap up and talk a little bit about how strings and lists are related. They're sort of related in that they're both zero based and we use the square bracket operator to do various things. But there are a lot of situations where we're looking at our data and we're combining the use of lists and strings. So let me show you the first thing, probably the coolest thing, which we're going to use a lot the rest of the class, and that is the split function. So let's take a string. We've got abc here.
It's With three words. What we're interested in is the fact that there are spaces in this string. And what split does is it says, I'm going to look through this thing, I'm going to find the spaces, I'm going to break this into pieces, and I'm going to return you a list of the separate individual pieces. So look for blanks, break it in pieces, and give me back the pieces. So I print these out, and now you see that it's a list with three items, with three words. The spaces are gone, but it's given the words to us. So it's like: split this into words, please, and give me a list of the individual words rather than a big long string with spaces in the middle of it. And that is a quick way to go from a line to its words, and it's really common. A lot of the things we're doing go like: get the second thing, or the third thing, or whatever. So split's really nice, because then you can just grab stuff. And so you say, oh, how many things did I get? Well, I got three; the len function tells us that. And I can print the first word I got with stuff sub zero, and that'll be With, because that's the sub zero position. So I read something, I split it, I can say there's three things, and I can look at the first word, basically without really knowing much. Now if you remember, earlier we used find and slicing to do a similar kind of thing, but people tend to prefer split. You can also then loop through them. So you can split this into stuff as a list of words and then write for w in stuff, and w is going to take on the successive values of the three words. And so you can make a loop by reading some data, splitting it, and writing a for loop, and then it's effectively going through the words in that line of data. And so that's a really powerful concept that we'll use in a lot of the programs that we're going to write. Just a couple of bits about this
and how it works. Split with no parameters looks for spaces, but it also treats a bunch of spaces as a single space. So it's pretty smart about that, and even though this has a lot of spaces between lot and of, you only see lot and of; all the spaces are gone. It does something special about spaces. It's really whitespace, so tabs or newlines or other whitespace characters would also qualify in split. Now you don't always have to split based on spaces, and in a lot of data that you're going to run into, you're going to want to split on something else. So here's some data where we're using semicolons to separate the first, second, and third piece. Now if you just call split, split's looking for spaces, and so it gives you back a list of the things broken apart at spaces. But there's not a single space in that line, and so we get a list, see, it's a list, but there's only one item, and the semicolons are still sitting there. Split doesn't go, whoa, this looks like it should be semicolons. Split's job is to split the string based on spaces. Okay, but given that this is something we like to do, you can tell split what character you'd actually like to split on. Now it's not quite as clever when splitting on something other than spaces: if there's a bunch of semicolons in a row, it doesn't collapse them; it still thinks of each one as a splitting point. But in this particular case, where there are no spaces, it's going to split that. So it says split this based on the semicolon instead of the space. And if you take a look at what comes out of this, we split on semicolon and now we have a three item list, and we get first, second, and third. A lot of your data comes out of some logging system or some router status updates, who knows what you're looking at, but the delimiter is often something other than space, and you can handle that with split. So this is a useful thing when parsing things like our
email address, right? We wanted to get things like the email address, this second piece, off of the line. And so we can use split to take advantage of this. So here's a little loop that's just going to print out not the email addresses, but instead the day of the week for all these lines. How do we do that? Well, we can observe really quickly that if we split based on spaces, counting zero, one, two, it's the word at position two. So we can quickly write a bit of code that opens the file, then loops through the lines. We do this all the time now. rstrip takes the newline off the end. We can check to see if it starts with from space, right? From space is our key, so we're ignoring all of the lines that don't start with from space. But then we find the line that starts with from space and we split it, and then we just print out words sub two, the third word. And so we get the third word of the lines that start with from, and that's how this thing works. Now sometimes we want to dig in deeper, and we will take something, split it, and then split a piece of it again with a different delimiter. So let's just say that the thing that we want to achieve is getting the part after the at sign for email addresses. We did this before with find and pos and stuff like that, but you can use split to do this as well. So the first thing we're going to do is take this line. We're going to split it based on spaces, right?
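Before going deeper, here is a rough sketch of the split behavior and the day-of-the-week loop described above. The `sample` list is assumed data standing in for the mail file so the sketch runs on its own; the lecture version opens the file and loops over its lines.

```python
# split() with no argument breaks on runs of whitespace.
line = 'A lot               of spaces'
print(line.split())        # ['A', 'lot', 'of', 'spaces']

# With no spaces in the data, split() gives back a one-item list...
line = 'first;second;third'
print(line.split())        # ['first;second;third']
# ...unless you tell it which delimiter to split on.
print(line.split(';'))     # ['first', 'second', 'third']

# Assumed sample lines standing in for the mail file.
sample = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
    'Return-Path: <postmaster@collab.sakaiproject.org>\n',
    'From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008\n',
]

days = []
for line in sample:
    line = line.rstrip()                 # drop the trailing newline
    if not line.startswith('From '):     # ignore everything else
        continue
    words = line.split()
    days.append(words[2])                # position two is the third word
    print(words[2])
```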
Chop, chop, chop, chop, and the fact that there's an extra space there doesn't matter; split happily just zooms through that. And then, counting zero, one, two, words sub one is this email address. So we'll put that in a variable called email, and so email will be a string that's just this address. So in two lines we've pulled out this second piece into a variable. Then what we're going to do is re-split that. We're going to take this string we've got and split it based on the at sign, because we know it's an email address. So we get a new set of pieces: the first part is the person's name, and the second part is the host name where their email is hosted. And then we just happen to know that this is the zero item and this is the one item, so we can get at the host name with sub one. So the interesting thing here, if you think back to how we did this before with find and pos and all that stuff, is that it's really a lot cleaner. After you understand it, I can look at this and it's easy for me to see that it's correct, whereas with that pos stuff you've got to add one and start the second find after the first position, and you just have to remember that. This is a lot cleaner, and it's a more typical way of pulling this kind of information out of a line. So in this chapter we've talked about lists, and we've talked about the concept of collections. That's our first data structure.
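The double split just walked through might look like this. The sample line and the variable name `pieces` are assumptions for illustration; `words` and `email` follow the names used in the lecture.

```python
# Assumed sample line in the style of the mail data.
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

words = line.split()        # first split: on whitespace
email = words[1]            # the second word is the address
pieces = email.split('@')   # second split: on the at sign
print(pieces[1])            # uct.ac.za
```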
We're not just doing algorithms; we kind of know algorithms now. Now we're going to do data structures, and this chapter and the next two chapters are our foundational data structures. And then, like everything, we'll make more complex data structures by composing those data structures together. We've looked at how strings and lists are connected together and how split works, and these are all really powerful tools that we're going to use going forward.
Now we're going to take a look at how we would write some code to do some parsing and read some data. As a matter of fact, we're going to read through our famous mailbox data, look for lines that begin with from space, and extract the third word. As a matter of fact, we already have some of this code written, and we're going to debug it. So here we go. Here we have it, and it's a pretty basic program. It opens a file, loops through the file, throws away the whitespace, splits it into words, and checks to see if the zeroth word, the first word, is from; if it's not, we skip and read the next line. Otherwise, if we find a line that starts with from, then we print the third word, which is words sub two. Okay, so this is what we've got, and we carefully saved this file into the same folder, ex_08. And so let's go ahead: cd desktop, python for everybody, ex underscore 08. So these are some files: we've got our day of the week Python file and our mbox short. So that's sitting there. Okay, and so let's run this program. This is the program we've got right here: python3 dow.py. And it doesn't work. Now by now you've seen a few tracebacks, and there you go. When you look at a traceback you think to yourself, well, I made a mistake. And you've gotten pretty good at looking at that line.
So there you are, you're like, this is the line, there must be something wrong on this line, and you want to change it. But that line is not actually the problem in this particular case, and so you've got to be careful sometimes. One of the things that you didn't notice right away is that it actually worked: it printed the first line out. So if we take a look at our data set, it found the line that started with from space, it split it, and printed out the third word. And it blew up later. Part of the problem is that we don't know what it was doing when it blew up. And so the first thing I like to do in this kind of situation is find the line and make sure there's a print statement right before it. So I print Words colon and then comma wds. I want to print right before the line that blows up, so that when this finally does blow up I know what was going on in that line. So I'm going to run it again. Did I forget to save it? Oh, I forgot to save it. Look at that, see the little blue dot: forgot to save it. So now we see a whole bunch of output, and we see that it's actually doing a whole lot of work before it blows up. You see that it prints the words out from that first line and prints out Saturday, which is exactly what we expect; it's the third word in the line. Then it reads a whole bunch of stuff, and what it's doing now is ignoring lines. Let me just put something here: I'm going to print Ignore, so I can keep track of when these lines are being ignored.
So let's run it again and have the word ignore pop up. Right, and so it's doing a lot of ignoring. It finds these words, prints out Saturday, reads this line and ignores it, reads this line and ignores it, reads this line and ignores it. So a lot of stuff's going on here that you might not realize. And so we have to take a look at what the problem is. It is blowing up on words sub zero, and now we can scroll down and look at exactly what happened right before the traceback. And the interesting thing is that there is an empty list; there's a list with zero items. So I'm going to print the line out too: print Line colon. Now, I haven't changed my program at all; I'm just trying to figure out what's going on here. So I'll save that and run it. We've got a lot of output, and it's still working: it reads a line, splits it into words, and then prints out Saturday, which is the third word on the line. Now here it reads a line, and this line is a blank line. And because it's a blank line, split returns no words, and that's what blows up. The problem is list index out of range: words sub zero, the first word, is not valid when there are no words. So this is a statement that works most of the time. Now, you might think, oh, I want to just put a try and except in there. Well, the right thing to do is to say to yourself: wait a second, if I don't have enough words, if the length of words is less than one, continue. So basically it's going to come through here, it's going to split the line, and if we don't have any words, meaning it's a blank line, then we're going to skip it. So let's run that. So now this ran all the way to the end. It did a lot of stuff and it did not blow up; specifically, we didn't get a traceback. Another way to protect this would be to... well, we'll take this part out.
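A minimal sketch of the failure and the length check just described. The `sample` list is assumed data that includes a blank line, standing in for the mail file; `wds` is the variable name mentioned in the debugging print, and `found` is a hypothetical list added only so the result can be inspected.

```python
# Assumed sample lines standing in for the mail file.
sample = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
    'Return-Path: <postmaster@collab.sakaiproject.org>\n',
    '\n',
    'From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008\n',
]

# A blank line splits into an empty list, so wds[0] would raise IndexError.
print(''.split())   # []

found = []
for line in sample:
    line = line.rstrip()
    wds = line.split()
    # Skip lines with no words before touching wds[0].
    if len(wds) < 1:
        continue
    if wds[0] != 'From':
        continue
    found.append(wds[2])
    print(wds[2])
```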
This is called a guardian pattern, because this line is dangerous. It could blow up, but it won't blow up if execution makes it past the guardian, and execution won't get past the guardian under the conditions that cause the blowup. Another way to do this might be to protect it as follows: if the line is a blank line, continue. So now what we're going to do is skip blank lines. I'll even print Skip Blank, so if it's a blank line we print skip blank and keep going. This will skip blank lines, and this will skip lines that don't have from, but because we're not processing blank lines, words sub zero always works. So I can run this code, and it works again. So here we have a blank line; we skipped it. Here we have a blank line; we skipped it. Now here we had a non-blank line; we parsed it, but then we ignored it. And then up here we'll find a from somewhere. Let's find it. No, there's ignore, ignore. I've got too much debug print and I can't find it here, so I'll just hunt for from with find. Okay, so there we go. There's the from, and we print the thing out. So we're getting a lot of extra stuff, so I'm going to comment out some of these debug prints. And I'm actually just going to get rid of this whole skipping of the blank line. I'm going to do it with the words: I'm going to go back to the guardian we had before. If the number of words that we got, len of words, is less than one, continue. Okay, so now this is going to be a working program. Oops, I've got to take another print statement out. Got to take another print statement out.
We sort of know what we're doing here. Okay, so this looks like a pretty safe thing: this guardian is protecting this dangerous line. I'll get rid of that one too. This is the line where our traceback was, and nothing else in this program changed from when we started, except we've added this little guardian. Now the interesting thing is, if it comes through here and prints words sub two, what happens if somehow we find a line that has from as its first word and there's only one word on it? This is going to blow up. So we can make our guardian a little stronger, and we can say, you know what, we're going to skip this line if it doesn't have three words in it. So it has to have at least three words, and if we see fewer than three words, we're going to skip it, and that just makes the guardian a bit stronger. And so the program works safely. You see this kind of thing where you sometimes want to check that your assumptions about the data are reasonable and skip things where the data is not reasonable. So that's one guardian pattern. Let me show you a slightly different way to do this, and this is with an or statement. So I'm going to take this code, copy it, put it here with or, and get rid of all this stuff. This is the guardian in a compound statement. What the or is saying is: if there are fewer than three words on the line, or if the first word is not from, continue. Now we're doing this in a particular order, because the way it works is that or is true if either the first part is true or the second part is true. But if it knows that the first part is true, then it doesn't bother checking the second part, and the checking of the second part is what blows up, what causes the traceback.
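The guardian in a compound statement can be sketched like this. The `sample` lines are assumed data chosen to include a blank line and a line containing only the word From; `found` and `reversed_ok` are hypothetical names added so the outcome can be checked.

```python
# Assumed sample lines, including the awkward cases the guardian protects against.
sample = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008',
    '',                          # blank line: no words at all
    'From',                      # starts with From but has only one word
    'X-Sieve: CMU Sieve 2.3',    # enough words, but not a From line
]

found = []
for line in sample:
    wds = line.split()
    # Guardian first: if the length test is True, "or" never evaluates
    # wds[0], so the index lookup cannot fail on a blank line.
    if len(wds) < 3 or wds[0] != 'From':
        continue
    found.append(wds[2])
print(found)   # ['Sat']

# With the tests in the other order, wds[0] runs first and fails
# on a line that has no words.
wds = ''.split()
try:
    if wds[0] != 'From' or len(wds) < 3:
        pass
    reversed_ok = True
except IndexError:
    reversed_ok = False
print(reversed_ok)   # False
```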
So if we flip this order it would fail; if we do it in this order it will work. So let's do this one right. That works. But if I get this backwards, it's going to check the first word before it checks the length, and we're going to go back to failing again. So you've got to get the order of these things right: the guardian comes first in the or, and if the guardian is true, then it doesn't check the second part. This is called short-circuit evaluation, where it knows that as long as the first part is true, it doesn't evaluate the second part. And so now we have a guardian in the compound statement. You'll see this a lot. Sometimes, if it's more complex, you do it in multiple statements, or you fall through: check for sanity, check for sanity, and only then run the code. So I hope that was useful to you, looking a little bit at how to debug where you don't just start chopping on the line that had the problem. It's not always that line, because we never did change that line. Although we did change things a little bit at the end when we added this guardian here, we had also fixed it without that. Sometimes you add some print statements to figure out what's going on before you just start chopping on that line. So again, I hope this helps. Thanks.
Hello and welcome to chapter nine. Now we're going to talk about Python dictionaries. Python dictionaries are probably the thing that most programmers love the most about Python, because they're very powerful.
They're like a little in-memory database. It's the second of our kinds of collections and probably the best collection. To review what a collection is: it is a situation where we have a variable, like a list or a dictionary, that we can put multiple pieces of information in rather than a single piece of information. And of course prior to collections, we would put something into x, and then we would put something else into x and it would be overwritten. Now with lists we can append things onto the end. And so if we compare lists and dictionaries, the list is sort of the organized version of a collection. Everything stays in order. You add something, it always adds to the end. You take something out, it sort of compacts itself. It's zero through n minus one, where n is the number of items. And so it's very organized, kind of like a Pringles can where the potato chips are nicely stacked. Dictionaries are messier. You can put things into dictionaries, but there's no real sense of order in dictionaries. Everything has a key. So you sort of throw things in, and they kind of mix around in there somehow, and you pull things out based on the key. It's like you stick a label on it, where you say, okay, I'm going to take these sunglasses, put a chuck label on them, and throw them into the dictionary. And then I say, hey, give me back chuck, and it's like, oh, here's your sunglasses, because you mark everything. The label is like the key, and the sunglasses are the value. So it's kind of like a purse, or sort of like a mess. And so the idea is you have these labels that you put on everything that you're going to throw in. Let me see if one will stick to my keys. You know what else I've got here? I'm going to stick a label on my pen, a chuck label, and I'm going to store a pen in my dictionary with the chuck label. And so it's like having a purse or a bag or a
backpack where you have things labeled, and you can throw things in and label them, and you can shout into your bag and say give me the calculator, or give me the candy, or whatever it is that you have labeled. You have to come up with the labels, and then you can use the labels to get things back out. And like I said, they're probably the most powerful thing, and they're basically this concept that's generally referred to as associative arrays, which means they're like lists, but they have these keys. The associative means the association between a key and a value, whereas in a list there's a position and a value, and the position is less powerful and less flexible. Most modern programming languages have this notion of associative arrays; if they don't, they're sort of unpopular, because once you get to using them you're like, whoa, they're so powerful. If you ever find yourself in a language that doesn't have them, you'll freak out. They have different names, like property maps or hash maps or property bags, depending on the language you're using, but they're all the same thing: key-value pairs. So the idea of a dictionary, or the idea of any collection, is putting more than one thing in, and the difference is in the ways you have of indexing it. So this line basically says let's make ourselves a dictionary, just like we constructed an empty list. And I want to store 12 into this dictionary and label it money. On the left-hand side, when we use this money, that's the label that we're going to give it. And so 12 is being placed in the dictionary; that's like taking the 12 and throwing it in the dictionary with a label of money. Three's going in with a label of candy, and 75 is going in with a label of tissues. And we say what's in there, and there's no order to it, and sometimes the order can even change inside of a dictionary, although there are more advanced versions of dictionaries that maintain some kind of order.
But for now, let's just not worry about the ordering of them. If we ask what's in there, it says there are three things in there: 12, 75, and three, stored under the keys money, tissues, and candy respectively. We can ask, using the index operator, what is purse sub candy? That's like saying, hey, give me back candy, and out comes the number three. We can update stuff too. So we can say go grab the candy value, add two to it to make five, and then store that back into candy. And so now we see that candy has been updated to be five. If you look at the difference between lists and dictionaries, they both can have new items added to them. We haven't talked a lot about deleting, but items can be deleted from them. The difference is the indexing mechanism: how we store things and how we look things up. So we make an empty list and we make an empty dictionary. In the list, we add 21 to the end and we add 183 to the end, and we print it, and position zero is 21 and position one is 183. We don't see the positions when we print it out, because they're implicit. In the dictionary, we mark 21 with age and stick it in, and mark 182 with course and stick it in. Then we print it out, and there we've got course and age mapped to their values. And we can store 23 back into age, and that overwrites, so the 21 becomes 23. We can do the same thing in a list, except we say list sub zero, because in lists the index is the position, and so this 21 becomes 23. And again, you just look at them, and you can think of each of these as doing roughly the same thing, except for the indexing mechanism. The values are the same, but the keys are different. In lists, the keys are always the position, and you don't get to assign those, other than the fact that the order in which you put things in implicitly assigns a position. In dictionaries, the key is a string. You can actually use other things.
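A rough sketch of the purse example and the list-versus-dictionary comparison described above. The variable names `lst` and `ddd` are assumptions for illustration; `purse` is the name used in the lecture.

```python
# Store values under labels (keys) instead of positions.
purse = dict()
purse['money'] = 12
purse['candy'] = 3
purse['tissues'] = 75
print(purse)
print(purse['candy'])                 # 3

# Update in place: fetch the old value, add two, store it back.
purse['candy'] = purse['candy'] + 2
print(purse['candy'])                 # 5

# Same idea side by side: a list is indexed by position...
lst = list()
lst.append(21)
lst.append(183)
lst[0] = 23
print(lst)                            # [23, 183]

# ...while a dictionary is indexed by key.
ddd = dict()
ddd['age'] = 21
ddd['course'] = 182
ddd['age'] = 23
print(ddd)
```

Python 3.7 and later happen to print dictionary entries in insertion order, but the point of the lecture still holds: you look values up by key, not by position.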
I use strings a lot in this lecture, but that's just to keep things simple until you get good at it. You can actually use numbers as the dictionary keys if you want, but the values are the things you put in and manage in those dictionaries. Just like lists, we have dictionary literals. And what's nice about dictionary literals is that they use the exact same syntax as the printout. It starts with a curly brace, ends with a curly brace, and then has a series of key colon value, key colon value, key colon value. And this is the associative array bit: we are associating one with the key chuck, 42 with the key fred, and 100 with the key jan. Then we print it out, and it looks exactly the same. The print statements in Python are nice in that you ask what's in a thing and it shows you the stuff in the same syntax that you would type into Python to write it as a constant. And if you just want an empty one, you'll also see me use dict. That is a constructor where you say make a new empty dictionary; the empty curly braces are an empty dictionary constant. These two things are pretty much the exact same thing.
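Dictionary literals and the two ways of writing an empty dictionary can be sketched as follows. The variable names `jjj` and `ooo` and the exact key spellings are assumptions based on the spoken description.

```python
# A literal uses the same syntax Python prints back.
jjj = {'chuck': 1, 'fred': 42, 'jan': 100}
print(jjj)              # {'chuck': 1, 'fred': 42, 'jan': 100}

# Empty curly braces are shorthand for calling the dict() constructor.
ooo = {}
print(ooo)              # {}
print(ooo == dict())    # True
```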
This is a shortcut for doing this: the empty curly braces are a shortcut for the construction. So up next we're going to talk about one of the really common applications of dictionaries, and that is counting.
So now we're going to talk about one of the common applications of dictionaries, and that is making histograms. It's counting the frequency of things. If you think of a histogram, it's a little graph of how many a's, how many b's, and how many c's. So there's this many of that and this many of that. These are like buckets, and these are frequencies: how many times each one happens. So that's a histogram, and we're going to count people's names, how many of each we see. But the interesting thing, just like many of the things we solve in the computer, is that we can't just sort of look at the data. We've got to look at the data iteratively, one piece of data at a time. So I'm going to give you a little problem. Okay, I'm going to show you a series of names one at a time, and I want you to make a little bucket for each name and then keep counting how many you see for each of the different names. You'll notice that you have to start with one and then you move across. So just watch this and tell me: what's the most common name in the set of names I'm about to show you, and how many times do we see it? So what was the most common name, and how many times did you see it? That's the question. Now here comes the reveal. For humans, it's so much easier to just look at this all at once. And you think, how did my brain look at that? And you're like, okay, what is pretty common? Oh, maybe Chen is common. Chen, Chen, Chen. No, maybe Jen is common: one, two, three, four. Yeah. Does anybody else have that many? Marquard's got three, csev... And so you'll notice how our minds work without computers.
We just sort of bounce around, branch and bound. We have hypotheses, and then we decide it's Jen. That's it, and there are four of them. Now, how did your brain think about this as we were going through them one at a time? Well, my guess is, if you really had to do this a lot, you would make a little picture like this. And then what you would do is, if you saw a new name, you know, xyz, you'd add it to the list and give it a tick mark of one. And then if you saw, like, csev again, you'd give that a tick mark. And if you saw xyz again, you'd make a tick mark. And you'd keep adding to these tick marks, right? That's how you would do it, and like many of the things we do in a loop, you wouldn't really know what the most common one was until the end. Then you'd take a look at these numbers and say, okay, that's the most common name, and you'd be done. But you have to watch them one at a time; you can't just bounce around. And so that's how we're going to use dictionaries to achieve this. Again, instinctively as humans we just look at the stuff, but if you had a million things you'd probably want to write a Python program and use dictionaries. So this is the idea, and there are two basic things that happen. When you see a name, you've got to ask: is this name there already? If it's there already, you really just want to add one to it, right?
That's the adding of a tick. Or, if you're seeing it for the first time, you add the name and give it a one. And so you can use the name as the key and one as the value. The first time you see chen, you stick a one in there. So at this point the dictionary is sort of dynamically growing: as soon as it sees a new name, it adds another slot in here. But then if you see the same name again, like chen again, then you take the one and add one to it, and so at that point chen is two. And so you can see how you can either extend the dictionary when encountering a new name, or add to a count when you see a name that you've already seen before. The problem with dictionaries is, like everything in Python, there are rules about what you can and can't do, and one of the, I think, kind of frustrating things about dictionaries is that you can't just look up a key that doesn't exist. So this is a fresh, brand new dictionary. We do a constructor there, and we print out ccc sub csev, and boom, it blows up, and that's bad. But we can solve this with the in operator. The in operator we've used in the for loop, we've used it with lists, and we've used it with strings. It asks a question: is csev in ccc? Well, ccc is this empty dictionary, and so no, csev is not in ccc. And so by using this in operator we can avoid the traceback. We can say, if it's not there, put it in; if it is there, add one to it. And that leads us to this bit of code. Okay, and that is the kind of code that we're going to use to build a histogram. This is going to be histogram code, okay? And so this is going to have name as our iteration variable going through names. Sorry, I made them singular and plural; that's nice. So name is going to be csev, chen, csev, gen.
Normally we'll be reading this from a file, but for now we'll keep it easy. We're going to go through this list, and we're going to have counts as our dictionary. That starts out empty, and we're going to do a simple if-then-else every time through the loop. If the name we're looking at is not already in the dictionary as a key, then set it to one. If it is, go get the old value, counts sub name, add one to it, and stick it back in. So this line right here is adding a new thing, and this line right here is adding to an existing thing. You start with an empty dictionary, you do this long enough, and at the very end it will print out the histogram you're looking for. And so you say, oh, we've seen csev twice, gen once, and chen twice. That's the idea, and this can run a million times if you want. Now, this notion of checking to see if a key exists, and doing one thing if it doesn't exist and another thing if it does, is such a common practice that the dictionary object has a method called get that collapses these four lines into one line. The idea is that if the key is in there, you retrieve the current value; otherwise, you pick a default value. In this case we'll pick one, I mean, pick zero. That's the default, meaning what you get back when the key is not there. So you say counts, and counts is a dictionary, dot get; that's a method, like string dot upper. You give it a key and then a default. If the key exists, you get back what's stored under the key; if the key doesn't exist, you get the default. Okay, and this works with no traceback.
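As a rough sketch of the if-then-else histogram and the get lookup just described: the names below are spelled the way they sound in the recording, so treat the exact spellings as an assumption rather than the slide's actual data.

```python
# Names as heard in the recording; the exact spellings are assumed.
names = ["csev", "chen", "csev", "gen", "chen"]

counts = dict()
for name in names:
    if name not in counts:
        # First time we see this name: create the entry.
        counts[name] = 1
    else:
        # Seen before: fetch the old value, add one, store it back.
        counts[name] = counts[name] + 1

print(counts)

# get() returns the stored value, or the default when the key is missing,
# and never raises a traceback for a missing key.
seen = counts.get("csev", 0)
missing = counts.get("nobody", 0)
print(seen, missing)
```

Looking up counts["nobody"] directly would raise a KeyError, which is the traceback the in check and get are both there to avoid.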
So the best way to think about this is that those four lines are equal to that one line, because x is either going to be whatever was in there before, if it exists, or it's going to be zero. Now, the nice thing about zero is that the next thing we're going to do is add one to it, so that's going to get us to one. So, collapsing that loop we saw before, we can make the body of the loop just one line, and this will become an idiom, something that you will get used to and use over and over and over again. Right now you're looking at it and thinking, boy, that's a lot of syntax and punctuation and whatever. After a while you'll just type this and not even think about it. It's an idiom, and included in this idiom is how to both create new entries in dictionaries and update existing entries by adding one to them. Everything else is the same: name is going to go through these five values, and we're going to say counts sub name equals counts dot get, name comma zero, plus one. So if, for example, this name already has a one in it, then this is going to be one plus one, which becomes two. If it's not there, it's going to be zero plus one, which equals one. That's the idea: if it's new, set it to one, not zero, because the first time you see something the count should be one, not zero. That's why we make zero the default. Now, get can be used for anything. It just so happens that zero is a common default, because it's really common that we're using this to make a histogram, a little histogram of a, b, c. And if we need to add a d, then its count in the histogram has to start at one. So that's basically the simplified counting with get. There are a lot of things we're going to do in Python that have to do with frequencies and how many times certain things happened, and this pattern is a really good pattern to absolutely know. So now what we're going to do is we're going
to Switch from just looping through strings instead loop through files And we're going to it's going to take a little bit of work because we have to open the file And we'll bring a lot of things together at this point So here would be another task and that is here's a bunch of text from the book And uh, you can just split this into words and count and find out what the most common Common word is and how many times it uh How many times it occurs so go ahead and try to do this for a second feel free to pause Actually, don't bother pausing. This is too hard. We should write a program for this It's not it's not easy humans. Don't like this. It makes you concentrate Um, and so here is a counting pattern where we're going to take a line and then later we'll read this in a file and so We're this is just an adaptation improvement of the previous thing. So we're going to start with an empty dictionary We're going to ask for a line of text and read it in and then we're going to use split So remember the list of words. Well, what we're going to get here is a list of words We'll print it out and we'll run this counting. 
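The counting pattern described here can be sketched roughly as follows. The sample sentence is made up for illustration (chosen so that "the" appears seven times, matching the result discussed in the lecture), and count_words is a hypothetical helper name. The lecture's version reads the line with input(), which is replaced by a fixed string here so the sketch runs on its own.

```python
def count_words(line):
    """Return a dictionary mapping each word in line to how often it occurs."""
    counts = dict()
    words = line.split()  # split on whitespace into a list of words
    for word in words:
        # The idiom: create the entry at 1 or add one to the existing count.
        counts[word] = counts.get(word, 0) + 1
    return counts


# Illustrative text, not necessarily the text used in the lecture.
line = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")
counts = count_words(line)
print(counts)
```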
This is the this is a little loop For every word in whatever this was we're going to do this idiom of Add either adding a new entry or adding one to an existing entry and then printing that out So let's take a look at what we get there So if we run this We can give it some text and I've got this this would be all one line And then it splits it into words and you see that these words here are split split split split I mean that's strings and splits remember strings and lists And split and so now the counting is going to go through this list the clown ran after the and it's going to build a histogram the Clown, you know one clown the up up up of these things are going to go up Right, that's this histogram and then when it's all said and done We end up with the histogram and so counts is the dictionary that ends up with a histogram And we can side by inspection see oh the is the most common word and there are seven of those right So if we sort of take a look at this we start out we make a dictionary we read in a line of text the text goes in We um and then we split that and we print the words out. So these are the words Right, then we have a for loop that's going to loop through all those things and then produce a dictionary And when we print the dictionary out, that's what we're going to get and the seven, okay So that's one line of text That's how you walk across the words in a line of text after you split the line into separate words So now we're going to look at ways that you can loop through dictionaries We just produced a loop that can build a dictionary But now we're going to going to look at a dictionary and so we'll start with a very very simple example And then we'll work to a slightly more complex example. 
So here's a dictionary, just a constant: chuck is 1, fred is 42, and jan is 100. We're going to use a definite loop, for key in counts. Now, the variable doesn't have to be named key, but key is a good name because these are keys and values, k, v, k, v. I just mentally think of this as keys and values. This iteration variable is going to walk the keys. It's not going to walk the values; it's going to walk the keys: chuck, fred, jan, and not necessarily in that particular order. As you see, it goes jan, chuck, fred. Just because I typed it in in this order doesn't mean it stays in that order; it's not like a list. It might move around a little bit as we add data to it or as we set the data up. So in the loop you can get the key, and that's what prints out jan, chuck, fred. But then you can also get the corresponding count for each one of these by just pulling it out of the array, I mean, pulling it out of the dictionary. So we pull out the corresponding value, and we print out jan 100, chuck 1, fred 42, and that runs this loop three times. So if you just use in and you give it a dictionary here (remember all the different things we've been able to put there at the end of a for loop; a dictionary is another thing we can put there), we walk through the keys. Now, there are a couple of methods that allow us to get the keys. We can say, turn this dictionary into a list, and we get a list of the keys. You can also get the keys by using the keys method: take this dictionary jjj and give me all the keys, which gives me a list, which is kind of the same thing. And then we can ask for the values, and that gives me just the values extracted out of this dictionary.
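A small sketch of these ways of getting at keys and values, using the same three entries. Two things here go beyond what the lecture says. In Python 3.7 and later, dictionaries keep insertion order, so on a current interpreter the keys come out in the order they were typed, while on older versions the order can differ as described. And in Python 3, keys() and values() return view objects, so the sketch wraps them in list() to get actual lists.

```python
counts = {"chuck": 1, "fred": 42, "jan": 100}

# A for loop over a dictionary walks the keys, not the values.
for key in counts:
    print(key, counts[key])

keys_from_list = list(counts)            # converting a dict to a list gives its keys
keys_from_method = list(counts.keys())   # same keys, via the keys() method
values = list(counts.values())           # values, in the same order as the keys

print(keys_from_list)
print(keys_from_method)
print(values)
```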
So that's nice Um now the one thing is is that while I said you can't predict the order if if in two statements You ask for the keys and then the values they at least come out in the same order Even though you can't necessarily predict the order that they come out they come out in the same order And then there is a third Thing that we can do and that is list of ask for the items and we can say give me the items and that gives us a list this is our first really kind of composite combined data structure where it is a list a three item list Zero one two and inside that there is what are called two tuples jan maps to 100 chuck maps to one fred maps to 42 Coming up next we're going to have a whole chapter on that And so just take a look at that for the moment and we will come back to that in some detail later This whole items idea that gives us back a list of key value pairs Because it's not just a list of keys or a list of values It's actually a list of key value pairs allows us to write in python a very clever and elegant loop What we can do is actually this items gives us back each Item in the list has a key and a value and we can actually take two iteration variables for a comma bbb This is two iteration variables. And if you're coming from another programming language, this is super cool And it's a python only feature. 
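Here is a minimal sketch of items() and the two-iteration-variable loop, with the same three entries as before. The variable names key and value are chosen for illustration. In Python 3, items() returns a view, so list() is used to show the list of two-tuples.

```python
counts = {"chuck": 1, "fred": 42, "jan": 100}

# items() yields (key, value) pairs; list() turns the view into a list of tuples.
pairs = list(counts.items())
print(pairs)

# Two iteration variables: each pair is unpacked into key and value.
collected = []
for key, value in counts.items():
    print(key, value)
    collected.append((key, value))
```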
I've never seen another language that's capable of doing something this simple and that elegantly So what this basically does it says we're going to simultaneously advance these two iteration variables So this is going to be the key in the value the k and the v Key in the value is going to be chuck one then it's then they're both going to advance fred 42 Jan 100 and so that means in this simple loop if we just print them out We're going to get the key value pairs of course in the order And so it's sort of a and bbb simultaneously walk down These key value pairs and so that's really pretty and it makes for a very succinct loop It's the syntax is a little sort of disquieting when you first see it But it's a super elegant thing and you just have to say items if you If you don't say items you just get the keys if you say items you get the key value pairs and You have to have two iteration variables If you don't have two iteration variables and use items it'll complain and say what are you doing? 
I'm giving you two things and you don't have two variables to receive them So two iteration variables and items are basically related Now we're going to take a look and this is code that I showed you perhaps many weeks ago About I said this is a little story about how to read a file and count all the words in the file And now we're back to it and at this point you should understand every single character of this program Every single concept of the program You should literally stare at this and look at it code it play with it until you absolutely understand it So let's take a look Again, I showed you this weeks ago So we're going to ask for a file name Then we're going to open the file name Then we're going to make an empty dictionary again This is all stuff you've done before and then we're going to have an iteration variable That's going to go through the lines in the file Right, so line is going to go line line line Then we are going to split that line each line into words chop chop chop chop So that's words is the list of the words in one line We're inside of a loop that's going to go through all the lines And then what we're going to do is we're going to write the have the word iteration iterate through each word in the line And then we're going to do is take each word in the line I'm going to do this histogram Right, so we're going this this is going to run not only just for every line But for every word in every line So we have a nested loop for every line then we split it and then we go across the line So it's almost like a typewriter where we go That's what we're doing So it's like the outer loop is going down down down the lines And the inner loop is going across across across the words And eventually we are going to see in this middle in this last line Every single word in the file and we're going to do the counts get word plus one Which is our magic histogram making line that if you don't remember what that is Go back a couple of slides. 
I just talked about it. At this point in the code (and it's important to be able to draw these lines), you have the histogram, and it's in the variable counts. Now we want to find the largest one. We have written loops that can find the largest value in a list, but now we want to find the largest value in the key-value pairs of a dictionary. So we're going to keep track of what the largest count is and the word that has that count, and we're going to set them both to None, because we have to prime our loop. Then we're going to write one of these cool things that says for word, count in counts dot items. Word and count are going to go through the key-value pairs, because we've got items here. Whatever was in there, and there could be a million words in here, we're going to go through every one. Bigcount is the largest count we've seen so far. If it's None, well, then we haven't seen anything yet; or if the count we just read is greater than the bigcount so far, we jump in. This is sort of like, oh, this is a new personal best count for this particular data set. And so we're going to remember the word in bigword and the count in bigcount. So this is just a max loop, a maximum loop with the extra thing that we're recording, in addition to which count is the largest, the word that was associated with that count. So again, this is the starting part of the loop, then we do some work, and when we exit the bottom of this, bigword is going to be the word that is the most common and bigcount is the number of times it occurs. And so if we run a file, we say, oh, in that file, to is the most common word, and it occurs 16 times. If we run the clown file, well, the
is the most common word in seven And so this now is can and this could have a very large file and give you the most common word and so that is Sort of a really good application of dictionaries. So dictionaries are the most powerful. Well Of the they're the most powerful collection. We've seen so far Um, it is good to see both lists and dictionaries to understand what quest Collections are they are things Inside of python that can handle more than one item inside of it and we'll learn about another collection about tuples in a second Just Understand the get method because that leads to very compact code Um understanding there are various ways to iterate through dictionaries and so we've learned a lot but in the next Section we will learn even more and put these together and do some sorting and do some other stuff And really start to see the real power of dictionaries Uh, this is i'm going to do some coding. It's related to the Dictionaries chapter chapter nine and we're going to do some word counting. That's uh, it's basically right out of the slides for But i'm going to just write the code in front of you rather than uh, have you look at it in the book? So what we're going to do is i've got my text editor up here And uh, let me start by making a new folder new folder For my chapter nine exercise And then i'm going to go and make an untitled file. That was some of the previous one And i'll do what i always do print Hello and save it and save it here into exercise 09 and x09.py So now i have a folder that's in my py4e folder Uh, and that happens to be in my desktop py4e Is my folder on my desktop And now i have all of these subfolders cd ex08 ls is dur on windows ls oops i gotta go up one ex09 ls So i've got that file right there now i'm gonna want to read some files and so i'm gonna bring some files down Uh, a couple of files Um python for everybody code 3 intro dot txt. 
So i've got this url And i'm going to save it save page as and it's really important that i save it in the same folder as i'm going to write my code Just so that when i open this file, it knows where it's at. So i've saved that one and i'm going to also Take this clown text. I'll use this Uh to make my life simple So i have a real short thing that i can show you how it works And so now if i go back to my terminal I see i've got exercise o9 python intro dot txt and clown dot txt. Okay So Let's go back to my text editor and get started Uh, i will prompt for the file name Uh input Enter file Con space Now i'm going to do something if the length of the f Name that i just read is less than one i'm going to say f name equals clown txt i do this So that i can just hit enter and it defaults to clown dot txt if i want to give it a different name I can so this if i just hit enter at this prompt then this will give me a string that's zero length So if it's less than one i'll just assume that so let me open that handle equals open f name And let's read through it for line in Handle and we'll strip it Line equals line dot r strip to take the white space off the right hand side and then we're going to save print line Again, I i'm not just doing this I really when I write code. I just saved it when I write code I do these kind of stuff all the time just for my own sanity checking And so now i'm going to run python 3 p x09.py Just to test that I want to hit enter now and it's going to assume hopefully clown dot txt if it all goes well and yep it read It read one line. Okay, so that part's working. I'll just leave that print statement in The next thing I want to do is kind of a classic Thing where we're going to go read a bunch of lines and then go horizontally across those lines and words So i'm going to split that wds equals line dot split and print Wds So i'll print that And i'm going to save it and test it. I I really Love to test things over and over. 
There's the actual line this file clown dot txt only has one line And it breaks it into words and so I have those words. Let's just run it again With um intro dot txt So this will have a lot of lines line line line line lots of lines Every line has it prints out the line and then prints out the words that we split it into Okay, so now I kind of one of the things that I do here Is I want to believe now I sort of can believe everything from here up Like oh, it's going to open the file. It's going to read through the lines I'm going to split them into words and so then I'll just kind of behind it. I'll just say, okay, I'll just I'll just Comment that out now. I need it another for loop for W in Wds now words is a python list It has some number of words in it zero or twelve or whatever was on the line And now i'm going to print out the word Okay And so now it will go through that horizontally. I'll now just do clown dot txt So that you see I i'm not printing the line out That's the words that have been parsed from the split from the line And now we've got this loop now one of the things that's interesting is just to make sure that you're you're going through all the words And I I like a print statement here to know that w is going to successfully take on literally All the words of this file. 
So if I comment this print statement out and I run it again with clown dot txt, that for loop starting from here sees every word in that file, which happens to be only one line. But now if I do the same thing for intro dot txt, it's just going to go through the words. In a sense, by nesting these two loops, we're going to hit all the lines, and that's a lot of stuff, but it hit all of the lines and all of the words, and away we go. Okay, so here's where a dictionary comes in. I'm going to make a variable called di, for dictionary, and I'm going to say, give me a dictionary. Now, dict is not a name you get to choose; that's the type, and calling it makes a dictionary. di is a variable name that I chose. Okay, so the key thing with this dictionary is that we're going to make a counter, and we're going to use w, the word, absorb, elegant, whatever, as the index. The simple thing to do is to say, if w is in di, then the dictionary sub the word, where the word is our key into the key-value store of the dictionary, is equal to the value that we had before in that slot, di sub w, plus one. And if it's not in there, else, di sub w equals one, and I'm going to print new. So every time we see a new word it's going to say new, and I'm also going to print w and the current value of the counter for w as it's going through. Now notice how far in I've indented; this is all part of this inner loop. So this is the loop that's going to run for every single word. Okay, and I'm going to run this first with clown so it goes slowly. Okay, so we saw the was new, and the count is one. Clown is new, the count is one. Ran is new, the count is one. After is new, the count is one. Now we saw the again, and this time we made the count be two. Let's print here; I'll say existing, so you can see it in the output. I'm printing this.
Let's make it even a little more verbose print w And then I will make it so it prints the it prints the word before and the count after and then whether it's existing or new So we're put a lot of print statements in print statements are cheap Okay, so now we see the word the it's the first time we see it and we set it to one We see the clown the first time we see it. We set it to one. We see ran new one Later on later on we see the it's already in so existing means It was already in the dictionary w as a key was already in the dictionary Okay, and so that's why we added one to it. So the old value was one and then we added d i sub the equals d i sub d i sub the equals d i sub the plus one w is the string the That's what that what that string is Okay, and so So we've made it all the way through and you see is the in this one line occurred ultimately seven times So now I want to print out the contents of this dictionary At the very end of both loops. So I got it de-indent twice And so that will give us the counts And so this is what we get when it's all said and done, you know the happened seven times But it just worked through it was way through Okay So you got that now This is a pretty verbose way of doing this But I did it sort of the slow way to show that there are two situations If it's already there you increment it and if it's not there you set it to one Effectively inserting it right so you insert it and set it to one with this d i sub the equals one Okay, but let's get a little less verbose here get rid of some of these print statements because we kind of covered all that Um Get rid of this line and go back to printing w and di w at the end We'll leave that one in so what I want to do is I want to look at this bit of code right here this if w in di ls The We do this so much with dictionaries that there is an easy mechanism to Do this that combines these four lines into a single kind of contraction And so I'm going to do this I'm going to print Let's put two 
stars out then the word and di dot yet of the word comma negative 99 Okay, and so this will this this di dot get of the word is the important part The way it is is this is a dictionary Dot get says In its first parameters the key to look up which is word like the or fell or clown or whatever And 99 is the default value that we get if the key doesn't exist So this is an effect an if then else right This little di dot get w 99 negative 99 is you know If it's in there do one thing if it's not in there do something else Okay, so so let me show you how this works and you'll see that the 99 will happen when Okay, so the first time we see the get returns 99 All right, so let's move it over here the first time we see the The is not in the dictionary So this di dot get of the word the in the dictionary Gives us back the negative 99 Okay, and this still is working and so the is one clown is whatever but away we go Okay, let's do it this way. Let me comment this out. Let me comment this one out and run it again So it's a little clearer what's going on Okay So the first time we see the The is not in the dictionary the first time we see clown and we know it's negative 99 negative 99 But here we asked for it and the is one because We've seen it before And so that's just this get mechanism allows us To get the new value or get a value out if the key exists and specify a a um Default if it's not there. So I'm going to go old count Equals ti dot get W comment 90 comma zero So instead of using 99 here. I'm going to just get rid of all this Is what I'm saying is look up in this dictionary get is a function. 
That's part of all dictionaries. Look up using the key w, which is, say, the, and if I don't find it, give me back zero. And so I'm going to say print the word, comma, old, comma, oldcount. Whatever the old count is, it's either the value that was in there or zero, and now I can say newcount equals oldcount plus one. And now let's print new and newcount, and I can say the dictionary sub the word is equal to newcount. So I'm going to get rid of this if-then-else. This is basically saying: look up the old count that we have, and if you don't find one, use zero. We'll print that out, and then afterwards I'll print the new count. Let me put in some of these blanks. And you can see the old count for the, because the doesn't exist yet, was zero, and the new one is one. Clown's old is zero, new is one. Ran, old zero. But now we get to the again: its old count was one, and now its new count is two. Okay. So by using this get and saying, if we don't find it, we'll assume the count is zero, that makes a lot of sense, right? If the key is not there, the count is zero. So that's what this line does: get the value associated with the key, or give me zero back. And then I can take that old number, add one to it, and stick it back in. Now, this is ultimately not how we tend to do it. We tend to blend this all into one big long statement: di sub w equals this part, plus one. That says, get the old value from this key, or zero, and then add one to it, and it combines all of these lines into a single line. So I'm going to delete them now. And now we've combined this all into one, which is effectively an idiom: retrieve, create, update counter, all in one line. I'll still print out, in this case I'll just say di sub w, and then we'll see the counter.
Okay And so now I'll run this We don't it's it's we have a new but now we see it the second time it's two And so we see car the first time we see that the second time we see car I mean the third time we see car the second time and away we go Okay, and so that's pretty straightforward and so it really kind of typo there so let's just get rid of that And run it with the clown stuff and we get the right data there and Let's run it with intro dot txt And there we go. Okay, and so it's it's tearing out a bunch of words and giving us a dictionary So that was a lot of work to get to this line 16 that has the dictionary in it Now we want to find the The most common word And so we're going to loop through this dictionary and part of it is like once we printed this dictionary out and we verified that it's right Don't worry too much about the code up here, right? As a matter of fact, I can take out some of these print statements And we can kind of trust all this And so now we're going to work on this Okay, now we want to find the most common word now. This is like a maximum loop. So if you recall um We have a whole set of key value pairs communicate goes to two is to two Skills is three. So we have these key value pairs and we're going to loop through and look for the maximum now in a dictionary We can loop through the key value pairs with the following syntax for You know, I would call this these variables k and v for key and value but uh, yeah in The dictionary's name dot items And items is a method inside of all dictionaries that says give me The key value pairs and we need two iteration variables So this is like an assignment statement for k and v k and v take on the successive values for the keys The key and the value, okay? So if I just now print k comma v And I'll take this print statement out And then run the code On oops what I forgot to oh, I fell back into my python two days Any parentheses for my print? 
So there's clown and it just prints it out and it's kind of the same thing except it's pretty where we're putting each one on a line Okay, so the k the v is the value. So we're looking for the largest value. Oops So the thing is we know that the values are are always uh numbers that are greater than one. So i'm gonna i'm gonna do kind of a A quickie maximum loop largest Equals negative one now in previous times we've seen that this is a bad assumption But because we know these are counters that are always positive It turns out this is not a bad not a bad idea And so I can say if the value is greater than the largest we've seen so far Largest Equals the value Okay, and when that loop is all done we can print the largest Okay, and so this is just a max loop and we're we're using this value. That's the number the values the second thing Oops I can't type python Oh, it's a typo Yeah, I'm not using value. I'm using v So largest equals v. Let's try it again Okay, so we're all done with seven. So these were the things that we were looking for And it was looking for the maximum and it just dutifully found seven was the largest But we also want to know what the word is And so what we can say here is we can say the word Is none Meaning it's it's it's just like we don't know what the word is And then whenever we catch this new largest number We say the word equals w. So we're so I like to think of this as capture Remember The word That was largest All right, that's what i'm doing our remem em Remember Our m m b r. There we go Um, so we're gonna this this trick here is Not only knowing what the largest number was but the word that was associated with the largest number So now I can print out at the end the word And the largest and that's the count Okay, and so we now we know that Oops, did we make a mistake here? Okay, that does not look good Because it says car And seven v is greater than the largest Oh It's not w I used a really bad variable. 
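For reference, here is a sketch of the maximum loop with the fix applied, that is, remembering k (the key from the current pair) rather than w, which is left over from the earlier word loop and still holds the last word in the file. The dictionary contents are made up for illustration.

```python
# Illustrative counts; in the walkthrough these come from the word loop.
di = {"the": 7, "clown": 2, "car": 3, "tent": 2}

largest = -1       # safe starting point only because counts are always positive
theword = None     # no word captured yet
for k, v in di.items():
    if v > largest:
        largest = v
        theword = k   # capture the key that goes with the new largest count

print(theword, largest)
```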
See that's the whole value there. There we go. It's k, which is the key key I was gonna say that was quite the bug See what happened there. I had this as w and it just happened to be it was the last word on the file car the last word in the file Because I used the wrong variable No little mistakes little mistakes The and seven Okay, so let's let's get rid of this print statement because we kind of know what's going on here And uh a way we go and this should now work if we run it And we can even get rid of the word done here There we go. The seven now the cool thing about this is this code runs just as easily with one line of code or the intro of the book intro dot txt and Not surprisingly that's still the most common word in the introduction dot txt. I seem to like that word and it's 226 times Okay, and so that is the basic pattern of Reading some this is just a word loop now. Sometimes there would be some You know checking to see if the line is the one you're interested in maybe tearing apart the line but it's at the end of the day this idiom of Starting a dictionary now. It's a common problem to to know where to start the dictionary You want to accumulate the numbers for the whole file. So you don't want to put it in here between line six and line seven Okay, so I hope that particular thing helps a little bit Helps you understand dictionaries Hello and welcome to chapter 10 now We're going to talk about our third kind of collection called tuples, but tuples are really a lot like this There's not too much to them They're really kind of reductionist version of lists there. 
So they function very much like lists, in that they hold things. The difference is that there are no square brackets; there are parentheses, round braces, whatever you call them. They have positions zero, one, and two, just like a list, and you can look things up: x[2] is the third element here, so that prints out Joseph. You can make a tuple here; this is the constant syntax for a tuple. Print that out, and the print statement shows you that this is a tuple, not a list, by showing you round parentheses. And a whole bunch of functions that work with lists work the same way with tuples. You can put a tuple after the in of a for statement, as you might expect, and then it iterates through the tuple. Tuples maintain order, so it prints out one, nine, and two. Literally, this bit of code here could be identical whether it was a list or a tuple; it really would do the exact same thing. The difference is that tuples are immutable. Once you create the tuple, you can assign it to a variable, but you can't modify it. You can modify a list. So if we take a look at a list here, we make a list that's nine, eight, seven, and we say x[2] = 6. Well, that just means this seven becomes a six, and that's just natural, meaning we can reassign slots. We can delete things. We can insert things. We can mutate them. We can change them.
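A minimal sketch of the tuple basics described above. The element values other than Joseph and the numbers 1, 9, 2 and 9, 8, 7 are illustrative assumptions, not necessarily what is on the slide.

```python
# A tuple literal uses parentheses; positions start at zero, as with lists.
x = ('Glenn', 'Sally', 'Joseph')   # the first two names are assumed for illustration
third = x[2]                       # look up the third element
print(third)

# Many functions that accept lists accept tuples too.
y = (1, 9, 2)
print(y)                           # printed with round parentheses
biggest = max(y)

# Iterating over a tuple visits the items in order.
seen = []
for item in y:
    print(item)
    seen.append(item)

# A list, by contrast, can be changed in place.
nums = [9, 8, 7]
nums[2] = 6                        # the 7 becomes a 6
print(nums)
```

The loop over y would read exactly the same if y were a list, which is the point made in the lecture.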
So they're changeable. Right, they're changeable. But if we try to do that same thing with a string, so we say y equals 'ABC', and we know that this is positions zero, one, and two, and we try to change the C to a D by saying y[2] = 'D', that is not allowed, and it says it doesn't support item assignment. This little bracket syntax, x[2] on the left side, is what they call item assignment inside of Python. And so if we do the same thing with a three-element tuple, put that in z, and try to change this slot to be a zero, it's going to blow up, because it's the exact same thing. That has to do with the fact that once this assignment is made, this is not modifiable. Now, it turns out that the reason it's not modifiable is efficiency. Tuples take up less storage, they are quicker to access, and they're designed internally, behind the scenes, in ways we don't really need to understand. They're just more efficient than lists. If all you want to do is store a list and look at it and then throw it away, you probably should use a tuple instead. So there are a lot of things that you can do with lists that you can't do with tuples, but they're really just a corollary of this notion of immutability. Like, you can sort a list in place, but you can't sort a tuple in place. You can append a five to the end of three, two, one; you can't do that in a tuple, but you can in a list. Reverse the order, and so on. So anything that you can do to a list that modifies the list is not allowed for tuples. And you can take a look at the methods that are part of each: a list has append, count, extend, index, insert, pop, and more. Many of these are modifying, and count and index are the only ones that work for tuples. So tuples are limited lists. Now at some point there's going to be a "but" here to say, why do we like them?
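A small sketch of the item-assignment failures and the method comparison described above. The tuple values are assumptions chosen for illustration; the errors are caught here so the script runs to the end.

```python
# Strings do not allow item assignment.
y = 'ABC'
try:
    y[2] = 'D'
except TypeError as err:
    string_error = str(err)
    print(string_error)

# Tuples fail the same way, for the same reason.
z = (5, 4, 3)                      # illustrative three-element tuple
try:
    z[2] = 0
except TypeError as err:
    tuple_error = str(err)
    print(tuple_error)

# Compare the public methods of lists and tuples.
list_methods = [name for name in dir(list) if not name.startswith('_')]
tuple_methods = [name for name in dir(tuple) if not name.startswith('_')]
print(list_methods)
print(tuple_methods)
```

The second list is much shorter because every method that would modify the object is missing.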
And the reason that we like them is that they're just more efficient. Python, in its own internal organization of these objects, doesn't have to build in the ability to modify them. It knows they will never be modified, because when you make a tuple, you as the programmer are saying, I'm never going to modify this, and Python won't let you do it. So it's higher performance and better memory use, and to a beginning programmer that doesn't really matter, but that's the reason. And so we tend to use tuples in situations where we're going to make a temporary variable, use it just a little bit, and then throw it away without really messing with it. And we tend to use lists to build things up, etc. So the other thing that's interesting about tuples, and we've actually sort of seen this, is that you can put a tuple that includes variables on the left side of an assignment. This takes a little getting used to, but it's really cool, and no other language that I know of does this. So if we say x, y, that's a two-tuple, and both have to be variables. You can't put constants on this side. It's like saying x equals 4 and y equals 'fred'. So what happens is you put a tuple on the other side of the assignment statement, and the four goes to x and the fred goes to y. And you say, what's in y? Well, y is indeed fred. So this is like two assignment statements. Now, the way I've got this, I would probably write two separate statements, just so I'm not showing off that I know how to do tuples. Here's another one, and the values just move across correspondingly. If you have three here and two here, or two here and three here, and the numbers don't match, you get in some trouble. Now if you just say x equals a tuple, then the whole tuple goes into x. But this is just a simple, straight 99 value going into a. So you can put tuples on the left-hand side, and you can even do things like return a tuple from
functions. That's a real nice Python feature that I like a lot. Tuples are also related to dictionaries, as we've seen in the previous chapter. So here we make a little dictionary. We make an empty dictionary by constructing an empty dictionary and stick it in d. So d is sort of like this place that can hold key-value pairs. We put in csev with a two, and cwen with a four. So we have this associative mapping between csev and two and between cwen and four. All stuff we know. And now we say, hey, we're going to loop through the key-value pairs here, and we've seen this syntax before: k, v. This is a tuple. You can think of it as each one of these pairs getting assigned into this tuple, which means the first one is the key and the second one is the value. I use the variables k and v all the time in code that I write, just for my own sanity. So k and v are going to iterate through the successive keys and values. This loop is going to run twice, and k, v is going to be csev, 2 and then cwen, 4. The order just happened to stay the same. And so if you ask what is in one of these things, you can actually take d.items(), the items method within that dictionary, and say, hey, give me back what's in there.
Get that back in tups and then print tups. It's a special kind of class, but really, ultimately, it is a list of tuples. This is the zeroth one and this is the first one, and within each you have a two-tuple. And so in a sense this k and v are iterating through those things, whether we put d.items() here or d.items() there. One nice thing about tuples is that they're comparable. They're comparable in the same way that strings are comparable, meaning that they're compared from left to right, with the leftmost, position zero, being the most significant. And Python doesn't compare any further than it has to. Say it's asking less than. If it's looking at this first pair of tuples, it starts at the left and asks the question, true or false: is zero less than five? The answer is true. And so the answer to this overall expression is true, and it doesn't even compare the second and third numbers. If, on the other hand, we're asking whether this is less than that, it looks at the first position and asks if it can answer the question. Well, they're both zero, so it can't answer the question. So it has to go to the second pair, and one is less than three. That means this is true, and it does not check the third pair. Even though 20 million is bigger than four, it doesn't matter, because these are the numbers that cause the true to happen. And the same is true if you do this with strings. Again, we start with the first one: Jones and Jones. Well, that's the same, so we don't know the answer yet. And so Sally and Sam. Okay, S and S are the same, a and a are the same. Oh, l and m: l is less than m. So the actual letters that make the difference here are the l and the m, and that leads to this being true. So it goes left to right across the tuple, and even within the strings it's going left to right.
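The left-to-right comparisons walked through above can be tried directly. This is a sketch; the exact numbers in the first two tuples are assumptions that follow the spoken description (zero against five, one against three, 20 million against four).

```python
# Decided at position 0: 0 < 5, so the rest is never examined.
first = (0, 1, 2) < (5, 1, 2)

# Position 0 ties (0 and 0), so position 1 decides: 1 < 3.
# The large third number is never looked at.
second = (0, 1, 20000000) < (0, 3, 4)

# Strings inside tuples are compared character by character:
# 'Jones' ties with 'Jones', then 'Sally' < 'Sam' because 'l' < 'm'.
third = ('Jones', 'Sally') < ('Jones', 'Sam')

print(first, second, third)
```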
That's just how string comparison works. And if we ask, is Jones, Sally greater than Adams, Sam? Well, we check the first one, and we check the J and the A. J is greater than A, and so we don't have to look at anything else. We don't have to look at any more of these characters, and we don't have to look at the second thing in the tuple. That is enough to be true. So it only scans until it has a definitive answer; it doesn't scan any further. So now what we're going to do is use this comparability to sort lists of tuples, and then bring this all back and connect it to dictionaries. We can take advantage of the notion of comparing tuples and use sorting. What we're going to produce is a list of tuples, and then we're going to sort it. We can get a list of tuples from a dictionary, and then we can sort that list of tuples. So we can sort a dictionary's items with this two-step process: convert the dictionary to a list, sort the list, and then we have the dictionary's items in sorted order. Okay, and so we'll do this a couple of different times. If we take a look at this code right here, we have our happy little dictionary a, b, c: a maps to 10, b maps to 1, c maps to 22. If we ask for d.items(), what are we going to get here?
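To make the question concrete, here is a minimal sketch of that dictionary, its items, and what the built-in sorted() does with them. The keys are inserted out of order on purpose so the effect of sorting is visible; that ordering is an assumption for illustration.

```python
d = {'a': 10, 'c': 22, 'b': 1}

# items() yields (key, value) tuples in the dictionary's own order.
items = list(d.items())
print(items)

# sorted() returns a new list; tuples compare on the key first,
# and because keys are unique the values never break a tie.
by_key = sorted(d.items())
print(by_key)

# The same thing written as a loop that visits keys in order.
for k, v in sorted(d.items()):
    print(k, v)
```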
Well, it comes out with the mapping the right way, but the order is whatever. And now we use this function called sorted, which takes a sequence as its argument and returns a sorted version of it, a list that's sorted. So it says sorted(d.items()). It's basically going to take this list and compare the a's and the c's and the b's, and because it's a dictionary and all the keys are unique, there's never going to be equality in the first position. So it really is going to just sort this by key and never get to looking at the values. You could make a list of tuples that had duplicates in the first position, like we did before, but given that this is coming from a dictionary, the first thing is always going to be unique and distinct. And so if we say sorted(d.items()), we're passing this stuff into sorted, sorted is going to move stuff around and then give us back a sorted version. It's sorted in ascending order based on key, without looking at the value. So the way to see a dictionary sorted by key is just to say sorted(d.items()), and sorted is a function. And this is the kind of loop that you're going to write to do that. We did this before: we took sorted and we got these sorted by key. So you can make this nice and simple: for k, v. By the way, you can leave off the parentheses here, and I think it's prettier without them, but you could put parentheses; this is still a tuple without the parentheses. For k, v in sorted(d.items()): that says go through d.items(), but before I go through them, please sort them. So that means k is going to go through a, b, and c deterministically, every single time, and of course v is going to go through the corresponding values. So now we can print this out nicely, sorted by key, and that's a real nice, succinct little way to say that. Again, this is one of the kinds of things that people really like about Python, which is
that you can do pretty powerful things with code that's easy to understand. I mean, you might be seeing this for the first time, but eventually you look at that and you're like, oh yeah, I see exactly what that's doing. Easy, not hard at all. But let's say we're looking for the most common word, which we have been for weeks and weeks now. So we want to sort by value, not key. This is an example of where we're going to imagine a data structure, write code to construct that data structure, and then that's going to make our problem easy. So this is an example of using cleverly constructed data structures, and the data structure that we're going to create is a list of tuples where the value is first and the key is second. With items you get key, value; I want value, key. So let's take a look at this code; take your time and get it right. So k, v goes through c.items(). Well, that is unsorted and is going to go through a, b, and c in whatever order. And we're going to make a new list, so this is a data structure that we're creating temporarily. This is a list, and we are going to append to that list a tuple, so this is going to be a list of tuples. Except we're not going to append them in key, value order. We're going to flip them: the first part of the tuple is going to be the value and the second part is going to be the key. So we end up with this, our temporary data structure that we have constructed to make our job really easy. It ends up being 10, a, then 22, c, then 1, b. We just took this order and flipped each pair around. And so now we have this nice little list sitting in memory in a variable, and the rest is really simple. We can say, oh look, we can use sorted, and now we can sort by the values, because they're the first thing. sorted doesn't know how we produced this list.
It just looks at that and says, oh, that's a list of tuples, I'm going to sort by looking at the first item in each tuple. And I'm going to add reverse=True so I get a descending sort, so the value that is highest ends up being first. I just sort it and then reassign it back into temp, and I'll print this out. And so now you see it's sorted in descending order of value. It's value, key, value, key, value, key, but it's sorted in descending order. Okay, so that's an example: if I make a data structure where I've flipped those things around, I can use sorted to sort by value. There are many other ways you could do it, but this is sort of the more elegant way of doing it. And the clever bit here is: make a new list and make it be a little bit different. Okay. So here we're going to print out the top 10 most common words in a file. Most of this code is review. If we take a look at it, we're going to open a file. We're going to start a dictionary for our counting. There are going to be words and lines, so we're going to have a for loop, and this for loop is going to go through each line. Then of course we're going to split the line, which is busting it into pieces. And then we have a for loop within that, and this for loop is going to go through each word. By nesting these loops, we're going through each line, and within the line we're going through each word. Then we go to the next line and go through its words. And this line of code, counts[word] = counts.get(word, 0) + 1, is our idiom for making a histogram. This line right here is an idiom. If you don't know already what that is, go back to the previous dictionary lecture and understand it, because you're just going to use it over and over again. So now at this point, and I always like drawing horizontal lines in code when we
write it, at this point, coming through here, counts is right. counts is the histogram. It's not sorted, so now we want to sort it. We're going to make a new list. We're going to loop through key, value, and then we're going to make a tuple. I'm making this be two lines to make it literal: the tuple is value, key, so I'm flipping the order of these things. That's making a tuple, and then I'm appending that tuple to the list. Okay, so at the end of this, we have a list of tuples in value, key order: v, k, v, k. So at this point, coming through here, in my lst variable I've got this really useful bit of data that I produced, and then I'm like, oh, now it's ready to be sorted. Poof, sort. So take lst, sort it into descending order, and then stick that back in lst. Now we want to print it out, but we don't want to print it as is. We've got a nice sorted list coming down here, but it's in v, k order. It's sorted, and we know that the highest value is here at the top, on down. So we're going to run through this new list, only the first 10: start at the beginning, up to but not including number 10, which is the first 10. For val, key in: this is the pair of iteration variables that's going to go through each of these tuples, on down, and then we're just going to print it out, flipping it. So we re-flip it, flip, flip; we print out key, value, and it's going to work. Okay, so that is one way of doing this. And this next slide you absolutely do not need to figure out. Some of you will look at it and be like, why didn't you show us that in the beginning? And others of you will be like, no, no, no, keep telling me this stuff here. I don't know exactly the term for this, but this is a very procedural approach.
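The program described above can be sketched as follows. To keep the sketch self-contained it loops over a small list of strings instead of opening a file; those three lines of text are a stand-in written for this example, and the variable names follow the ones spoken in the lecture where possible.

```python
# Stand-in for the lines of a file (the lecture reads them from an open file).
lines = [
    'the clown ran after the car',
    'and the car ran into the tent',
    'and the tent fell down on the clown and the car',
]

# Build the histogram: word -> count.
counts = dict()
for line in lines:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

# Flip each (key, value) pair into (value, key) so sorting uses the count.
lst = list()
for key, val in counts.items():
    newtup = (val, key)
    lst.append(newtup)

# Highest counts first.
lst = sorted(lst, reverse=True)

# Take the first ten and flip back to key, value for printing.
top = []
for val, key in lst[:10]:
    print(key, val)
    top.append((key, val))
```

Ties on the count are broken by the word itself, in reverse alphabetical order here because of reverse=True.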
This is a classic algorithms-and-data-structures approach to solving this problem. This next thing uses what is called a list comprehension, and it kind of creates what I call a closed form, where you do it all in one statement and there's all this implicit stuff going on. So if you don't get this right away, don't worry too much about it. But roughly, this single line does everything that the bottom half of that program does. If we go back to here, it's pretty much all of this, done in one line. It doesn't create the counts, and it doesn't print out the top 10, but it does everything in that middle bit. So let's take a look at this. We're going to collapse it all down. We have a print, with its parentheses out to the end. And then we have sorted, and remember that sorted takes a list as input and returns us a list, so that's not too bad, and we'll print the return from sorted. And then this is the fun part. This is called a list comprehension. We have square brackets, and we say to Python, this is a list. But instead of listing things out, like a constant of one, two, three, or doing append, append, append, we're going to write an expression that will act as a generator for all the elements. So this basically says, this is a list of two-tuples v, k, and then this is sort of implied: for all k, v in c.items(). This is like a for loop that is driving it. Think of this like stamp, stamp, stamp, stamp, however many times it has to make a stamp.
And so that's producing a list. It just manufactures this list, and the list is manufactured in the moment. It's not put in a variable. Python makes that list according to the stamping pattern that you've told it to use, and then it passes that stamped-out list, without even storing it in a variable, into sorted. sorted moves the list around, because it is just a list of tuples, and then gives us back the sorted list. I didn't put reverse=True on here, so you see that this is sorted in ascending order, by value, and I did that all in one little statement. So this is also one of the beautiful things about Python, that you can build these things, and you can build more complex versions of this, and there are a lot of real elegant things that you can do in Python that are really succinct. You should be careful, because in the beginning I think the longer version is easier to understand, even though after a while you're like, wait a sec, why am I putting all these extra lines in, because this is not so hard to understand. But at some point you will want to master this more powerful and more succinct version of Python, which expresses things in terms of the data you want to see rather than the steps you want to take. So this sort of finishes up tuples. We've done a bunch of stuff.
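A minimal sketch of the one-statement version described above, using the small a, b, c dictionary from earlier in the lecture. The second call, with reverse=True, is an added variation for comparison and is not part of the statement on the slide.

```python
c = {'a': 10, 'b': 1, 'c': 22}

# The list comprehension stamps out one (value, key) tuple per item,
# and sorted() receives that list without it ever being named.
result = sorted([(v, k) for k, v in c.items()])
print(result)

# Same idea, highest value first.
result_desc = sorted([(v, k) for k, v in c.items()], reverse=True)
print(result_desc)
```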
I mean, really, they're simple and elegant. Tuples, lists, and dictionaries are all related. They're really three foundational data structures, three foundational collections of Python, and we combine them in a lot of different ways.

And now in this little bit of lesson, we're going to talk about tuples. We're going to create a list of the most common words and find out how to sort a dictionary by value instead of by key. We're going to use the clown.txt file and the intro.txt file, and I'm going to start with the code from exercise nine that I just did, from chapter nine. It's not exactly one of the exercises, but it's very similar to them. I'm going to make a copy and keep it in the same folder. I'm going to keep it in the ex 09 folder and just call it ex 10, because this code is going to do much of the same stuff and it's going to read the same files. So I've got myself exercise 10. Exercise nine is still here; exercise 10 is now what I'm editing, but I'm in the exercise nine folder. In exercise nine we looked for the most common word, but now we want to find the five most common words, which is going to require us to sort. So I'm going to get rid of that code, because it's not really how we're going to do it. There we manually looped through and found the maximum. So I'm going to just run this: cd into Desktop, Python for Everybody, ex 09. And if I do an ls, you see that I've got the ex 09 .py file and intro.txt. So I'll run python3 on the ex 10 .py file and give it the clown data, and we see the dictionary, and it's properly making it in this code right here, which doesn't change. It reads the file, reads all the lines, goes through and splits each line into words, and then goes through the words and does the idiom of using the dictionary's get to maintain the counters, and we print it out at the very end. So the new code we're going to write is down here. Okay, so let's first do a few things. If I say x is
equal to the dictionary's items(), and print x, this gives us a list of the key-value pairs. Printing the dictionary itself prints out the dictionary, but if we do it this way and use items(), it gives us the key-value pairs. Okay, and so that's what we got: key-value pairs. Now, we can sort this, because tuples can be compared. This can be compared with this, and because d is lower than r, this one is lower; this whole ran tuple comes after the down tuple. So we can sort this whole thing, and I'll do this by just putting the word sorted here, to say give me a sorted version of that. Now it's going to do it based on the order of the tuples, and this first position takes precedence over this second one. So if I print it this way and run it again, you'll see that it's sorted now: after, and, car. It's in alphabetical order by key. And so we could actually print the first five, up to but not including five, by adding a list slice here. That will show you only the first five. Except that's not what we're trying to do; we really want to sort by this, the count. Okay. So we have this mechanism that can take a list and sort it based on the tuple values. If we could create a list where it was 1, after instead of after, 1, and otherwise the exact same thing, then we could sort it and it would be fine. So let me show you at least one way to do that. Get rid of this. We're going to hand-construct a list, and I'll just call it temp: temp equals a new list. And then for k, v in the dictionary's items(), and I'll just start by printing k, v so we see. And this is where it's really nice to do these with the clown file first and only test on the bigger file later. So it's pretty much the same thing. We are going through in key, value order, which is dictionary order, which is not sorted at all. Okay. Now we're done just printing this out.
Instead, let me do this in a couple of steps: make a new tuple. I'll just call it newt: newt equals (v, k). So I'm saying make a new tuple with two items in it, and I'm going to make the value first and the key second. Then I'm going to say temp.append(newt), the new tuple. So I'm going to end up with a list of tuples. Let me comment this one out, and then when I'm done here I'm going to print temp. So if I run it on clown.txt, you see what happens in temp. It's not sorted, it's flipped. Let me label that print. Okay, that's the flipped one. So it's flipped, and all we did is make it, instead of car, 3, it's 3, car. But now we have a list. So now it's flipped, and now we can sort it. We can say temp equals sorted(temp). That says take temp, sort it, and give it back to me. And now I'm going to print it with the label sorted. Okay, so here's the first print, when we flipped it. We've got 2, tent, but it's not sorted at all. After we sorted it, it's sorted by tuple, and the lowest is 1, after. You'll notice that one is the same as one, so it checked the second item in the tuple: down comes after after, and fell comes after down, then into, in alphabetical order. Then we get the twos. All the ones sort there and then the twos come here, and within the twos it's sorted in alphabetical order, because, like a string, if the first character matches, it looks to the second character. And then we see, oh, here we go, the threes, and then the one we actually wanted, the highest one, the seven. You'll notice that we want the highest one, not the lowest one. So we can just tell sorted, with this parameter reverse=True, hey sorted, do this backwards; do it from highest to lowest rather than lowest to highest. And now our sorted one
says 7, the, etc. Okay, and since we want the top five, we can slice: up to but not including five. So this is now the top five. If there's a tie, the tied words come out in reverse alphabetical order, but let's not worry about that too much for now. So it makes a flipped list, then it sorts the flipped list. Now, if I wanted to print it out nicer, I could loop through this new list. I could say for v, k, and remember, this is a flipped list, so what's coming out of this list is that each tuple is value, key, in temp, and I'm only going to go up to but not including five, so the first five. I'm pulling them back out as value, key because that's what they are: value, key, value, key, value, key. So v is going to go through these and k is going to go through these, and then I'm just going to print k, v. This is kind of my flipping back, because I want to see them this way. And there's the most common one, and car, 3. So it's just going through this, up through the fifth one, and printing them out. Okay, so let me comment this out, let me comment that out, and let me just delete this. Let me comment out the dictionary print too. So we have a dictionary. We make a list, and we make these reversed tuples where we have the value first and the key second. We're setting it up so the sort's going to work, and then once it's sorted we have to flip them back. So we flip them from key, value to value, key for sorting, we do the sort, then we flip them back to key, value and print them out. And it works fine. So let's try our big file, intro.txt. And there you go: those are the five most common words in intro.txt. So you might ask yourself, why did we use tuples? We really could have used lists for this. But tuples are more efficient than lists, and you notice that we weren't going to modify them. We did modify the temp list.
It's a list of tuples, but the tuples within the list we weren't going to modify, and we tend not to make lists if we can get away with using tuples. And so that's why we made this flipped tuple thing. Okay, so I hope that was useful to you. I hope to see you on the net.

Hello and welcome to chapter 11, regular expressions. The fun thing about this chapter is that it's unlike the rest of the chapters. You sort of had to really understand every single thing in chapters one through 10, which built on one another, but you can really get along without using chapter 11. It's not really a required topic, but it's a fun topic and an interesting topic. So you can relax a little bit and realize that you may or may not like regular expressions, and if you don't like them, that's okay. You don't have to use them. You can go your whole life without using regular expressions. The idea of a regular expression is that you come up with a language, a little character-based programming language, where you can do smart searching and, as you'll see in a bit, smart extraction. It's really almost programmable wildcard expressions. There's no explicit looping, but there is looping, and there's all this implicit stuff going on. You say, look for patterns that look like this, and then you get back the things that match those patterns. We do searching for everything. We're looking through large blocks of text and saying, go find me everything that has the word Python in it, or something like that. That's just such a common thing to do, and regular expressions are a very structured way to go about searching for information. They're very powerful, but they're also very cryptic, and you may not like them. But they're a lot of fun, actually, once you understand them, and learning how to program with them takes a while. Writing good regular expression programs requires some try it, play with it,
check it, try it, check it, try it, check it. But once you get them, they're really quite cool. It's a very old programming language. It comes almost from the 1960s; the concept comes from the theory of computing, where they were trying to come up with a theory of languages, and regular expressions were one form of language that computers could understand. So it has some fun old words. And one of the advantages of knowing regular expressions is that you're kind of a cool person. You can take a quick look at this xkcd that sort of captures the devil-may-care awesome power that regular expressions have. And while we're talking about awesome, I do want to take this moment and show you my awesome tattoos. You may not know this, but I've got a couple of tattoos here. Here's the first tattoo. This is where I got my PhD, and this is my University of Michigan faculty position. I got a PhD in engineering, and I teach in a school of information and library science. And then I have this other tattoo, and this tattoo is what I call the ring of compliance. I work on learning management systems and educational technology and standards, and there's this standard called Learning Tools Interoperability. If you're taking this course and doing the autograder, it uses Learning Tools Interoperability to integrate into whatever learning management system you happen to be using. One of those learning management systems is the open source learning management system that I helped write, called Sakai, and these are the rest of the major vendors. The idea of that tattoo was that I would add the logo of every vendor that would comply with Learning Tools Interoperability. And so you'll notice Coursera. I helped Coursera put Learning Tools Interoperability in, and so the autograders integrate into Coursera, Blackboard, Canvas, Sakai, Moodle, or other things.
So it's just a cool techno thing, just like regular expressions.

I've got a URL here for a regular expression quick guide. You might want to print it out so that you can look at it while you're watching this lecture, because this is a little programming language, except that it's character based, not line based and not keyword based. It has certain active characters, where the character means something rather than representing the character itself. The regular expression library is not part of base Python, but it's distributed with Python, so you have to put an import re at the top, which says to pull in the regular expression library. There are a couple of functions inside it: re.search, which is kind of like a really smart version of the find method inside of strings, and re.findall, which is kind of like stamping your way through a string, finding all of the things that match a particular pattern, and then extracting those. We'll talk about both of these in this lecture.

So here's a really simple piece of code where I'm just going to show you a before and after. Here we're looking for lines that have From: in them. We open a file, we loop through the whole file, we strip each line of text, and then we say: if line.find of From: is greater than or equal to zero, print it. find gives you negative one if the string is not found. So it reads all the lines, and once in a while it prints one out. That's kind of like a needle in the haystack.

To use regular expressions to do that, we have to import the regular expression library. These lines are the same: we loop through and we strip. Now we say if re.search. The way to read this is: within the regular expression library, go find the search function, and search for the string From: in the string line. Okay, so this is the line to
search, whereas here it was more object oriented, where we say line.find. Here we say re.search and we pass in line as a parameter. These two things are equivalent, which means most of the time it's going to keep running, and once in a while it will hit a line and print that out, and then it will finish the whole thing. So that is doing with regular expressions what we would do with the find operation.

Now, searching with regular expressions has these special characters. Here we have the same basic code, except now we're saying if line.startswith From:. We're not using find anymore, and that way we only match in the first position, not something like "blah blah blah From:". We don't want that to match; we only want to match at the beginning of the line, so we use line.startswith. It does the same thing: it finds lines that have the prefix, prints those out, and then is done.

With regular expression search, we don't change the method. With strings we have a certain number of things we can do based on the methods that are built in, but with a regular expression we can actually turn this first parameter into code. What's happening here is the caret. If you go back to your cheat sheet, caret means the beginning of the line. It's a virtual character that matches the beginning of the line. It's like a From: that starts at the beginning.
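A minimal sketch of the before-and-after comparison described above. The lecture reads a mail file; the three sample lines here are made up so the example runs on its own.

```python
import re

# Hypothetical sample lines standing in for the mail file used in the lecture.
lines = [
    "From: someone@example.com",
    "Subject: blah blah blah From: somewhere else",
    "X-DSPAM-Result: Innocent",
]

# String method: "From:" anywhere in the line (find returns -1 when absent).
with_find = [line for line in lines if line.find("From:") >= 0]

# Regular expression equivalent of find: search for "From:" anywhere.
with_search = [line for line in lines if re.search("From:", line)]

# String method: only lines that begin with "From:".
with_startswith = [line for line in lines if line.startswith("From:")]

# Regular expression equivalent: the caret anchors the match at the start.
with_caret = [line for line in lines if re.search("^From:", line)]

print(with_find)
print(with_caret)
```

The only thing that changes between the last two regular expression versions is the pattern string, which is the point the lecture makes about "programming" the first parameter.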
So From: at the beginning does match, and From: in the middle does not match, because we put that little caret there. Same thing: line is what we're searching, and caret From:, meaning From: at the beginning of the line, is what we're looking for. Again, it does the exact same thing: it only prints lines that have From: as the first characters in the line. The difference is that in one we look for a method, and in the other we program the regular expression. We're going to run out of methods in the string class long before we run out of things that we can do with regular expressions.

A couple of other special characters: caret matches the beginning of the line. This capital X matches itself. Dot is a wildcard that matches any character. And some of the characters in regular expressions modify the immediately preceding character, as the asterisk does here. So this says: look for a line that starts with X and then has any number of characters (that's these two things together, zero or more characters) followed by a colon. You can see that it's like an expanding stamp. It's like, "Oh, there's an X at the beginning of the line. That line looks good. I've got some characters here and then I've got a colon.
That's good." So this is an X, some characters, and a colon: check. X, some characters, and a colon: check. And away we go. That's what's going to match. You can see how some of these characters are special, and again, go back to your cheat sheet. Some of them are special and some of them are actual characters. This colon and X are not special; they're just the characters.

Now, sometimes you want to be a little more precise in your match. Let's take a look at the lines that match the pattern we just did. We have these X dash lines, like X-DSPAM-Result:, which are from mail messages, and then one of the mail messages has a line that says X-Plane is behind schedule, and this matches. Is that what you really wanted? Because this is an X, this is some number of characters, and that's a colon, it has to match. This rule applied to this line results in a yes. So how can you be a little more precise about what you want to match and what you don't want to match?

We can write more code. Now what we say is: we want to match the beginning of the line, then a capital X, then a dash. So now we match those first two characters, X dash, at the beginning of the line. Caret X dash says the first two characters of the line must be X dash. Now we have another special character, and again refer to your cheat sheet: backslash capital S means a non-whitespace character, any character other than whitespace. Then plus means one or more times. One or more non-whitespace characters: that's what this whole thing says, followed by a colon, which is just a character.
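Here is a small sketch of the loose pattern and the more precise one side by side. The sample lines are assumptions modeled on the ones described in the lecture, not the actual file contents.

```python
import re

# Hypothetical lines modeled on the examples described in the lecture.
lines = [
    "X-DSPAM-Result: Innocent",
    "X-Plane is behind schedule: two weeks",
    "Subject: nothing to see here",
]

# ^X.*:   -> starts with X, then any characters (zero or more), then a colon.
loose = [line for line in lines if re.search(r"^X.*:", line)]

# ^X-\S+: -> starts with X-, then one or more NON-whitespace characters,
#            then a colon. A space before the colon prevents a match.
strict = [line for line in lines if re.search(r"^X-\S+:", line)]

print(loose)   # both X lines match the loose pattern
print(strict)  # only the header-style line matches the strict pattern
```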
So now we have X dash, followed by one or more non-whitespace characters, followed by a colon. This line: X dash, one or more non-whitespace characters, a colon. Here we have X dash followed by one or more... whoops, there's a space there. So this doesn't match, even though there's a colon there. The pattern means that between the dash and the colon you can only have some number of non-whitespace characters. So this is a no; it does not match. If you didn't want to match this line, you create a more precise pattern. We could even have a pattern that said, "I want X dash followed by an uppercase letter," if you wanted to. There's all kinds of fine tuning you can do once you learn the structure. So that's matching, where you take a whole line and a template and decide whether the template matches anywhere in that line.

Now what we're going to do is use this to actually pull data out of strings using the regular expression library. We're going to move from merely matching to matching and extracting. We're going to say, "Hey, I would like you not only to take this template, this regular expression pattern, and run it across the line; I want you to give me all the pieces that match, and I want a list of those." That's what we use findall for. search gives, in effect, a true or false; findall gives a list of all the strings that match. If there are four of them, you get four things in the list. If nothing matches, you get an empty list.

So let's take a look at what we've got going here. Instead of calling search, we call findall. We still pass in the string that we're looking through, and we have our little template pattern. This is a new bit of regular expression: any square bracket operation is one character.
It matches just one character, but in between the brackets is a set of allowed characters. So zero dash nine means a single digit: zero, one, two, three, four, five, six, seven, eight, or nine. But that's really one character. Then the plus applies to that, which means if we look at this whole thing, it says one or more digits. That's the code we write in a regular expression that says one or more digits, and we're just going to use that in our regular expression by itself. So we're going to look for any string that's one or more digits, pull it out, and give it back. So it's going to look. That's my little template: stamp, stamp, stamp, stamp. Oh, got it. Stamp, stamp, stamp, stamp, stamp. Oh, got it. Stamp, stamp, stamp, stamp. Got it. So what we get back after we ask findall to find all of the one-or-more-digit strings is two, nineteen, and forty-two. It actually parsed it, split it, found all these things and said, "I found them all for you, and here they are." It's a list of three strings because that's how many it found. It might have found none, and we would get an empty list at that point, but it found some.

So just as an example: we did this and got two, nineteen, and forty-two. But if I give it this pattern, that is an uppercase vowel, A, E, I, O, or U, so that's one letter, and then one or more of them. So something like AA would match, EI would match, OO would match. Now it's looking for a minimum of one uppercase A, E, I, O, or U, that set of characters, one or more times. And it's like, "Do you find one? Oh, there's an uppercase, but it's an M. No, no, no. No uppercase, no uppercase, no uppercase." It found nothing. So it gives us back an empty list. It's like, "Find all the things that match this," and the answer is none matched.
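A minimal sketch of the two findall calls just described. The sample sentence is an assumption chosen so that it contains the three numbers mentioned and a capital M but no uppercase vowels.

```python
import re

# Assumed sample text: three runs of digits, one uppercase letter (M), no uppercase vowels.
x = "My 2 favorite numbers are 19 and 42"

# [0-9]+ : one or more digits; findall returns every non-overlapping match as a string.
numbers = re.findall("[0-9]+", x)

# [AEIOU]+ : one or more uppercase vowels; nothing in x qualifies.
vowels = re.findall("[AEIOU]+", x)

print(numbers)  # ['2', '19', '42']
print(vowels)   # []

# An empty list is falsy, so "did anything match?" is a simple truth test.
if not vowels:
    print("no uppercase vowels found")
```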
Here's your list of nothing. And so that's how you have to check whether you got something, because it's not going to return false; it returns a list with no items in it.

Now, the way it works, like I said, is that it takes this template and stamps it across the characters of the line. There's a behavior that might not be intuitive to you at the very beginning: the notion of what we call greedy matching. That is, when it can match more than one possible overlapping string, it chooses the largest of the overlapping strings. The easiest way to show this is with an example. We're saying, "I want something that starts with an F, has one or more characters, and ends with a colon." That's my little stamp, my template. So it starts with an F; good. One or more characters, da da da da, then a colon. So that could be From:. That would match. But look, I've got another colon here, and this is just continuing on with one or more characters. So the question is, do we get this part or do we get this part? The answer with greedy matching is that we get the larger of the two. So what you get back is somewhat counterintuitive: you get the whole thing as the match, From: Using the :, when we could have gotten From:. It picks this one because it's longer. Anytime it has a choice, it picks the longer one, and that's what greedy means. It's probably better described as "larger" or "tending toward the longest string," or something like that.

You can of course suppress this behavior. Like everything in programming regular expressions, you simply add another character. So now it's going to say, "I would like to start with the letter F, then any character one or more times," and then this question mark.
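The greedy behavior described above, and the question-mark variant the lecture turns to next, can be sketched as follows. The sample string is an assumption built to contain two colons.

```python
import re

# Assumed sample string with two colons, so two different matches are possible.
x = "From: Using the : character"

# Greedy: .+ grabs as much as it can while still letting the final colon match.
greedy = re.findall("^F.+:", x)

# Non-greedy: .+? grabs as little as it can, so the match ends at the first colon.
non_greedy = re.findall("^F.+?:", x)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```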
This is still one little thing: non-greedy. That just says do it not greedy, which means it prefers the shorter of the strings. It could still match this string or this string, but because it's been told not to be greedy, it chooses this string instead, and that's the string we get. So that's not greedy, and you just add the question mark after the asterisk or the plus. It's usually an asterisk question mark or a plus question mark, a two-character thing. That's zero or more characters, non-greedy, and that's one or more characters, non-greedy. Actually, most of the time it seems to me that non-greedy would be the more reasonable default, but that's not how it is. Greedy is the default and non-greedy is optional.

Now we can play some more with this stuff. Let's take a look at this little example where we have non-blank characters, backslash capital S, one or more of those, followed by an at sign, and then again one or more non-blank characters. So this is looking for strings that have an at sign with non-blank characters on both sides. It comes to this at sign and goes out both ways, and it does it in a greedy manner. If you told it not to be greedy, it would stop as soon as it could, but we're telling it to go greedy, so it goes all the way to here and stops at this blank, and stops at this blank. That's a nice little thing: find the at sign, go out to the first blank on each side, and pull that stuff out. With one little match you pulled this thing out. Now, of course, we've done that before with other techniques, so this is just another way to pull stuff out.

We get this whole thing, but what if that's not exactly what we wanted? We can give it a matching string that's different from the extracting string by adding parentheses. And so here's another example where we basically say: this is our
string. We want to match From at the beginning, followed by a space, followed by (ignore the parentheses for a minute) one or more non-blank characters, followed by an at sign, followed by one or more non-blank characters. If there's no From, it's not going to match, right? It demands that the From is here, and the space is demanded as well. Then it says, "Oh, non-blank characters, great. I've got an at sign, great. Non-blank characters. Oops, stop there." So this is what's going to match.

Now the key is that we don't actually want all of that back in our extraction. What we really want back is this part right here. So we put parentheses in. Parentheses are code in the regular expression world; they say "start your extraction" and "end your extraction." Without the parentheses, you would get the whole match back, From and all. But with the parentheses, you match the From but only this bit comes out. So you can add to the pattern to make the matching part more precise without changing what you get returned, and you specify what you want returned with the parentheses.

So next I want to show you a couple of different ways to use these newfound skills in some more practical applications of regular expressions. Let's go back to the way we first tore apart strings, and look at the situation where, if you recall, we just wanted the host name, right?
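Before moving on, here is a minimal sketch of the parentheses idea from above: the same match, with and without an extraction group. The sample line and its address are made up for illustration.

```python
import re

# Hypothetical mail header line; the address is made up.
x = "From someone@example.com Sat Jan  5 09:14:16 2008"

# No parentheses, no From requirement: any "non-blanks @ non-blanks" run.
anywhere = re.findall(r"\S+@\S+", x)

# From is required, and with no parentheses the whole match comes back.
whole_match = re.findall(r"^From \S+@\S+", x)

# From is still required, but the parentheses mark what to extract.
extracted = re.findall(r"^From (\S+@\S+)", x)

print(anywhere)     # ['someone@example.com']
print(whole_match)  # ['From someone@example.com']
print(extracted)    # ['someone@example.com']
```

The third pattern matches exactly the same text as the second; only what findall hands back is different.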
This is an email address and we're interested in the host name. So we have this string, and we go find the at sign. The find looks it up and tells us the at sign is at position 21. Then we say, okay, let's look beyond there to the space, and that tells us the space is at position 31. Then we can extract starting just beyond the at sign, up to but not including the space, by slicing from atpos plus one, colon, to the space position. We still have to have something that decides to only look at From lines, but then it can print out the host that it's extracting. So that was one way we did it.

The next way we did this was the double split pattern. We said, okay, let's take this line and break it into words based on spaces. That's what words is, so that's 0, 1, 2, 3, 4, 5, 6. We know that the email address on lines that start with From and a space is the second one, so we pull that out into email. Then we split that again based on the at sign. It splits right there, and this becomes positions zero and one in pieces. pieces sub one is the host, and if we print that out, we get the host. So that's the double split pattern.

The nice thing about it is that you don't have to keep track of the little plus ones. In the previous one it's kind of annoying to use the space position; that's just hard to remember. I've written this code way too many times in my career, and I've made mistakes, and I have to debug it every single time. I print all these numbers out and I'm like, "Did I get it right? Oh, I did it in Python. I did it in Java. I did it in C. Wait a second. I did it differently."
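The two string-based approaches recalled above can be sketched like this. The sample line is made up, so the positions find returns here are not the 21 and 31 from the lecture's slide.

```python
# Hypothetical mail header line; the address is made up.
data = "From someone@example.com Sat Jan  5 09:14:16 2008"

# Approach 1: find the at sign, find the next space after it, then slice.
atpos = data.find("@")
sppos = data.find(" ", atpos)      # start looking at the at sign
host_by_slice = data[atpos + 1:sppos]  # just past the @, up to but not including the space

# Approach 2: the double split pattern.
words = data.split()               # split on whitespace
email = words[1]                   # second word is the address
pieces = email.split("@")          # ['someone', 'example.com']
host_by_split = pieces[1]

print(host_by_slice)
print(host_by_split)
```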
So this is a lot cleaner. I can write this every time and I know it's going to work every time. I barely even need to test this code because it's so obvious. So double split is another way of extracting stuff.

But if we look at this with regular expressions, we can say, okay, let's use a regular expression to do this. We start looking through the string until we find an at sign. Then we start extracting with the parentheses. Once we have found the at sign, we look for non-blank characters. This is a set of characters, and the caret as the first one means not. So that's another way to say non-blank: a set of characters that is everything but blank. That's what this little bit is saying. Star means zero or more times, which means it's going to run, run, run until it finds a blank, which stops it. The greediness is what keeps pushing it. This is a greedy match; that asterisk is greedy because there's no question mark after it. So it starts at the at sign with the open parenthesis, goes to the space, and that's the close parenthesis, and that's what prints out. Now, y is going to be a one-item list that has the string we're looking for in it, but you can just use sub zero to get it right out of there.

So that's the regular expression version of it, but we can make this more fine-tuned. We can say we also want to pick the line. If we don't get the right kind of line, we want to skip it, and if we do, we want to extract the data, and we can do this all in a single regular expression. So again, we say start from the beginning of the line, and it's got to be a From followed by a space, followed by any number of characters, dot star, followed by an at sign.
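A minimal sketch of the regular expression version just described, together with the anchored variant the lecture is in the middle of introducing. Both sample lines are made up; the second is there only to show a line the looser pattern would also trigger on.

```python
import re

# Hypothetical lines; the addresses are made up.
from_line = "From someone@example.com Sat Jan  5 09:14:16 2008"
other_line = "Please reply to other@example.org by Friday"

# Find an at sign, then extract zero or more non-blank characters after it.
# Inside brackets, a leading caret means "not", so [^ ] is "anything but a space".
y = re.findall("@([^ ]*)", from_line)
host = y[0]  # findall returns a list, so take item zero

# The same pattern also triggers on a line we may not care about.
loose_on_other = re.findall("@([^ ]*)", other_line)

# Anchored version: the line must start with "From ", then anything up to
# an at sign, and only the part in parentheses is extracted.
strict_on_from = re.findall("^From .*@([^ ]*)", from_line)
strict_on_other = re.findall("^From .*@([^ ]*)", other_line)

print(host)             # example.com
print(loose_on_other)   # ['example.org']
print(strict_on_other)  # []
```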
So this has to match. We see a space, then we have any number of characters, then we see an at sign, and then we start extracting. Then we go non-blank, non-blank, non-blank, non-blank, non-blank, up to a blank, and stop extracting, and out that comes. This has an advantage over the previous one in that it's much more precise. If we look at the previous one, while it works on good lines, it might actually trigger on lines that we don't want to see. This allows us to refine it so it only does this to lines we care about. It's both an if statement and a splitting and extracting, all at the same time, by having a bigger string that we're matching than we're extracting. It's a way to clean up your data.

So here is a simple program where we put all this together and actually accomplish something. We're going to read through a file and look for lines that have this form, extract this number, convert it to a float, and compute the maximum. So we open a file, we write a for loop, and we strip, so we do this for every line of the file. But the first thing we want to do is discard all the lines except the ones that have this. So our regular expression is: look for lines that start with X-DSPAM-Confidence colon. That's a pretty strong match. If that's not there, we're not going to get anything. And then there's a space.
There's a space, and then start extracting, and then go as long as there are digits and dots. That set is a single character, the plus makes it one or more, and then stop extracting. So that says: start extracting, da da da da, greedy, greedy, greedy, greedy, stop extracting. That's what we're going to get. If the line doesn't have this, meaning something is missing in some way, whether it's this prefix or this number (if the number is missing it's going to fail too), we get back an empty list. So the first thing you have to do is check whether you actually got a match. You say: if the number of items in the list, len of stuff, is not equal to one, continue. This is "skip all the lines that don't match." Skip, skip, skip, skip. There could be thousands of lines that don't match. But when the match hits, it falls through. Most of the lines skip back up, but when we actually get one, we know instantly that we've got one, and stuff sub zero, because that's what we extracted, is this number. We take the floating point value of it and append it to our list. We made a list to store them; as the loop runs, the list grows. Then we just ask what the largest one was, and you can run this and see.

We also have an escape character. The idea is that sometimes we actually want to search for one of these little special characters that mean so much to the regular expression. So what if we want to search for a dollar sign? We just prefix it with a backslash, and that means this is a real dollar sign. So this says, "I would like a dollar sign followed by one or more digits or dots." Remember, this is a set: zero dash nine, or dot. That's the set of legitimate characters.
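Here is a sketch of the program just described, plus the escaped dollar sign. A short list of made-up lines stands in for the file, and the sentence with the dollar amount is also an assumption.

```python
import re

# Hypothetical lines standing in for the mail file read in the lecture.
lines = [
    "X-DSPAM-Confidence: 0.8475\n",
    "Subject: hello\n",
    "X-DSPAM-Confidence: 0.9907\n",
    "X-DSPAM-Confidence: \n",   # number missing, so this line must be skipped
]

numlist = []
for line in lines:
    line = line.rstrip()
    # Prefix must match; the parentheses extract one or more digits or dots.
    stuff = re.findall(r"^X-DSPAM-Confidence: ([0-9.]+)", line)
    if len(stuff) != 1:
        continue                # skip every line that did not match
    numlist.append(float(stuff[0]))

print("Maximum:", max(numlist))

# Escaping: \$ means a real dollar sign rather than the special "end of line".
x = "We just received $10.00 for cookies."
greedy_amount = re.findall(r"\$[0-9.]+", x)       # keeps going until the space
non_greedy_amount = re.findall(r"\$[0-9.]+?", x)  # stops after one character

print(greedy_amount)
print(non_greedy_amount)
```

Comparing the last two results shows why greedy matching is the useful behavior for pulling out a whole number.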
This is a range of characters. That's a shortcut for making the set; you could write it as zero, one, two, three, four, five, six, seven, eight, nine, dot, or as zero dash nine, dot. And that's one or more. Then it stops because this is a space. It's greedy matching, and it pulls this out. That's kind of why greedy has to be the default, because if it wasn't doing greedy matching, it would stop here. Non-greedy would find a dollar sign and one character, and it would give us dollar one rather than dollar ten.

So in summary, regular expressions are a cryptic but powerful language, and they're an acquired taste. I bet eventually you'll find them fun, even though on first impression you might not think they're so fun.

Welcome to networked programs. This is chapter 12. We're going to learn a little bit about how we talk to resources on the network using Python. This is a really quick introduction to how the network really works. I have a whole book that I wrote, which is also translated into Spanish, on how the network works, starting at the very lowest layer, packets and everything, right on up, and it's actually really easy to read.
I wrote it for a high school audience. It's a short book and pretty easy to read. If you read that book, you will understand that there is this layered architecture, the TCP/IP architecture, that runs our network. On one side here, this is your computer, and this is a server computer. If you want a web page, the request goes across the network, hopping maybe 15 or 20 times, then goes up into the server, which reads the data, and then the data comes back, 15 or 20 hops for the packets, and then it's shown to you as what you see. So that's how it works, and there are all these layers that we're not going to talk about in this section but that I talk about in that book: the link layer, which is about how to get over one hop, and the internet layer, which is about how to string together, say, 15 or so hops to get packets back and forth. Those are the lower-level bits.

We're going to start at what we call the transport layer. That's the layer where your computer assumes that it can make a phone call to another computer: a process running a program on this computer talks to a program on that computer, and then it comes back. So we're going to leave the lower layers alone. We're going to ignore them and assume that there's this nice reliable pipe that goes from point A to point B, and ask what we're going to do with the pipe. But if you're interested, take a look at the book. So we're going to start with a pipe, some kind of a connection.
We have two processes, and we have some kind of a connection between them. It's a connection that both can use to talk and to listen. In nerd terms we call these things sockets. One process runs on one computer, another process runs on a second computer, and they're connected through the internet somehow. One computer speaks into that socket and it comes out the other end, and the other computer returns something and it comes back. So this is a bidirectional flow of data, in effect a series of data phone calls between applications. The application on your side might be your browser: Chrome, Firefox, Internet Explorer. On the other side is a web server. It might be IIS, Internet something something from Microsoft, or Apache, or Java Tomcat. That's another program, and you are making phone calls between these programs. In general, the servers stay up all the time, and you can just make a request from your program when you feel like it. That's what we're going to do, and this is what we call a socket. That little connection, that data phone call, is what we call a socket.

Now you have to decide which of the systems you're going to talk to, and then which of the services or processes on that system. So we have this concept called port numbers, and they're best thought of like extensions on phones. One organization has one phone number, and it says, "Please enter the extension of the party
you'd like to talk to." Well, that's kind of what ports are. It's like, "I'm a server and I'm connected to the internet. Please enter the extension of the process that you would like to talk to." For example, there might be processes running on various computers. Email is known to hang out on port 25, or extension 25. Insecure login lives on port 23. Insecure web lives on 80 and secure web lives on 443. And there are a couple of different protocols for mail: say you have your mail stored on Gmail and you have a local mail client like Thunderbird or Apple Mail; it talks a protocol to pull that mail across, and those live on various ports. So these ports are those extensions, and by convention we have standards that tell us roughly what to expect at those ports. When you're talking to port 80, you expect to talk to a web server, an HTTP server. If you're talking on port 23, you expect to talk to a telnet server, and on and on. These are the commonly used default extensions for various network application processes that are serving us data. Sometimes you'll go to a URL and see a colon and a number in it. That means it's a web server running on a port other than the official 80 or 443.

Now, in Python we can talk to these sockets, and it's surprisingly easy. We have to import socket, because that's a library. It comes with Python, but you can't use it in your program until you import it. Then you call the socket function in the socket library; that's what that syntax is saying. You're making a socket. Now, the key to a socket is that it's sort of like an unopened file handle.
It's half of a file handle. It's an outward-looking thing that's not yet connected. These parameters you're just going to type in. This says we're going to make a socket that goes across the internet, and it's a stream socket, which means it's a series of characters that come one after another rather than a series of blocks of text. There's another kind that's harder to deal with, but this is the one we're going to use. So don't worry about this line. Just know that it creates a socket but does not connect it. We get back a socket object that I'm storing in the variable mysock. On the very next line, when you want to make a connection across the internet to the far end, you say, "Oh, hey, dear socket, extend yourself across the internet. Make the phone call to this host, data.pr4e.org, on port 80." So that's making the phone call. This is like the phone number and this is like the phone extension. We haven't sent any data yet. We have simply rung the phone of a process, hopefully living on port 80. If it's there, great. This might blow up. This line here won't blow up, but this line here could blow up if there's nothing listening on that port. It would come back and say, "You tried to call and got no answer." That's a legitimate thing to happen. Maybe you don't have a network connection, or maybe that service is down on that server, or the whole server is down.

It's kind of amazing that we're sitting here in Python, and probably half a million engineers have built this thing called the internet, all these protocols and all this software, and we just made use of it in three lines of Python. This is one of the reasons that people absolutely love Python. So now that we have a socket, we have to ask ourselves what kind of data we're going to send and what kind of data we're going to expect to receive across that socket. So now we have a socket.
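A minimal sketch of the socket steps described above. The lecture does this in three straight lines; here the connect call is wrapped in a small helper, ring_phone (a hypothetical name, not from the lecture), so that the "no answer" case is handled instead of crashing. The host and port are the ones named in the lecture, and the timeout value is an assumption.

```python
import socket

# Create the socket: AF_INET means it goes across the internet,
# SOCK_STREAM means a stream of characters one after another.
# Nothing is connected yet; this is the "half of a file handle".
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)


def ring_phone(sock, host="data.pr4e.org", port=80, timeout=5.0):
    """Try to connect; return True on success, False if there is no answer.

    Hypothetical helper for illustration only.
    """
    sock.settimeout(timeout)  # assumed value, so a dead host fails quickly
    try:
        sock.connect((host, port))  # host is the phone number, port the extension
        return True
    except OSError as err:
        # No network, service down, nothing listening on that port, and so on.
        print("No answer:", err)
        return False
```

Calling ring_phone(mysock) makes the actual phone call. No data has been sent at that point; the connection is only open and waiting.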
We're going to talk about what we're going to do with it, right? The socket functions at this level: your application says, "Make me a socket," which is this endpoint, and then the connect actually connects to an application on the far side. There's a port involved, which might be port 80, and this is the far host, which could be py4e.org or data.py4e.org. So the socket is solving this, and the question then is: what are we going to send and what are we going to expect to get back? That's what we call the application protocol. We know that these two have made a phone call. It's no different from making a phone call and saying hello. Everyone knows that when the phone rings and you pick it up, you're supposed to say hello. That's part of our protocol. So who talks first?

The dominant protocol that we use in this section is HTTP, the HyperText Transfer Protocol. It's dominant, and it's really easy to use; that's why I use it as an example. But realize that there are many others, like mail, file transfer, remote login, and all kinds of other protocols. Each is a different application protocol. They all use sockets at their lower level, and then on top of that they layer their own rules of the road. HTTP's are the rules for retrieving hypertext web pages, and we have used it for all kinds of other things. So the protocol, like I said, is like who answers the phone first. What do they say? What happens if the person doesn't answer? "Can you hear me now?"
Those kinds of things. And it's a real simple thing: all you really need is for both sides to agree. You have to write a thing that's like the rules in the middle and say, okay everybody, as long as we all do this, we'll be fine. It's as simple as picking which side of the road the cars drive on. It works fine no matter which side, but if each car randomly picked, it would be really kind of a mess. So if you look at the typical URL, this is one of the things that the web innovators in 1990 really invented that was wonderful. It seems second nature today, but in 1990 it was rather revolutionary that these uniform resource locators included in themselves a protocol, the host to connect to, and the document to retrieve. So this is one of the clever ideas that the web came up with, because we used to have to pick a program like FTP or telnet or SMTP or whatever, then we had to go to the right host, and then we had to talk to that host a certain way. HTTP is a really simple protocol, invented in 1989 and 1990 by Tim Berners-Lee and Robert Cailliau at CERN. They created a protocol that we have grown to know and love and use for way more than retrieving documents, as we'll see in the upcoming chapters. So we're going to talk a little bit about what happens when you click on a link in a page. Now, there's all kinds of fancy stuff that can go on, but this is the basics. Let's just imagine for the moment that you start out looking at a web page, dr-chuck.com/page1, and inside that there is a hyperlink. It is an indication that says, when you click here, go to a different page, and in it you see the name of the page that you're supposed to go to. So we click on this link. And that is a browser. This is an application. This is a process, or an app, that's running on your computer.
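Going back to the point that a URL carries the protocol, the host, and the document all in one string, here is a small sketch that pulls those three pieces apart. It uses urlparse from Python's standard library, which is a choice made for this illustration; the URL is the romeo.txt address used elsewhere in this chapter.

```python
from urllib.parse import urlparse

parts = urlparse('http://data.pr4e.org/romeo.txt')

print(parts.scheme)   # the protocol to speak: http
print(parts.netloc)   # the host to connect to: data.pr4e.org
print(parts.path)     # the document to retrieve: /romeo.txt
```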
This is the browser. Okay, and when the browser sees the click inside your computer, the browser makes a connection to port 80 on the web server, dr-chuck.com, and sends the request. This request that it sends is precisely specified by a standard, which we will see in a second. Then the web server does some magic work in here: reads some files, runs some code, does whatever. It constructs an answer to our phone call and sends it back. In this case it sends back a web page in the format of HTML, the HyperText Markup Language, which is different than HTTP. HTTP is the protocol that we're exchanging; HTML is the format of the document we're getting back. This one has an anchor tag with an href, some highlighted text, and the end of the anchor tag. Now your browser gets this back, parses it, renders it according to the rules of HTML and CSS and JavaScript, etc., and makes a pretty web page. This web page happens to have a link back to the first page, and if you click there, it will do this over and over and over again. That is the request-response cycle, and it's governed by a series of internet standards. These are standards that were built from the 60s, 70s, 80s, and 90s and continue to this day, brought to us by a group called the Internet Engineering Task Force, or IETF. The documents they produce are called RFCs, which stands for Request for Comments. The name RFC is kind of a sort of joke, as it were. They're trying to be kind of funny. Funny is not the right word.
It's ironic, in that they're trying to say that even in the protocols of the internet that we've used for several decades, they're always interested in improvements, and that's what the RFC name stands for. They're all named RFC dash whatever. If we were going to cruise around, we could find various RFCs, and this is RFC 2616. It might have been revised since then, but this is, like, a document, and this is what they look like: Hypertext Transfer Protocol, HTTP/1.1. So you're reading this document because you're going to write a browser and you want to talk the application protocol that is HTTP. This is one of many documents that help define what HTTP is. So you look down and look down and say, oh, here's what a request looks like. This is how I'm going to get a document from the server. And you keep reading and you keep reading, and it says you're supposed to have the request method, then a space, then the request URI, then a space, then the HTTP version, and then the carriage return and the line feed. That's what it's saying. And so it looks kind of like this, right? We say GET, then the document, each followed by a space, and there's got to be exactly one space. If you do two spaces, it's going to be quite frustrating.
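The request-line rule just described (method, one space, URI, one space, version, then carriage return and line feed) can be written down as a tiny sketch. The function names request_line and simple_get are made up for this illustration, and HTTP/1.0 is chosen as the version because that is what the hand-typed example in this chapter uses.

```python
def request_line(method, uri, version='HTTP/1.0'):
    """Build one request line: method SP uri SP version CRLF, with exactly one space each."""
    return method + ' ' + uri + ' ' + version + '\r\n'

def simple_get(uri):
    """A complete minimal GET request: the request line followed by a blank line."""
    return request_line('GET', uri) + '\r\n'

# repr() makes the carriage returns and line feeds visible
print(repr(simple_get('http://data.pr4e.org/romeo.txt')))
```

Building the line from its parts, rather than typing it by hand, is one way to avoid the frustrating two-spaces mistake.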
Okay, and so this is an example that you can run on Linux and Macintosh operating systems with no changes. If you install telnet on your Windows box, you should be able to run something like this as well. Telnet is a program that we used in the old days. It used to be how we logged into servers, but because it doesn't encrypt your data back and forth, we don't use it for that anymore. It basically is a program that can open a socket to a host on a port, and I'm saying telnet to this host on port 80. At this point I am connected, and whatever I type on my keyboard is going to be sent to that server. Now, if you're doing this, you probably want to cut and paste this really fast, because if you take too long, most web servers will be like, you're a human, I don't want to talk to humans, I want to talk to programs. So remember to type this fast enough, and then you have to hit enter twice, so that you have a blank line here. Just type this exactly as it's shown. And then, if you do it right and the server is properly configured, the server will give you back some headers. This is metadata about the document you're going to get. For example, it's saying it's got text/html, which means that the remaining stuff is going to be HTML, HyperText Markup Language. It has a blank line and then the actual document, and then the connection is closed. And so you can set this up in a way that you can run it on your own computer and, in effect, hack a web server through the back door. Now, you can't hack the secure web servers this way, and mail servers used to be easy to hack but they're harder to hack now because they challenge you for information. But the reason I'm so obsessed with the command line is that this is how real hackers work, and they know how to talk some of these protocols more directly. And so we think of this beautiful, sophisticated application talking to some other thing, and it's all
pretty, and we've got wonderful clicky buttons and nice usability. But the reality is, like in The Matrix Reloaded here, the kinds of things that really talented hackers are doing use command lines, and they really know what's going on, and that's how they do it. They understand what's going on better than the developers of the computers that are trying to be resistant to the hacking. So I come from a long line of using the command line, and that's why I encourage you to use the command line in this course. The next thing we're going to do is go up into the application layer, and instead of typing those commands by hand, we're going to actually send them from Python and write a very simple Python web browser.
In this section, we're going to write a web browser using Python. We already know how to make a socket from the previous section. We played with the protocol and used telnet to do it by hand, and now we're going to do it in Python. What you're going to find is that it's not that hard. So here we go. The first three lines of this program import socket, make the socket, and connect. Remember, the socket doesn't really have the connection when you make it. Again, we're going to make a stream-based socket, and it's suitable for going across the internet. The connect is like the phone call: connect to data.pr4e.org on port 80. That basically says, extend the socket across and connect to a web server, and so there's got to be a piece of software running there. This will blow up if the software is not running. Okay, so now we've made a phone call. Whether or not the remote side says hello first is up to the application protocol, and in this case web servers say nothing and wait for you to talk first. We're the web browser in this case, so we're going to talk first. And we know what to say because we read the documentation.
We know that we're going to send GET, blah blah blah, space, HTTP/1.0, and then two newlines, return, return. Remember, you had to have a blank line. We'll talk a little bit later about this encode. It's preparing the data to go across the internet. And then we say send it, and so this basically takes that little string and sends it across the network. This piece of software is waiting for it, and then the software goes and reads a file or does some other stuff, and then it starts sending us data back, which we can then choose to receive. So now we write a real simple loop. We're going to receive these things up to 512 bytes at a time, so we're going to loop through receiving up to 512 each time. If we get zero bytes, that means it's the end of the stream; the stream is closed. If you look at the little example from the previous section, you saw "connection closed". When the connection is closed, we get an indication that it is, because we ask for some data and we get zero data. Otherwise, there might be more data, and this will wait if the network is slow. If you put a print statement in here, you will see that this pauses from time to time on a really slow network. If your network is fast, it'll just go blink and be so fast it won't matter. But this is how we go: until the socket is closed, we're going to read this data. And because this data is coming from the outside world, we have to decode it before we print it. Then when we're all done, we break out of the loop and we close the socket. So literally, that is an entire web browser written in 10 lines of Python. And again, this is why everybody loves Python. So this is what this program will show if you run it. The GET is sent. It looks exactly like doing it by hand. You get some headers. Again, this is metadata that tells you something about the file.
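The send-then-loop-on-receive program described above looks roughly like the fetch function below. This is a sketch under stated assumptions, not the course's exact file: the serve_once function and the local listener are stand-ins added here so the example runs with no network, and the canned response body is made up. Pointing fetch at 'data.pr4e.org' and port 80 would talk to the real server instead.

```python
import socket
import threading

def serve_once(listener, body):
    """Stand-in web server (an assumption for this sketch): answer one request, then hang up."""
    conn, _ = listener.accept()
    conn.recv(1024)                      # read the request; this sketch ignores its contents
    conn.sendall(b'HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n' + body)
    conn.close()

def fetch(host, port, url):
    """The tiny browser: connect, talk first, then read until the far end closes."""
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((host, port))
    cmd = ('GET ' + url + ' HTTP/1.0\r\n\r\n').encode()   # request line plus blank line
    mysock.send(cmd)
    received = b''
    while True:
        data = mysock.recv(512)          # up to 512 bytes per call
        if len(data) < 1:                # zero bytes back means the connection was closed
            break
        received = received + data
    mysock.close()
    return received.decode()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
host, port = listener.getsockname()
server = threading.Thread(target=serve_once,
                          args=(listener, b'But soft what light\n'), daemon=True)
server.start()

response = fetch(host, port, 'http://data.pr4e.org/romeo.txt')
server.join(timeout=2)
listener.close()

# The first blank line separates the headers (metadata) from the document
headers, _, body = response.partition('\r\n\r\n')
print(headers)
print(body)
```

One deliberate difference from the loop as narrated: this sketch collects all the bytes and decodes once at the end, because decoding each 512-byte block separately could split a multi-byte character across two blocks.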
In this case, one of the important things is what kind of thing is coming next. There's always a blank line as a break between the headers and the actual data, between the metadata and the data. And then here is the actual text of that romeo.txt file. All of this is coming from the print statement, the print of data.decode(). If you're going to parse this, you have to know that you read the headers up to that blank line. The blank line is your indication as a software developer that the headers have stopped and the actual text begins. And, you know, this actually could be a JPEG or PNG or some kind of image, and the data here would look like gibberish. So if you change that code to go retrieve a JPEG URL, gibberish will come out, and that's exactly what you will see. And so now you have built a very simple web browser. Next I want to talk a little bit about what happens when characters transition from inside your computer, in strings, out across these sockets to servers and then back.
Hello, everybody, and welcome to some worked sample code. If you are interested in the source code, you go to Materials and download this sample code zip. I have this downloaded; it'll be in a folder called code3. On my computer, this is where I'm at, in the code3 folder, and this has a ton of bits of code here. If I do an ls, you'll see I've got all these files, and we'll just leave those there. The one I want to work through right now is this socket1.py code, and basically what we're doing here is we're simulating
what is going to happen in a web browser. The cool thing about the HTTP protocol is that we can do this by hand, and I'm actually going to hack this HTTP protocol. This is going to go to data.pr4e.org and retrieve a document. And so I'm going to do telnet. Now, you can do this on a Mac and Linux, and if you put telnet on a Windows box, you can do it there. I telnet to data.pr4e.org, and I want to talk to port 80. Port 80 is a different port from the one telnet normally uses, a non-standard port for telnet, but what we're doing here is talking to the HTTP port. And so I'm going to be able to hand-send commands to the web server and retrieve a document. I've already copied this string, this GET for romeo.txt, into my buffer, because if I wait too long, this won't work. So here I go. Now I'm going to paste that, and I have to hit enter twice. And that literally was the HTTP protocol. What I typed there was the HTTP protocol, and the web server responds with some metadata about the document: how much data there is, what kind of data it is. A blank line separates the header information from the body of the document. If I was to go to this in a browser, and I turned on the developer console and went to the Network tab (let's make this a little bit bigger), you would see that it retrieves this file, romeo.txt, and it shows us the headers and it shows us the response. So this is all the same, just different ways of doing the same thing, and that is how to do the HTTP protocol. Okay, but now we're going to do this in Python. And so here's the code
we're going to write. We're going to import the socket library and we're going to make a socket. Now, this doesn't actually make a connection. Think of a socket as a file handle that doesn't have any data associated with it yet. Then what we're going to do is reach out and connect that socket to a destination across the internet with the domain name data.pr4e.org. This is a function call with a single tuple as a parameter: tuple sub zero is data.pr4e.org and tuple sub one is the 80, which says I want to talk to port 80. That could fail. It will make the connection, and if port 80 is there, away it goes. Then we're going to actually send the HTTP command. So GET and the rest, these are the HTTP rules, followed by an end of line, followed by a blank line. You saw me do this there: this was what I typed, and I had to type a blank line. If you want to go read the RFCs for how to do this, you can figure this out. The only other thing that's kind of weird here is that we have to add this .encode(). That's because strings inside of Python are in Unicode, and we have to send them out as what's called UTF-8, and encode converts from Unicode internally to UTF-8. So this cmd is a set of UTF-8 bytes that we're then going to send. It still has that same set of characters in it. And now we're going to send it. After we've made the connection, we're going to send these two things, and then we're going to wait. mysock is like a file handle at that point, because it's been opened and we've sent data. The HTTP protocol told us what we had to send and the fact that we did have to send it. So now I have just a simple while loop, and I'm going to ask to receive up to 512 characters and get that back. I will know that this is the end of file if I got no data back. So if the length of the data, the byte array that I got back, is less
than one, then I'm going to quit. Otherwise I'm going to print the data, and I'm going to use this decode, which is kind of the opposite of the encode. What I'm getting is UTF-8 encoded data, most likely, and decode basically converts it to the internal format, called Unicode, that runs inside Python. And so this is going to run a bunch of times, pulling in the blocks up to 512 characters at a time and printing them out. When it's all said and done, we close that connection. And so it's not too exciting: python3 socket1.py. You'll see that Python is now going to do what I did by hand. Now, of course, the interesting thing is that these are all in strings, right? So this way we could write code that does stuff with this. But all we're really trying to do in this particular situation is show how you open a socket, send a command, and then retrieve the data.
Okay, so now it's time to teach you a bit of complexity about text processing. Up till now, we've kind of been ignoring the complexity of text processing. Most of what I've been doing is in ASCII, the Latin character set, the character set that, you know, the United States, Europe, and lots of Western civilizations use. If you go back to the 1950s and 1960s, we were happy to have one computer, and we didn't care what the character set was as long as what you typed on the keyboard came out on the printer. The internal representation didn't matter. As the 70s and 80s came along, certainly the 70s, we needed some interoperability, and so they standardized a character set. But they standardized a character set, certainly in the West, that did not represent everything. If you look at this sheet, basically what it's telling you is, for the various characters: there are some non-printing characters and white space, and then here are some printing characters like the ampersand and the zero, and then the uppercase characters and
then the lowercase characters, and there are 128 of these possible values. There is nothing even for Spanish or French in here. It's also why, by the way, uppercase letters sort lower than lowercase letters, and we saw that in some of the string stuff. What this does is map each character to a number. So a lowercase a maps to the integer 97, which in base 16 is 61 and in octal is 141, and in binary it's this. These are stored as eight-bit numbers, and eight bits is otherwise known as a byte. They're very efficient. You know, when you buy a disk drive, it's megabytes or gigabytes or whatever, and that's how many of these kinds of characters it can store. But unfortunately, this doesn't work for more complex characters. You can figure out these numbers inside of Python by using the ord function. You say, what is the ordinal, the numeric representation, of the uppercase H or the lowercase e? And newline is a character as well, so 10 is the ordinal of newline. This has to do with sorting, in that lowercase e is higher than uppercase H, and that's just because in the simplest of sorts we just sort them numerically. If you go back to the previous little sheet, newline is this 10 right here. It's a line feed, and that's why when we ask for the ordinal of newline we get a 10. And so again, in the early days, when strings were simple,
we just represented them as one byte per character. But the problem is that as we have gotten more complex, in today's modern world, it's simply unacceptable to say that the only thing computers can understand is ASCII. And so this leads us from the simplest of character sets to a super complex character set called Unicode, which has room for over a million characters, for every language and every character set. Because there's so much space in Unicode, it's easy to take very small variations of characters and give them their own space. It's so large that you can have pretty much any character that you want. So that's Unicode. The problem is that if we sent Unicode across the network in its full form, it would be way too large. That would be UTF-32, which instead of being one byte per character would be four bytes per character. It would take all of the data that we build and make it four times larger, and that would be very difficult. And so what they've come up with is ways to compress this. UTF-32 is really sort of the full Unicode, pretty much. UTF-16 is this weird in-between thing that uses two bytes for most characters and four bytes for the rest.
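The ord lookups and the "four times larger" point above can be checked directly in Python. This is a small illustrative sketch; the sample string Hello is made up, and the utf-32-be codec name is used so that the byte count is not inflated by a byte-order mark.

```python
# Ordinal (numeric) values of individual characters
print(ord('H'))     # 72
print(ord('e'))     # 101
print(ord('\n'))    # 10, the line feed

# The simplest sort compares those numbers, so uppercase sorts before lowercase
print(sorted(['e', 'H']))   # ['H', 'e']

# The same five characters as one byte each versus four bytes each
text = 'Hello'
ascii_size = len(text.encode('ascii'))       # 5 bytes
utf32_size = len(text.encode('utf-32-be'))   # 20 bytes
print(ascii_size, utf32_size)
```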
UTF-16 is used in some countries, but the best practice for moving data across the internet, or in a file that you're going to move between computers, is what's called UTF-8. What happens is that the others are essentially fixed length: ASCII is one byte, UTF-16 is mostly two bytes, UTF-32 is four bytes. UTF-8 has dynamic length, meaning that a character is one to four bytes. And if it's only one byte long, it's perfectly compatible with ASCII, meaning that an ASCII file is also UTF-8. And so here's this little sheet, and it's not critical that you understand this graph too much, but basically as time passed, from 2000 with the internet coming out up to 2014, the documents on the internet that you might retrieve became pretty much overwhelmingly UTF-8. So UTF-8 is the recommended practice now, and it's sort of a compression. UTF-8 can represent all the things UTF-32 can represent. It's just a compression of it, with an overlap with ASCII, which is awesome. It's what you want. So in Python, we have always had sort of two ways of representing strings. In Python 2, the normal string was a byte string, an ASCII string, a Latin string, and if you wanted to represent Unicode, there was a separate kind of object that you would use. In Python 3.0 and later, one of the main features was to make Unicode and string the same. That means that inside of Python 3, when you have a string variable, it's Unicode, whereas inside of Python 2 it was a byte variable. Now, we have this notion separately, in both Python 2 and Python 3, of byte variables, and byte variables are in effect an array of bytes. So if there are three characters in it, a b c, that means it's three bytes.
It's three bytes long, whereas a three-character string might be anywhere from 3 to 12 bytes long. So in Python 2, bytes and strings are the same and Unicode is weird, and in Python 3, strings and Unicode are the same and bytes are weird. Okay, so that's what we've got to deal with. There will be times when we get bytes back from APIs when we call things, and we then have to figure out what kind of thing those bytes contain, because the bytes might contain ASCII, they might contain UTF-8, they might contain various things. Internally, all the strings in Python 3 are Unicode. Most of the time, if you're inside the program or reading and writing files, things just work, and that's why we haven't mentioned it. But now that we're talking over sockets, and we're talking to the sort of random world out there, we have to be a little more aware of the data we're dealing with. The good news is that 98% of the time, or 95% of the time, it's UTF-8, which also covers ASCII, and so it's quite nice. But we have to be aware of this. If we are going to take data that comes off of the network in bytes, then we have to make sure that we interpret it, or decode it, in the right way so that internally the strings, which are Unicode, are properly represented. And so that's why when we read data in from a network connection like a socket, we have to say, hey, decode it.
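To make the "one to four bytes" and "3 to 12 bytes" remarks above concrete, here is a short sketch. The four sample characters are choices made for this illustration (a plain letter, an accented letter, the euro sign, and an emoji), written as escapes so the file itself stays plain ASCII.

```python
# One character each, needing 1, 2, 3 and 4 bytes in UTF-8
samples = ['a', '\u00f1', '\u20ac', '\U0001F600']
widths = [len(ch.encode('utf-8')) for ch in samples]
for ch, width in zip(samples, widths):
    print(ascii(ch), 'is 1 character and', width, 'byte(s) in UTF-8')

# A three-character string can be anywhere from 3 to 12 bytes
shortest = len('abc'.encode('utf-8'))
longest = len(('\U0001F600' * 3).encode('utf-8'))
print(shortest, longest)

# str and bytes are different types in Python 3
word = 'abc'
raw = word.encode()          # UTF-8 is the default
print(type(word).__name__, type(raw).__name__)

# Pure ASCII bytes decode the same way under either name
print(raw.decode('ascii') == raw.decode('utf-8'))
```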
There's a couple of things going on at that moment of decode, and this is where we're doing it. We have to manage this in this code: before we send this stuff, we're going to encode it, which takes a Unicode string and turns it into UTF-8 bytes. There's actually a parameter here that lets you pick something different than UTF-8, but almost no one ever does, though you might have to for certain situations. So that says we're going to encode this into UTF-8 before we send it, and then when we get something back, before we print it, we're going to decode it, and that's how this ends up working out. If you look at the documentation, you will see that sometimes it says a thing is a string and sometimes it says it's bytes. You take a byte array and decode it to get a string, and you take a string and encode it to get a byte array, and that's what we're doing. So you can think of the process this way: the network has these mostly UTF-8 resources. If it's ASCII, that's okay. Let's start with the send. Up here we have a Unicode string. It's a Unicode string even though there are no special characters in it, no Asian characters or French characters. Before we can send it, we have to turn it into UTF-8. If it had Asian characters, it'd be okay, because the encode would set it up just right so that the UTF-8 would be right. So we encode it first, and that's cmd. This is now bytes. Okay, cmd is bytes, and then we actually send the bytes, and that goes across the network. We get back our thing and we receive into data. Well, data is bytes, not a string. It's bytes.
We can ask how big it is. It functions kind of like a string and it has len, except that it is counting bytes, not characters, because some of it might be multi-byte UTF-8. And then all we have to do is say decode. Again, if you were dealing with a situation where you weren't expecting it to be UTF-8 or ASCII, you could tell it UTF-16 or something, and it's more complex. But the simple thing is to just say: I'm going to clean up my data on the way in by running it through decode, and I'm going to encode stuff on the way out. Sockets are the place where this comes into play, and so you'll see we'll always do this encode and decode every time we're sending data between the outside of Python and the inside of Python. So now that we've talked a little bit about character sets, we're going to make this even easier, so you don't have to use sockets. urllib is a bit of Python code in the library that does all this socket stuff for you.
Okay, so now we're going to write a web browser again in Python, but it's going to be even shorter than what we did before. We did it in 10 lines using sockets; now we're going to do it in four lines with urllib. urllib really exists because the idea of opening a connection, sending a GET request, sending the newline, retrieving the stuff, breaking the headers out, doing all this stuff, is so common. Why not put it in a library to save ourselves some effort? So here's how we do it. We're going to import this library. We had to import socket before, but we're going to import urllib now. And so this is really quite simple. It's, like, elegantly simple. You say urllib, that's the library; request, that's a module within the library; and urlopen, that's a function. So we call urlopen and give it the URL. Now, that's a string, which it's going to encode automatically for us. So it's taking care of all kinds of pretty things for us.
It does the GET, it does the encode. Look back at that previous code; that's kind of what urllib is doing for us. Okay. Now, what urllib also does is make the connection, encode the GET request, and then actually retrieve, at this moment, all the headers and keep them for you for later. You can get the headers, but we're not going to see them here. It returns to you an object that looks pretty much like a file handle, because you can put it in the for clause after the in. Now it's going to run that loop one time for every line of this file. The lines we're going to get back are bytes, and so we have to say decode. It doesn't do that for us automatically. We have to decode them ourselves, and that's because we might need to decode them with a particular character set. Then we're going to do our strip, and we're going to just print this out. So that's just like open a file, read through it, and print it. This is open a URL, read through it, and print it, and that's as simple as it is. And so that's what happens. This is romeo.txt, and it prints out. Now, the thing to notice is that there are no headers here. The headers have been sort of consumed in the urlopen. Again, there is a way to say, hey, give me my headers, but for now this is just going to eat the headers and keep them. Then you get to read all the data, and this loop runs four times and out come the four lines. You can go ahead and run this one. It's super easy. I mean, literally super easy. And you can do anything you want; I mean, treat it like a file. You just have to remember to do the decode bit when you treat it like a file. So in this next bit of code, we import it, we're going to open it, we're going to make a dictionary, we're going to loop through, and we're going to split it.
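The open-a-URL, loop, decode, strip pattern described above might look like the read_lines function below. To keep the sketch runnable offline, it opens a file:// URL pointing at a temporary copy of a four-line text; that temporary file, the ROMEO constant, and the read_lines name are assumptions made for this illustration. With a network connection, passing 'http://data.pr4e.org/romeo.txt' to read_lines is the case the lecture describes.

```python
import pathlib
import tempfile
import urllib.request

# Assumption for illustration: a local four-line stand-in for romeo.txt
ROMEO = ('But soft what light through yonder window breaks\n'
         'It is the east and Juliet is the sun\n'
         'Arise fair sun and kill the envious moon\n'
         'Who is already sick and pale with grief\n')

def read_lines(url):
    """Open a URL like a file and return its lines as decoded, stripped strings."""
    lines = []
    with urllib.request.urlopen(url) as fhand:
        for line in fhand:                       # each line arrives as bytes
            lines.append(line.decode().strip())  # bytes -> str, then drop the newline
    return lines

with tempfile.TemporaryDirectory() as folder:
    path = pathlib.Path(folder) / 'romeo.txt'
    path.write_text(ROMEO, encoding='utf-8')
    lines = read_lines(path.as_uri())            # a file:// URL, so no network is needed

for line in lines:
    print(line)
```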
We have to add the decode, because that line is bytes, not a string. Then we're going to go through the line, and for each line we're going to bounce through the words. The inner for loop is bouncing through the words, and then we go to the next line. We build ourselves a dictionary and we print that dictionary out. Now, other than importing this, opening it differently, and doing the decode, this is exactly how we would process a file. And so by using urllib, you really sort of reduce the complexity of retrieving and reading network resources to the same complexity as reading and dealing with a file locally on your hard drive, which is kind of pretty. So one of the things we can then do is read web pages. That was a text file, but you can get HTML too. Here's how you read a web page, and it's the same kind of code. We open a URL. This one happens to have HTML in it, and we read through it and out comes the HTML.
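The word-counting loop described above can be sketched as follows. The count_words name is made up for this illustration, and io.BytesIO is used as a stand-in for the handle that urlopen returns, on the assumption (consistent with the description above) that all the loop needs is something that yields lines of bytes. With a network connection you could pass urllib.request.urlopen('http://data.pr4e.org/romeo.txt') instead.

```python
import io

def count_words(fhand):
    """Count words from any file-like object that yields lines of bytes."""
    counts = dict()
    for line in fhand:
        words = line.decode().split()    # decode first, because line is bytes, not str
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return counts

# Assumption for illustration: in-memory bytes stand in for the network response
fake_response = io.BytesIO(
    b'But soft what light through yonder window breaks\n'
    b'It is the east and Juliet is the sun\n'
    b'Arise fair sun and kill the envious moon\n'
    b'Who is already sick and pale with grief\n')

counts = count_words(fake_response)
print(counts)
```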
Remember that the headers are there, but they've been eaten by urlopen for us. Now we could write a browser that would parse these less-thans and greater-thans and make links, etc., etc. So if you can come up with ways to find these links, you could actually write a bit of code that would have a loop that would go open a page, pull out the links, open a new one, pull out the links, open a new one. You could make a program that would retrieve a page, find the links in the page, and then retrieve those links, and we'll actually do that before the end of the class. Python is a very popular language at Google, and I think it's a pretty safe bet that the first crawler they wrote to crawl the web and build the index was in Python, because literally that's all it takes to read web pages and pull those web pages into your web crawler database. So I don't know, are those the first four lines ever written at Google? Who knows? The next thing that we'll talk about is how you handle that HTML. HTML is kind of yucky and nasty, and so it's not as simple as regular expressions. Regular expressions might help, string parsing and split might help, but it's just too crazy. So we'll talk a little bit about how to use a library to make HTML parsing a lot easier.
We are going to be talking about some code. If you want to download all the code, it's right here. It's all in a single big zip file with all this sample code. The one I'm going to talk about is urllib1.py. It is not very exciting.
It's short. That's what's kind of nice about Python code. If we go and take a look at the code we played with just previously, which is the socket code, the idea here is that urllib is something Python provides for us to make socket communications and HTTP communications a lot better. This is making socket calls underneath, but there's a library that makes it quite simple. We have to do some imports, so instead of importing socket we'll import these. Then we create a handle: we call urllib.request.urlopen and just pass in a string. So we're not encoding this, we're not sending the GET command. All the stuff we did in the previous sockets example is gone, and then we can just put this in a for loop. We're not using the lower-level read and write code. We're just using a for loop, and that literally is going to read the text line by line. The line does come back as an array of bytes, so we have to do a decode, but then we've got a string and we can do a strip on it. So this is, like, super simple. So there we go. Now, the interesting thing is that you also don't see the headers. We just read the contents. It turns out that in urllib, and we'll see this in a later, more complex application, you can get the headers if you want, and you can get various other things. So that's a simple urllib tool. Now we can also use this in urlwords.py to show you something quite interesting, and that is, if you look at this from right here, other than the decode, this is exactly the code we wrote to compute the word counts. Right, so other than this line.decode(), this is just: open something up, in this case a URL; create a dictionary; loop through each of the lines in that thing; decode them and then split them. Once you do line.decode(), this is now a legitimate internal Python string.
We split it, we run through the words and run the counts, and so this is exactly like the code that we did before to run counts. So: python3 urlwords.py. And that gives us a dictionary which is the word frequency, and we could do all kinds of crazy stuff in here, with sorting and all kinds of things. The important thing is that, other than the need to decode these lines when you first get them, urllib makes URLs function inside Python very much like files. So these are short and to the point and very simple, and I hope that they were useful to you.
So now we're going to talk about what you would do with the web page once you've retrieved it in a Python program. We call this web scraping. Web scraping, or web spidering, is the act of retrieving a web page, extracting the links from that web page, making a queue of unretrieved links, and then moving on. And eventually the idea is, if you had enough time, energy, bandwidth, and storage, you could find your way to most of the web pages on the internet that point to or are pointed to by other web pages. And so you might have all kinds of reasons to scrape data. You might have a blog that you posted, or maybe you put some data in a system and the system is being shut down because it's being retired. You can do all kinds of things.
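The spidering loop described above (retrieve a page, extract its links, queue the ones not yet retrieved, move on) can be sketched in a few lines. This is only an illustration under stated assumptions: `crawl`, `fetch_links`, and the in-memory `fake_web` dictionary are hypothetical names, and the "web" here is a dictionary so that nothing touches the network.

```python
from collections import deque


def crawl(start, fetch_links, limit=100):
    """Visit pages breadth-first; fetch_links(url) returns the links on that page."""
    queue = deque([start])   # links found but not yet retrieved
    seen = {start}           # everything ever queued, to avoid repeats
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny made-up "web": each page maps to the links it contains.
fake_web = {
    "page1": ["page2", "page3"],
    "page2": ["page1", "page4"],
    "page3": [],
    "page4": ["page1"],
}
order = crawl("page1", lambda url: fake_web.get(url, []))
print(order)
```

In a real spider, `fetch_links` would retrieve the page and parse out its anchor tags, and the `limit` guard matters because the queue otherwise grows as fast as the web does.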
You could write a little thing. I was just talking to somebody who wrote a thing to retrieve something, check it, and then send a text when something changed. All kinds of stuff. Or you might make yourself a search engine. But be careful. Not all websites are happy about you using a robot to retrieve their content. Some websites, as we'll see, demand that you're logged in, and they track what you do, and if they think you're doing something bad they will shut your account off. Other websites will track what you're doing without you logging in, but then shut your address off. And so you have to be careful. You should read up and figure out which sites allow you to scrape them. I have some sites that I've set up that you can play with so that it's legit. So parsing HTML is difficult. For some of the simple examples you could probably write a regular expression, or certainly some splitting and whatever, and what you would find is you would write that code and retrieve your first five web pages and it would seem to work, and then it would encounter some really weird but legitimate HTML, or maybe even slightly broken HTML. The web is full of broken HTML, and your browsers just look at it and go, oh, wow, more broken HTML, but they don't put up error messages, and so people just leave broken pages up. But your Python program is going to see those broken pages. So what you would do is you'd be like, oh, here's a new weird way to do an anchor tag, I'll change my code. And then you run for another hundred pages and, oh no, here's a new weird way to do an anchor tag. The problem is that you're going to find a lot of different ways to mess up an anchor tag. And someone's already done that.
There's a piece of software called Beautiful Soup, and we have instructions on how to install it. Really what it is, is somebody spent months figuring out all the nasty things that could happen, compensated for them, and gave you a nicely wrapped interface that just says: look, you give me the HTML and I'll give you back the tags, okay? And so it's called Beautiful Soup, and you have to install it. There's a couple of ways that you can install this. If you're good at extending your Python, you can just install Beautiful Soup for all Python programs. If you can't change your computer's configuration, because you're on a school computer or you're using a USB stick or something, then there's a way to download this file that I've created called bs4.zip. What you do is you end up with your file called, say, urllinks.py and then a little folder called bs4, which is a folder that has a bunch of files in it from the zip file. And then you can run it and it'll pull it in. You'll write from bs4 import BeautifulSoup, and that is either going to pull it in from the folder or, if you have installed it using the Python installer, it will also just work and you don't have to put this folder in. So it's up to you; you can do it either of the two ways. So this is a little bit of code. Now, Beautiful Soup is a complex library, and so just because this looks easy, for doing other things in Beautiful Soup you might have to actually read a bit more to figure it out. But we're going to just read this. We're going to import Beautiful Soup, we're going to ask for a URL right here, we are going to take that URL, we're going to open it with urlopen, give it the URL, and read the whole thing. That means we're not writing a loop. We've read the whole thing.
That's okay as long as you know that the file's not too large. And then we're going to pass in the data we got back, and this is going to be bytes, but Beautiful Soup knows all about bytes and all about UTF-8 and it figures that out. You just say: hey, take that stuff I just got, tear it apart as HTML, and give me back an object, a soup object. Now the soup object is something that you can run queries against. So it parses it, it deals with all the imperfections and inconsistencies in this HTML byte array, fixes that, and gives that back. There are various things you can do, and you've got to go look at the Beautiful Soup documentation; there could be a whole class on Beautiful Soup. So here's one thing you can do with this object: you can call it like a function and say, hey, give me back the anchor tags. Anchor tags, of course, are the tags that say a href equals blah blah blah, through slash a. So all of this is an anchor tag. And then we're going to loop through the tags, because there could be more than one of those anchor tags in the file, and we're going to pull out that href, and that's what this does. We're going to loop through all the tags and print out the href. So if you tell it to go to dr-chuck.com, it will show you the one external link in dr-chuck.com. And so I've got an assignment that goes into that in some more detail.
But this chapter has been a whole bunch of interesting stuff. We started with the TCP/IP model and talked about sockets, which are phone calls between computers, and then how application protocols are developed to say what we say on those phone calls. We then explored the HTTP protocol, which is the one you're most likely to see, and then we played with all this in Python and saw that Python is really good at this. You can write extremely simple and small programs to do some extremely complex and powerful things. And again, that's why people like Python: it makes the complex simple.
We're going to do a little bit of sample code. If you're interested in getting the sample code, you can download this zip here at Python for Everybody dot com, materials.php, and you will get all the files that I'm looking at here. The one I'm going to play with today is the file called urllinks.py. So the first thing you've got to do before urllinks.py works is install Beautiful Soup, and I've got some simple instructions at the beginning of the file. One way to do it is to use the Python install process to install Beautiful Soup for all Python applications. If you are the owner of your computer and you're going to use Beautiful Soup a lot, it's a fine idea to do that. But I want to show you a simpler way, if you don't own your own computer and you just want to make it so that Beautiful Soup works. You can download this file right here, the Beautiful Soup 4 zip, unzip it, and put it in the same folder as the code. And so if you look in this folder, I have a subfolder called bs4. That's the unzipped version of this, and it has these things.
I didn't write this code, so I'm sorry if the name is bad, but this is the code for bs4. This is what's in bs4.zip, and it's in the same folder as urllinks.py. And so what happens is when you do this from bs4 import BeautifulSoup, that either can go to this global magic place where Python installs stuff and pull in the BeautifulSoup object, or it can go to the folder bs4 and pull it in. Okay, and so that's how that works. So you have to do one of these two things. I prefer to keep it simple: download and unzip this file, put it in the same folder as this code, and away you go. So from the previous example, we're going to use urllib, of course. And then we're going to pull in BeautifulSoup from the Beautiful Soup 4 library. We're going to get the BeautifulSoup object. Now, if you do this with SSL, if the websites we're going to play with use SSL, you pretty much have to do this little hack, these three lines. Don't worry too much about it. You can Google it or look on Stack Overflow and figure this out, but this is the way that you ignore errors when you have SSL certificate errors. And so we have to add this parameter, context equals ctx, which is this variable that we create. So this part and this part, just do them. You can take them out, actually, but then you won't be able to do HTTPS sites. So let's take a look at what we're doing other than dealing with the HTTPS problem. We're going to ask the user for a URL. We are going to retrieve all the HTML.
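For reference, here is a sketch of what "these three lines" and the context equals ctx parameter typically look like with the standard ssl module. The variable name `ctx` comes from the lecture; the helper `fetch` is a hypothetical wrapper added here so the snippet does not hit the network when it runs. Turning off certificate checks like this is only reasonable for experiments against sites you trust.

```python
import ssl
import urllib.request

# Build a context that ignores SSL certificate errors.
# check_hostname must be switched off before verify_mode is relaxed.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE


def fetch(url):
    """Return the raw bytes of a page, skipping certificate verification."""
    return urllib.request.urlopen(url, context=ctx).read()
```

Leaving out `context=ctx` restores normal verification, which is what you want for anything beyond classroom experiments.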
We're going to do a urlopen just like we did before. Now, this would return us something we could loop through line by line with a for loop, but instead we're going to say, hey, read the whole thing, and that basically returns us the entire document at that web page in a single big string with newlines at the end of each line. And this is not a Unicode string; it's probably UTF-8 bytes. But it turns out Beautiful Soup knows how to deal with UTF-8, and it also knows how to deal with Unicode strings. So what we're saying is: Beautiful Soup, read through this and deal with all the nasty bits, right? HTML is very, very flexible. So take dr-chuck.com/page1.htm. If we take a look at the source of this with view page source, and make this bigger, you know, you might be able to do regular expressions, but HTML does things like break stuff across lines. There could be a line break here. There could be all kinds of things, right? And so writing regular expressions or splits or whatever is really hard for HTML. And so someone has written this, it's called Beautiful Soup, and this is the code. Its name is based on a joke from a children's story. Basically, someone has gone through and figured out all the bad things that could possibly happen when you're reading and parsing HTML. So either you use it, or you will slowly but surely discover for yourself all the cases where your own code doesn't work. And so when we look at this line right here, this line at a high level is saying: we're giving you ugly, nasty HTML that could make no sense whatsoever. Please read it, apply all the brains that you have, figure all the weird stuff out for us, and give us back an object. I happen to call it soup; you don't have to call it soup. It's an object that is a proxy for that HTML, but this soup object is clean. And so what we can do is retrieve all the anchor tags. We can talk to this object and ask it: give me the anchor tags. What's an anchor tag?
Well, if we take a look at this source, the anchor tag is the a through the slash a. That is the tag: it is the tag itself, it is the attributes that are on the tag, it is the text within the tag, and everything, and so that's what we're going to get. Now, I call it tags, plural, not because plural matters at all, but because we're going to get a list of tags. This web page has only one, but lots of pages have lots and lots of tags. If we look at, say, dr-chuck.com and view source, well, that's kind of small. View page source, right. And we go look for anchor tags. We've got 45 of them, and they all kind of have weird stuff in them, right? So this line will give us back a list of tags. It will give us all the anchor tags in this document. So the tag goes from there to there. And then what we're going to do is write a loop to loop through all the tags. So it's basically hopping through the document, sort of like this. That's what it's doing: hop, hop, hop, hop, hop, hop. And it's pulling out the text of the href attribute. So it's going to pull out this bit right here. Whoops. Oh, darn. That was so cool, because that's a flaw. Look at that. This is my own page. There is no closing quote here. But it's going to work, because Beautiful Soup is like, oh, I know what to do about that, I can deal with that. So let's check to see if that one works, because that's a mistake. But that's one of the things we like about Beautiful Soup. So we're going to read through and then we're going to pull out all the hrefs. So this is probably thousands of lines of code that you really don't want to write. So: python3 urllinks.py. And let's start with a simple one, http://www.dr-chuck.com. And it reads it. Oh, no, that's actually the hard one, because we got a whole bunch. Let's see if the Tsugi one worked. It found that one. It's right after sakaiproject.org. Where is that? Is there another Tsugi? Oh, no, I didn't find that one. That's kind of funky.
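To make "give me the anchor tags, then pull out each href" concrete without requiring an install, here is a stand-in sketch that uses Python's built-in html.parser instead of Beautiful Soup. It is deliberately not the lecture's code and it is far less forgiving of broken HTML, which is the whole reason the lecture recommends Beautiful Soup; with Beautiful Soup the same idea is the soup object called like a function for the a tags, followed by a get of href on each tag. The class name, the sample page, and its example.com link are assumptions for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs for this tag
            self.hrefs.append(dict(attrs).get("href"))


# A made-up page; note the anchor tag is broken across lines.
html = b"""<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://www.example.com/page2.htm">
Second Page</a>.</p>"""

collector = AnchorCollector()
collector.feed(html.decode())   # bytes from the network must be decoded first
for href in collector.hrefs:
    print(href)
```

The loop at the end mirrors the hop, hop, hop through the tags described above: one pass over the document, one href per anchor tag.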
Look, it found it wrong. But that's okay. So you see it found all these and did a lot of nice stuff for us. If we do python3 urllinks.py and do the easy one, http://www.dr-chuck.com/page1.htm, we will only see one. And there we go. Now, the SSL part matters if you are looking at a page that uses SSL. So: python3 urllinks.py, and I'll go to, like, https://www.si.umich.edu, and that will get a bunch of links. And so you'll see all kinds of stuff coming back. And if it wasn't for this bit right here and this bit right here, this HTTPS wouldn't have worked. It's not that that website had a bad URL; it has a certificate that's not in Python's official list, and so the URL is okay. So that gives you a quick summary of using the Beautiful Soup library in Python along with urllib.
Hello, and welcome to chapter 13, web services. So what we've been doing so far is we've been using the request-response cycle. We've learned about sockets.
We've learned about urllib, and we've actually learned how to pull HTML and even flat text off the internet. But what we're going to talk about now is using that same request-response cycle to retrieve information that is specifically designed for programmatic consumption. You know, we had to have this Beautiful Soup, which did a hack job of solving the hard problem of parsing HTML. But why not produce data in a format that makes good sense to a program? Because programs want to talk to each other. If you recall, the whole idea of a socket is to have one application process sending data to another application process. And so if we think about this for a moment, we realize that we have all these programs. They could be written in different programming languages, and they're all connected, and so they might want to send data back and forth through the network: PHP programs, JavaScript programs, Java programs. And so we have to decide on a protocol that is independent of any programming language. We call that the wire protocol, because of what happens if you were to take some connection and watch the exact characters that go back and forth.
That's what you would see if you are monitoring the wire. So that's why we call that the wire protocol. And so the idea is that we have to agree on a format that is going to represent the data, and we can't make it a Python-specific format or a Java format. When we take the data from the internal representation, maybe a Python dictionary, to send it on the wire, we call that act serialization, and that is going from the internal representation to the serial representation, or the wire representation. And here is an example of a person with a name and phone number, using less-thans and greater-thans. This is an XML example. Then at the far end, in a different programming language, it receives this and deserializes it and turns it into some useful structure inside that programming language. And so this is an example of a wire protocol that's using XML, and that's one of the formats we're going to talk about. Another format that we're going to talk about is a format called JSON, JavaScript Object Notation. It is simpler and easier, but it's not as precise and descriptive as XML is. And so while you'll find that JSON is very common in most of the things you run into, especially if you're talking to APIs of one form or another, XML still holds sway in places like documents. So if you look at docx at the end of a Microsoft Word document name, docx means that it's an XML version of the representation of a word processing document.
So the first thing we'll talk about is XML. One of the two ways that we mark up data is XML; the other is JSON. First we'll talk about XML, and we'll talk about XML for a longer time than we talk about JSON. XML stands for Extensible Markup Language. There were a number of markup languages out there in the 90s, ways to send data between computers, and none of them was amazingly better than the others. But in the early 1990s, as HTML came out, it made less-thans and greater-thans, or angle brackets as some people call them, popular as a representation format, and so it was pretty natural that we would find a data representation format that would take a similar approach. So inside XML we're going to talk about tags, we're going to talk about attributes, we're going to talk about data, and we've already talked about serialization and deserialization. Serialization is the act of taking data inside of a computer in one programming language, setting it up for transport, and transporting it across; deserialization is taking it back apart and turning it back into whatever internal data it needs to be in the destination system. So here's some basic XML, and we can take a look at the various things that make up the XML. It's very much like HTML in that we have tags, less-than, greater-than. The difference is we get to name the tags anything we want, rather than the a tag or the p tag or the h1 tag. And there is a beginning tag and an ending tag, and they're bracketed together, and there are syntax errors in XML. Syntax errors in XML are more severe than syntax errors in HTML. It's supposed to be right, and if you send bad XML, it's likely that the far end will not understand it. So we have a beginning tag and an ending tag, so name and slash name are a beginning and ending pair. Then there is the actual textual content, and that is the material between them. And then here's a phone and slash phone, and we have this thing called the attribute: key equals value. The key doesn't have double quotes. The value always has double quotes. This is like href equals on an anchor tag. And sometimes you have what's called a self-closing tag, where you don't actually have a closing tag. You have all the data that you need in the attributes, and so you don't even bother putting in an empty text area and a closing tag. So that is a start tag, an end tag, an attribute, and then a self-closing tag.
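A small sketch of the serialize and deserialize round trip described above, using Python's built-in ElementTree. The dictionary values, the phone number, and the `type` and `hide` attributes are made-up sample data; the point is only to show a start tag, an end tag, text content, an attribute, and a self-closing tag going onto the "wire" as a string and coming back as a structure.

```python
import xml.etree.ElementTree as ET

# Internal representation: an ordinary Python dictionary (sample values).
person = {"name": "Chuck", "phone": "+1 555 0100"}

# Serialize: build the elements, then turn them into a wire-format string.
root = ET.Element("person")
ET.SubElement(root, "name").text = person["name"]
phone = ET.SubElement(root, "phone", type="intl")   # attribute: key="value"
phone.text = person["phone"]
ET.SubElement(root, "email", hide="yes")            # no text, so it self-closes
wire = ET.tostring(root, encoding="unicode")
print(wire)

# Deserialize at the "far end": parse the string back into a structure.
received = ET.fromstring(wire)
restored = {
    "name": received.find("name").text,
    "phone": received.find("phone").text,
}
print(restored)
```

The receiving side could just as well be written in Java or PHP; all it needs is the string and an XML parser.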
Those are some basics of XML. In general XML doesn't care too much about white space. It does in the text areas, so in here it matters and here it matters. But we can indent this a little bit differently, and we tend to indent it in a way to make it look reasonable, although once you have programs sending it back and forth, they tend to send it more compacted, just for efficiency purposes. One of the concepts is that there is a hierarchical structure within an XML document, and there are parent nodes and child nodes. You can think of these as simple elements, that is, a tag and some data, or complex elements that have a tag that includes other tags, some child tags. And there's a couple of different ways we can look at this. The simple and more natural way to think about this is a tree with parent-child relationships. So here we have this a tag on the outside, and that's the top-level one. You can only have one outer tag; you can't have another tag down here. So you have to have one tag that's the root tag for everything in this XML document. And it has two children: the b tag and the c tag. So the b tag is a child of a, and then c has a d and an e tag that are its children. And then the textual data we model as a child of each of those tags, and you'll see in a bit why it's best to do that. So that is the way to think about this: as a tree that represents that XML. If we add attributes to it, this is where you see why it's nice to take the text area and make that be a child of the node. An attribute is a different kind of child. So the text is a special kind of child, and you can literally have more than one attribute. You could have x equals two, zap equals whatever, and these could have a couple of different attributes. The w attribute has a value of five, and that's the five down there. And so you could have multiple ones.
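The tree view described above can be poked at directly with ElementTree. The document string below is a guess at the slide's example (a root a, children b and c, with d and e under c, and a w attribute of 5); which tag carries the w attribute, and the text values, are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the slide's example document.
doc = '<a><b w="5">X</b><c><d>Y</d><e>Z</e></c></a>'
root = ET.fromstring(doc)


def show(node, depth=0):
    """Print each tag, its attributes, and its text, indented by depth."""
    text = (node.text or "").strip()
    print("  " * depth + node.tag, node.attrib, repr(text))
    for child in node:          # iterating an element yields its child elements
        show(child, depth + 1)


show(root)
children = [child.tag for child in root]
grandchildren = [child.tag for child in root.find("c")]
```

Each element holds at most one leading text value but any number of attributes, which matches the "text is a special kind of child" picture above.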
You can only have one text node. Now, in the case of a, you have a whole bunch of text nodes, but these are because there are child nodes. Within one simple node, you can only have one text element. You can also think of XML as paths, and the easiest way is to look down this tree version and follow the path from the parent. So you go to a, then to the child b, and then X, so at position a/b you find X. So a/b is the path from the root. Then a/c/d, that's this one, is the path to Y, and a/c/e is the path to Z. And so you can think of these as paths. Part of what we're doing is coming up with ways to walk through and parse trees of XML data. So the next thing we'll talk about is how we determine if a particular XML document is legal, or meets the contract that two applications have set up.
We're going to do a little bit of code. If you want to get your hands on the code, go to the materials website, materials.php, and download the sample code. The code that we're going to work on today is the XML code, and we need to be able to talk XML to work with web services. And so here's one of the examples from the book. It's xml1.py. Later we'll be pulling XML and JSON from the web, but for now we're just going to put it in a triple-quoted string. So, data. And we're going to use a built-in XML parser in Python called ElementTree. When we say import xml.etree.ElementTree as ET, this as ET gives us basically a shortcut handle for it. And so this is a string. It has less-thans and greater-thans. It looks like structured information, and it is, but really at this point it's only a string now.
We have to call ET.fromstring to read this and give us back a tree object. And this code might blow up right here if there was a mistake in it. Okay, as a matter of fact, I can probably put a mistake in. Let's see if I can delete this, save it, and run this code, and we'll see that it will blow up. Right, and so it blew up: ElementTree blew up, I mean it blew up in line 12 of the code, which is right here. This failed because line eight of the XML string was wrong. So let's put that back in. So now it's properly formed XML. So this tree we get back, I name it tree just because I always name it tree, but you could name it x. So the key is that tree.find goes and looks for a tag, and tree no longer has less-thans and greater-thans in it. It went and turned these into objects within objects within objects. So tree.find('name') says I would like to find the tag name, and that's what this bit is right here. And then .text is going within that and grabbing that text, okay? And if we say tree.find('email'), then that's going to give us this, that's that object, and then .get asks for the contents of the hide attribute, which is the string yes. Okay, and so if we run this now that it's fixed, python3 xml1.py, it will pull in and get the name and the attribute. So it pulled the Chuck out. You get this object and then you kind of dive into that object. And so that's xml1.py: if you've got a tag, you can either get the text out of the tag or you can get an attribute out of the tag. So now let's take a look at xml2.py. Again, we import ElementTree, and XML has always got to have a single outer tag, but this time we're going to have, in effect, a list. Now let's line this up a little better. There we go.
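A sketch in the spirit of the xml1.py walkthrough above. The name Chuck and the hide attribute with value yes come from the lecture; the rest of the sample document (the outer person tag and the phone element) is an assumption. It also shows the "blow up" case: removing a closing tag makes ET.fromstring raise a ParseError.

```python
import xml.etree.ElementTree as ET

# Sample document in a triple-quoted string (partly assumed content).
data = '''<person>
  <name>Chuck</name>
  <phone type="intl">+1 555 0100</phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)                 # string in, tree of objects out
name = tree.find('name').text              # text inside the name tag
hide = tree.find('email').get('hide')      # value of the hide attribute
print('Name:', name)
print('Attr:', hide)

# Deliberately break the XML by deleting a closing tag.
broken = data.replace('</name>', '')
try:
    ET.fromstring(broken)
    parse_failed = False
except ET.ParseError as err:
    parse_failed = True
    print('Parse failed:', err)
```

The error message reports a line and column within the XML string, which is why the lecture distinguishes between the line of the Python code that failed and the line of the XML that was wrong.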
That looks a little prettier. And so users, the fact that it's users doesn't mean anything, but we often come up with semantically meaningful names for these things. Users is going to have as its children a list of user tags. Okay, so the children under users are user and user, and then each of these is a tag. So we want to parse this, and this is a common thing we want to do. And so again, the first thing we do is read the string. It's a triple-quoted string going from here to here. And then instead of doing find, which gives us one tag, we're going to do findall for the user tags that are children of users. And we get back a Python list of the tags, not of the text, but of the tags. So there's one tag and there is another tag, and so we can do len of that, so we can see that we got two. Then we can write a for loop, and item is going to iterate through the user tags that are children of users. So the first time, item is going to be this tag, a tag, remember, and then the second time it's going to be this tag. And so we can do things like find and get, just like we did in xml1. So running this is not too exciting: python3 xml2.py. You see that there are two users; that comes from this print right here. And for the first one, if we go find the text within the name tag within user, then we get Chuck, and then we get the id, which is 001. So we find the id within that item and then we get the text, and then we grab the x attribute off of that. And so we see Chuck, 001, and 2, and then the for loop continues to the next tag and we print that out. Okay, and so that's just a basic run-through of the XML from the chapter in the Python book. Okay.
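A sketch along the lines of the xml2.py walkthrough above: findall returns a Python list of user tags, and the loop pulls text and an attribute out of each one. The first user's values (Chuck, 001, x of 2) come from the lecture; the outer stuff tag and the whole second user are assumptions added so the list has two entries.

```python
import xml.etree.ElementTree as ET

data = '''<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>002</id>
      <name>Pat</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(data)
users = stuff.findall('users/user')   # a list of tags, not of text
print('User count:', len(users))

for item in users:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))
```

The path 'users/user' is relative to the outer tag, so it selects exactly the user tags that are children of users.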
Thanks.
So now we're going to talk a little bit about XML Schema. XML Schema is a language that allows you to decide whether or not a particular XML document meets a contract or arrangement. So you have two pieces of software exchanging data using XML. If they're all working, nobody really worries too much about it. But if all of a sudden you change one side and the other one breaks, whose fault was it, right? Was it the side that got changed or the other side? And so you could argue. So what you like to do before you set up these arrangements between applications is set up a contract. In a way, they're kind of like the RFCs, except that their scope is between pairs of applications. The contract is itself XML. Basically what we do is take an XML document and an XML Schema contract, and then we either say that's good or that's bad, and that's called validation. A piece of software that validates XML when given a schema is called a validator. And so here we have our little XML document. We're passing it to the validator, and then we have a schema contract, which is itself XML. It's a particular kind of XML. That xs:complexType, that's just a tag; colon is a legitimate character for the name of a tag. Name equals person, that's just an attribute. And so XML Schema is a particular format of XML that renders an opinion about what XML is supposed to look like. There are a number of different XML schema languages. The one we're going to look at is one that came a little bit later.
That's very common. It's called XSD, which is the World Wide Web Consortium's schema specification. Often you'll find files that have suffixes of .xsd that contain XML just like what we're going to show you. So if you recall, there are simple elements, which have text children, and then there are complex elements, where other nodes are children of other nodes. And so here we have a little bit of XML and the XML Schema that goes with it. What we're saying is the outer tag of this legitimate XML is supposed to be a complex tag with a name of person. And so there we go. That looks good. Then there is a sequence, and then there is a simple element with the name last name. Looks good, and it's a string; that looks good. Another tag named age that's of type integer; that's good. And then a thing that's called date born, and there it looks like a date. So we check all these things and we can basically say, yep, that is a good XML document according to this schema. You don't have to write this yourself, generally; there is software that reads these two things and comes back with a true or a false, and might even have some detail as to what went wrong with this particular schema. Here's some more that you can do with a schema. We can have a complex type. We have a sequence here. We have a string full name and a string child name. But we have this minOccurs and maxOccurs. So minOccurs is the minimum number of times it can occur, and maxOccurs is the maximum. So minOccurs equals one, maxOccurs equals one means it's required. And so this is required, and we don't have two of them. Two of them would be an error. One of them is fine. So that's good here. The child name is minOccurs zero, maxOccurs 10. So we have four here, and so that's good too.
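Python's standard library does not include an XSD validator, so the following is only a hand-written illustration of what the minOccurs and maxOccurs constraints mean, not real schema validation. The function `check_occurs`, the `rules` dictionary, the tag spellings `full_name` and `child_name`, and the sample names are all assumptions.

```python
import xml.etree.ElementTree as ET


def check_occurs(parent, rules):
    """rules maps a child tag to (min_occurs, max_occurs); returns a list of problems."""
    problems = []
    for tag, (low, high) in rules.items():
        count = len(parent.findall(tag))
        if count < low or count > high:
            problems.append(f"{tag}: found {count}, expected {low}..{high}")
    return problems


# full_name is required exactly once; child_name may appear 0 to 10 times.
rules = {"full_name": (1, 1), "child_name": (0, 10)}

good = ET.fromstring(
    "<person><full_name>A. Parent</full_name>"
    "<child_name>One</child_name><child_name>Two</child_name>"
    "<child_name>Three</child_name><child_name>Four</child_name></person>"
)
bad = ET.fromstring(
    "<person><full_name>A. Parent</full_name>"
    "<full_name>Another Name</full_name></person>"
)

print(check_occurs(good, rules))
print(check_occurs(bad, rules))
```

A real validator reads the rules from the .xsd file itself and also checks types, ordering, and nesting; this sketch only counts occurrences.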
And so that is another kind of XML Schema constraint that you can have. Here's a few other data types that we can do. We've done the string, we've done the date. The date looks like this: dates are four-digit year, two-digit month, two-digit day, with dashes. Now, there's lots of different ways to represent dates, but the nice thing about this, and you have to put the zeros in, so zero nine for September, is that these are sortable as strings. If you do all your dates this way, they're sortable as strings. So you could argue about what is prettier, but for computers we don't worry about that. We're arguing about what's the most functional. And then the dateTime is that same date format with zeros, followed by the letter T, and then followed by hours, minutes, seconds, zero-filled, right, so nine o'clock is zero nine, and then the time zone, which we'll talk about in the next slide. You can have decimal numbers and you can have integer numbers as well. And so we are able to render an opinion as to what is good and what is bad in the resulting XML. So dates are kind of interesting. Again, we have lots of different formats of dates, you know, nine slash ten slash two thousand and two, right? That's one format of a date. There's another format of the date, which is, you know, 12 December two zero whatever. And so this is how people show dates. Computers don't want to have all those different dates and don't want to figure those out. They have libraries that produce dates and make them look pretty for particular locales, but computers really want dates that work best for them.
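A quick check of the "sortable as strings" claim above, plus the date, letter T, time layout with a trailing Z. The specific dates and the timestamp are arbitrary sample values.

```python
from datetime import date, datetime, timezone

# Zero-filled year-month-day strings sort the same way the dates do.
dates = [date(2002, 9, 10), date(2001, 12, 25), date(2002, 1, 5)]
as_text = [d.isoformat() for d in dates]
sorted_text = sorted(as_text)
print(sorted_text)

# dateTime: the same date, the letter T, zero-filled time, then Z for UTC.
stamp = datetime(2002, 5, 30, 9, 30, 10, tzinfo=timezone.utc)
text = stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
print(text)
```

Without the zero fill the string sort breaks: "2002-9-10" would sort after "2002-10-01" because the character 9 compares greater than the character 1.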
So we just say, okay, we're going to have this: year, month, day, then T, then zero-filled hours, minutes, seconds, and then the time zone. Now, computers even prefer a particular time zone. I don't know if you've used something like Google Calendar, but if you take a flight or a train trip and switch to a different time zone, everything switches. That's because Google Calendar is not storing the dates in your current time zone. It's storing them in what we call universal time, or Greenwich Mean Time. Zulu time is another word for that, and Z means this time is the time in London, England: Greenwich Mean Time. The thing is, that means if this data moves between time zones, or crosses the International Date Line, or daylight saving time kicks in, or anything like that, none of that changes. So we have this internal date and time, which is very common in situations where computers are exchanging data, that then gets shown converted to the local time zone and the local format. That's the right way to do it, and there's a standard for how dates and times are supposed to look. So here's another little example; let's see what we've got. If you see this little question mark xml, that's not a problem. That's just a way of putting a header on the whole document that says it's an XML document, telling it that it's a UTF-8 document. That's not really a tag; it's sort of like a marker on the file. You can put that there and it doesn't harm the XML. The outer tag is this tag right here, xs:schema. What else have we got? We've got an address. We've got a string, string, string, string, string.
We've seen all those. Here we have country, and we're going to have a restriction that basically says this is a simple string, but you have to list one of these four as the country code. And so here we are down here, and that's UK, and that's UK, so that is valid XML. Another couple of examples here. Let's see: a string, string, string, string, string. maxOccurs unbounded means an infinite number; there's no limit on the number. minOccurs of zero. xs:positiveInteger: we've seen integer, but you can also say it's got to be a positive integer. A decimal, we've seen that. And then use equals required is just another statement that you can make. I'm not trying to get you to the point where you can write XML Schema, just to give you a sense of the kinds of statements we can make when we're talking about what is and is not legitimate XML.
So let's talk a little bit about how we might talk XML inside Python. Like most things in this extended part of Python, we have to import something, and this is the name of a library: xml.etree.ElementTree. And then as ET: this ends up being a shortcut so we don't have to type these long things. ET is the same as typing all that; it's almost like a macro. Now, normally this XML is going to come from somewhere on the network, but I'm just going to put it in a string. I'm using a triple-quoted string, and that means the string starts here and ends here, and all these newlines are actually part of the string. So this is kind of like I opened a file and read the whole thing in, but just to keep this totally self-contained, I'm putting it in a string. The XML would normally come from some server on the other side of the network. We would get this XML. That's how it would normally work.
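Since the slide itself is not visible in a transcript, here is a minimal sketch of the kind of program being walked through. The import and the triple-quoted string follow the description; the exact contents of the XML (in particular the phone tag) are an assumed reconstruction, not necessarily what was on screen.

```python
import xml.etree.ElementTree as ET

# Sample XML held in a triple-quoted string; normally this would arrive
# over the network. The phone element is an assumption for illustration.
data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">+1 734 303 4456</phone>
  <email hide="yes" />
</person>'''

# Parse the string into an element object; malformed XML raises an error here.
tree = ET.fromstring(data)

# find() returns the whole tag; .text narrows it to the text inside.
name = tree.find('name').text

# Attributes are fetched by name with get().
hide = tree.find('email').get('hide')

print('Name:', name)
print('Attr:', hide)
```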
Okay, so this is the XML right there, and we parse a string of data: we call ET.fromstring. We're passing in the less-thans, the newlines, the greater-thans, all of this stuff, and this could have syntax errors in it. This might blow up if, say, we forgot the little slash or something. But this doesn't have a syntax error, so what we get back is an object. I just happen to call it tree because it's kind of like the tree version of the XML. That is an object we can then query to pull data out of. So we say tree.find and look for a tag named name. That finds the whole thing, the tag and the text. If we want the text we add .text, and that refines it to only the word Chuck. Similarly, tree.find of email finds the email tag, which is this tag. It has an attribute, and to get any of the attributes you say .get. There's only one text child, but there can be many attributes, so you have to tell it which one you want. So this bit right here, all of that, will resolve down to that string, yes. That's what you're going to get there. And so you kind of build up these little find and get method calls. This is clearly not a full introduction to ElementTree, but you get the idea that you dive down in with these method calls to get little pieces out and parse all of that.
Here is a different example. In this one, again, we're using a triple-quoted string. We always have a single tag on the outside. Then I have a complex type of users, and in it there are two user objects. So this is kind of like a list, right? This is more than one of these things.
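The second example being described here can be sketched as follows. The outer tag name, the users/user path, the first user's id of 001, and the x attribute of 2 come from the walkthrough; the second user's values are assumptions filled in for illustration.

```python
import xml.etree.ElementTree as ET

# One outer tag, a users container, and a repeated user tag inside it.
data = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(data)

# findall() returns a list of every tag matching the path.
lst = stuff.findall('users/user')
print('User count:', len(lst))

# Each item is a whole user tag that can be queried like a small tree.
for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))
```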
So this user can occur more than one time. Again, we take this, we pass it into fromstring, and get back an object. The name stuff does not necessarily have to be the same as this outer tag. It's just a variable; this could just as easily be x if I wanted. So now what I'm going to say is, hey stuff, I want to find all tags that match the path users/user. That's going to give me a list of two tags: one tag, two tags, in a list. Now I can print out how many I got, which will be two in this case, and then I can iterate through the list. So item is going to iterate first to this tag and then to that tag. Like in the previous example, we can look for the name tag within there and pull the text out: find the name tag, and then within that find the text. We can find the id tag and pull the text of that out, so that pulls out this 001. And then item, which is that whole tag, dot get x gets the attribute; that gets the two, and the two comes down here. Then item goes to the next one, because item is looping through. So item iterates down to that one and pulls out the name text, the id text, and the attribute x, and pulls all those pieces out. So this is the basic pattern. You saw one where you're tearing into a single thing, and here you're tearing into something that is expected to occur more than one time. And so that's a quick summary of how you talk to XML in Python. Up next, we're going to talk about the other serialization format, JavaScript Object Notation.
So now we're going to talk about the other serialization format, JavaScript Object Notation. Chances are good as you go out there you will encounter more JSON than you will XML. Not that
XML is bad. XML is better for rich and hierarchical documents, whereas JSON is best for just pulling data out of a system and moving it between two systems with a minimum of fuss. This is Douglas Crockford. I have a great interview with him. He's a funny guy, very, very smart. He claims he didn't invent JSON; he discovered it, because it really is based on the literal notation for JavaScript, and it actually looks a lot like the Python literal notation for objects and lists. Now, Douglas Crockford has quite a sense of humor. He wrote this book called JavaScript: The Good Parts. That's the little one right there, and then there's JavaScript the comprehensive guide, and the joke is all the stuff that's in JavaScript that's not too useful. While this is sort of tongue in cheek, what Crockford is really saying here is that JavaScript is a great language as long as you avoid the tricky bits and keep it very, very simple. JavaScript is indeed a great language, but JSON comes from JavaScript. You can read about JSON at json.org. JSON is not an international standard. It's not like an RFC. It really is that Douglas Crockford decided to register json.org and typed in some pages, and people started reading it and people started using it, partly because it was truly derived from the JavaScript literal syntax. So we're all ready to code. Here is some Python that's going to process some JSON. Keep it straight: Python processes JSON.
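A minimal sketch of the kind of program being introduced here follows. The name, email, and hide keys match what the walkthrough mentions; the rest of the JSON text (the phone object) is an assumption for illustration.

```python
import json

# JSON held in a triple-quoted string; normally it would come from the network.
# The phone object is an assumed detail, not taken from the slide.
data = '''
{
  "name": "Chuck",
  "phone": {"type": "intl", "number": "+1 734 303 4456"},
  "email": {"hide": "yes"}
}'''

# loads = "load from string": parse the text into Python objects.
info = json.loads(data)

# The outer curly braces become a dictionary, so plain indexing works.
print('Name:', info['name'])

# Nested objects become nested dictionaries.
print('Hide:', info['email']['hide'])
```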
So again, I'm using the triple-quoted string here. You'll notice the syntax we are using is not angle brackets but curly braces. Within the curly braces you have key-value pairs: name colon Chuck, key colon value, and both sides have quotes. You can also have objects within objects: curly brace, key value, key value. It looks a lot like Python. So this is a structure that has one key-value pair that's a string, another key-value pair that's an object, another key-value pair that's an object, and then these are key-value pairs within those contained objects. So this is a string that, again, probably was retrieved across the network from some other place, and we're going to pass that string into the json library's loads; loads stands for load from string. It reads this and parses it. White space, again, doesn't matter too much here; unless it's in between double quotes, the white space doesn't matter. So it parses it and returns us a dictionary. The thing that's different about JSON is that its structure and representation are simpler than XML. In Python everything comes back as a dictionary or a list, or a dictionary within a dictionary, or a list within a dictionary. It's all dictionaries and lists; it's not a separate structure where you have to do gets and finds and findalls and lookups. It's right there. When we get this back, because this is a curly brace, info is a dictionary, and so we can just use the standard syntax of Python. Info sub name will go find Chuck. If you compare that with the XML, that's just a lot easier. Now when we have info sub email, that's this thing.
So info sub email is that thing, and then sub hide is this, so that's what comes out here. Okay, so it's really nested dictionaries and lists. We haven't seen a list yet, but this is a set of nested dictionaries that it parses. And it's equally simple in other programming languages. This is a little more complex version where the outer element is a square bracket, which means it's going to be a list. We have one, comma, two things, so this is a list of two dictionaries. Again, we take this string and use the JSON parser to read it and give us back, in this case, a list: info is a list, and it's got two items. If we print out the length of info, it'll give us two, and then we're going to iterate through. Item is first going to be this, and then it's going to iterate to this. It's going to print out item sub name, which is Chuck, and item sub id, which is 001. Now you'll notice that there are no attributes, and that's because JSON is simpler. But we can have the x, and it's just another item. So we say item sub x, and that's going to print the two out. Then it'll iterate to the next one and print out the same things for that one. So JSON is simpler because you can't represent as complex a data structure, or you have to compromise and map it into a simpler data structure. But then it is lists and dictionaries, and so once you've got it parsed, it is easier to understand and to make use of. So that was quick.
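The list version described above can be sketched like this. The first entry (name Chuck, id 001, x of 2) follows the walkthrough; the second entry's values are assumptions for illustration.

```python
import json

# The outer square brackets mean the parsed result is a list.
data = '''
[
  {"id": "001", "x": "2", "name": "Chuck"},
  {"id": "009", "x": "7", "name": "Brent"}
]'''

info = json.loads(data)
print('User count:', len(info))

# Each item is an ordinary dictionary; x is just another key,
# since JSON has no separate notion of attributes.
for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])
```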
So that's part of why everyone likes JSON better: once you have come up with a format that you're going to send back and forth, it's easy to make it and it's easy to read it. Now what we're going to talk about is moving up a level. If you've got all these data formats and URLs that you can hit to pull those data formats down, what approach do you take as you start to construct applications that increasingly go from a single application to a networked application?
We're playing with the web services chapter right now. If you want to get the materials for this course, you can go here and download the sample code zip. I've got this all sitting already on my computer. I also have the whole thing in GitHub if you want to get it out of GitHub. The thing we're talking about now is the json1.py example from the book. JSON is kind of like XML except a lot simpler, and that's why a lot of people like it. It's not that JSON is always better, but JSON is better in a lot of situations that don't require the complexity of XML. So we start with import json. json is built into Python, but we have to ask to import it again.
We're using a triple-quoted string to put the JSON in there. JSON looks a lot like Python dictionaries: key-value pairs, key-value pairs. In this case, this is a key and the value itself is another dictionary, or in JSON terms, an object. But again, it's key-value pairs within key-value pairs within key-value pairs, and all these little curly braces have to line up properly. Like all the time, this is a string that we normally would read and decode from the internet, but for now we're just going to have it in there. json.loads says go into the json library, pull out load-from-string, and parse this, which turns this set of curly braces, spaces, commas, and perhaps syntax errors into a structured object. If we had made a syntax error in here, then this would blow up. But if it doesn't blow up, then we have a structured representation. Now, the difference between XML and JSON in Python is that this turns into a Python dictionary with key-value pairs. Once we have this, it's a dictionary, and we can say info sub name; that's the exact syntax we would use to get at a dictionary, and that's going to extract this value out of there. If we want to go in deeper, we can say info sub email, and that's what info sub email is right there, and then sub hide. So that's a dictionary within a dictionary. If we run this, python3 json1.py, it digs in really fast. This is why people tend to like JSON: you read the JSON, which is actually a syntax derived from JavaScript but looks just like the syntax for Python, and a JSON object turns directly into a Python dictionary with nested dictionaries. So now we're going to look at json2.py, and in json2.py we're going to see a list, or an array in JSON terms, but it turns into a list in Python terms. So this is a list of dictionaries. In JavaScript
That would be an array of objects, but in Python it's a list of dictionaries, so we'll just call it a list of dictionaries. Again, we load the string, parsing and looking for syntax errors. So let's just make a syntax error here and run python json2.py, and you'll see where it blows up. It blows up at line 15, which is right here; it's this loads that blows up. Now, you could put a try/except around it to save it, but we're not going to do that. It even complains: it says, look, we were expecting something here in line 11, and that's line 11 of the JSON, which starts at line 4. I'll put my little square bracket back in so it's not syntactically broken. Let's run it again and make sure that it runs, and yes, it does. So this parses it and converts from the JSON syntax into, in this case, a Python list, because it's got square brackets instead of curly braces; the previous example had curly braces. We can then take a len of it. It's an array; it's a list. We see that there are two things in there, and then we're going to iterate through, and item is going to iterate through these dictionaries, that dictionary followed by that dictionary. So the first time it's item sub name, which is this value right here, and then item sub id, which is this value. You can dig right into this, and you're not using get and you're not using find or findall or anything. You're just going at these structures directly, so you can quickly extract this stuff out, and we read through and see the name is Chuck. There are no attributes, by the way. x is two, and so we had to make x a key. If you look at the XML, we had this concept of attributes on the outer tag there. These things are also not named; we just have to know what we're looking for.
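The try/except that is mentioned but skipped in the demo could look like the sketch below. The broken JSON text is made up for illustration (the closing square bracket is removed on purpose), and setting the result to None is just one way to flag the failure; `json.JSONDecodeError` and its `msg` and `lineno` attributes are part of the standard json module.

```python
import json

# Deliberately broken sample: the closing ] is missing.
broken = '''
[
  {"id": "001", "name": "Chuck"},
  {"id": "009", "name": "Brent"}
'''

try:
    info = json.loads(broken)
except json.JSONDecodeError as err:
    # Instead of a traceback, report where the parser gave up.
    info = None
    print('Could not parse JSON:', err.msg, 'at line', err.lineno)
```

The line number reported is a line of the JSON text itself, not a line of the Python file, which matches the point made above about line 11 of the JSON.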
So JSON represents simple structures, but it's much simpler to use. I hope this has been useful to you, and I'll talk to you in a bit about some more JSON.
So the service-oriented approach is a way we approach solving a complex application problem where all the data really isn't present in one computer system. It's somehow spread out, connected by the internet or an internal network. The idea is that some applications just can't contain everything. The perfect example is a travel website that can book you a flight, book you a car, buy tickets, book you a hotel, and do all these things. Well, that travel website is neither a hotel nor a rental car company nor an airline. What it really does is talk to all these services somewhere else on the web on your behalf and make reservations for you. So you have this convenient user interface that says, oh, here's your whole vacation, I'm going to figure all this stuff out. You say go, and it goes book, book, book on all these other systems. Now, it requires a lot of infrastructure, a lot of coordination, and a lot of effort to make sure that your application can talk to these other services out there on the internet, that they have good contracts, and that you know exactly how to send data to them and get data back from them. Initially, when you're building a service-oriented architecture, often you have one application and it's all internal. Often it's all one language, and then maybe you'll say, oh, wait a sec, we want to take part of what we do and put it in a second system, and then come up with a set of rules between the systems, and then more and more and more. So now that we're solving our problem using a series of cooperating applications communicating across the network,
We're going to talk in a little more detail about the notion of what we call web services. In this we're going to take a different perspective: instead of building our application and breaking it into pieces, we're going to have an application that consumes an API from somebody else. So there is some other provider of this API that's not us. If you're going to talk to somebody's data, like Google or Amazon or Twitter, they're going to say you have to use our API. So what's that? An API is a contract that says, look, if you do this and this and this, we're going to give you data this way. They set the rules. They tell you what the URLs are; they'll tell you if it's XML or JSON. This is called the application program interface, and it's something you read and understand. So you go look at the documentation. This is the documentation for the Google Maps API. It turns out that Google knows a lot about maps. It has a lot of data, it knows how to search maps, and it actually provides some of those features so that your application can take advantage of them. I took advantage of this at one point by asking all the students in one section of one of my online courses where they were from. I just let them type in where it was, and then I said, well, I don't know how to code any of that. So I used this API, doing what's called geocoding, to look all those places up and get precise latitudes and longitudes for the ones Google could figure out, and that saved me a lot of work. Now, these are expensive resources, but I could be patient and make use of them. As long as you don't use them too much,
They can be free. We'll talk a little bit more about rate limiting and what's free and what's not in a bit. But you start by reading documentation, and it says do this, hit this URL, hit that URL. If you read that documentation, you will find that there is a URL that you can hit, and they tell you where to go. You go to this URL, you add a question mark, and then you say address equals, and then Ann plus Arbor, and there are all these rules. These are called URL encoding rules, for when you have key-value pairs on URLs. The plus means space and %2C means comma. But don't worry too much about that, because we're going to have a magic library, like we always do in Python, that takes care of this. If you hit this URL, typing it in the exact right way in your browser, you will get back a JSON document. It's an object that has key-value pairs. There is this status, then it has these results, which is a list, and you dive down and eventually you can find the latitude and longitude of the thing that you are looking for. So the idea is, can we write a program that can read this? Here's our little program that reads this, and a lot of it should be comfortable; you've already seen some of this. You import urllib and json. We grab the URL, and then we write a little while loop that asks for a location, and we can type that location in. We've got to concatenate the location onto this URL, and there is a bit of library code called urllib.parse.urlencode that takes the key and the value, so address equals and then whatever text we read in from the user goes in here. It does that URL encoding with the pluses and the %2C, and all that stuff is taken care of, and that is the URL that we're going to pass to urlopen. We print out that we're going to retrieve it; it prints this out, and if you look at this, it's too long.
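The URL encoding step can be tried on its own without touching the network. In this sketch the service URL is a placeholder (an assumption, not the real geocoding endpoint) and the address is a sample value; `urllib.parse.urlencode` is the standard library call described above.

```python
import urllib.parse

# Placeholder base URL ending in '?'; not the real service address.
serviceurl = 'http://example.com/geocode/json?'

# Pretend this came from input().
address = 'Ann Arbor, MI'

# urlencode takes key/value pairs and applies the URL encoding rules:
# spaces become '+', and the comma becomes '%2C'.
query = urllib.parse.urlencode({'address': address})
url = serviceurl + query

print(query)  # address=Ann+Arbor%2C+MI
print(url)
```

Building the query this way, rather than concatenating the raw text, means whatever the user types is encoded correctly.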
It has all that fancy stuff on it. Then we open it with urlopen, and then we read it and decode it. So these two things hit this URL and decode it, and then we've retrieved 1669 characters. Because we've decoded it, data is a string; it's read as bytes and decoded into a string. So we read that many characters, 1669 characters. Then we're going to take this data and parse it with JSON. We might get bad data here; it might blow up, but it might work, and in this case it works. We have an error check that basically says if we got a bad thing, we're going to bail out, but in this case it doesn't blow up. So now we're going to dig through. Let me just go back. So the results sub zero geometry: let me show you how that works. Results is the first key, so this is a dictionary with a key of results, but then it has a list. This list starts here and goes there, and I'm only going to show part of it, but there are many things here. So the zero item is this; this is the sub zero. And then geometry is within that sub zero item. So if we look at that, it is the outer dictionary, the first item in the list, sub geometry.
So that grabs this part right here. Then we're going to go into location and lat, and those are just keys within keys, a dictionary within a dictionary. You see it says sub location sub lat, and that is literally going to pull the latitude out of that complex structure, and then the next line pulls the longitude out. So we pull the latitude and longitude out and then print them. We can go into results sub zero formatted address, and that pulls this little bit out. Now, it takes a little while to write this stuff, and you have to put in a lot of debugging. You don't necessarily figure out this complex bit here at first, but you print it, you don't get what you want, and you say, oh wait a sec, that was an array, so I've got to add a little sub zero there to get the first one out of the array. Eventually you figure it out, and it's not all that difficult. The first few times you do it, it's like, what am I doing? But after a while you realize, oh, I'm just tearing this apart and digging deeper and deeper into this data structure, which I just retrieved over the internet from Google, and I learned something good from that. So up next we're going to talk about how sometimes these APIs protect themselves with keys or signatures, why that happens, and how to solve those problems.
We are doing some code samples here. If you want to follow along, you can download the sample code, which is all in a big zip file. I've got it.
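The "digging deeper and deeper" into the geocoding reply can be practiced offline on a trimmed-down stand-in. The JSON below is a hand-written assumption that only mimics the shape described (results list, geometry, location, lat, lng, formatted_address); the values are rounded and illustrative, not a real reply.

```python
import json

# Hand-written stand-in for a geocoding reply (shape only, values made up).
data = '''
{
  "status": "OK",
  "results": [
    {
      "formatted_address": "Ann Arbor, MI, USA",
      "geometry": {
        "location": {"lat": 42.28, "lng": -83.74}
      }
    }
  ]
}'''

js = json.loads(data)

# results is a list, so [0] picks the first (here the only) entry.
first = js['results'][0]

# From there it is dictionaries within dictionaries.
lat = first['geometry']['location']['lat']
lng = first['geometry']['location']['lng']
location = first['formatted_address']

print('lat', lat, 'lng', lng)
print(location)
```

Forgetting the `[0]` is the typical mistake mentioned above: indexing a list with a string key raises a TypeError, which is the hint that a list is in the way.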
We are going to be working with the Google Maps API. In the old days this Maps API was free and allowed 2,500 requests per day, but now they've made it so that parts of it are behind API keys, and you start using OAuth and stuff. But they haven't put it all behind that; this one address service that we've been using continues to work. The basic idea of an API is you go read the documentation and you find a URL, and this one goes to Google's servers, and you pass in the address. We have to pass in the address using what's called URL encoding. So spaces are pluses, that's a comma, and then that's a space, so we have to pass this in a certain way. But if we do it right and hit this, we're going to get ourselves some JSON back, and that's really cool. Deep inside here we get the real address, you know, a good address, and we get a geometry. We have the location, we've got the latitude and longitude, and we can extract stuff out of here. This one here is still rate limited to 2,500, but it's one of the few parts of the Google Maps API that is not hidden behind an API key. In a later chapter we'll show you how to actually talk with the API key. In the geodata code, geoload.py shows you how to use an API key if you want to jump ahead and take a look at that. But for now, we're just going to take a look at geojson.py, which is going to retrieve one page and tear it apart. So let's take a look. We're going to import the urllib web stuff and import json. So now we're going to use JSON, but we're going to actually pull the data off the internet. I just take that service URL for the Google Maps API; I found that somewhere in the documentation. And then I'm going to have a loop that's going to run forever.
I'm going to ask for the location, and if I hit enter, that's what this is saying: get out of the loop. Then what I'm going to do is concatenate the service URL, which is this, with the result of urllib.parse.urlencode. I give it a dictionary of address equals, and this bit right here gives me the string that puts in this address equals but encodes the spaces the right way. So if you type a space, that bit of code turns it into the plus. That's important. And I've got the question mark sitting here at the end of the service URL. Then we're just going to do a urlopen to get a handle. We're going to read the whole document, and because it's UTF-8 coming from the outside world and we want it turned into Unicode inside our application, we say .decode. We can ask how many characters we got, and we call json.loads. Up till now we've been just doing loads from internal strings, but this is now a string that came from the outside world, so we'll put a try/except in, and we'll set js to be None, and that'll be our little trigger. Now, if we take a look at the output, they give us this status of OK, and that status can be a problem; it can complain about things. So we have to check to see if we got a good status. At this point, if you look at the outer bit of what we get, it's a curly brace, so it's a dictionary.
Then within that dictionary there is a key results, which is a list, and the second thing in the outer dictionary is status. So we can ask: if we got nothing back, we'll quit; or if we don't have a status key in that object, that dictionary; or if it's not equal to OK. If any of those things is true, we're going to quit, print failure to retrieve, and print the data out. When you're starting to read stuff off the net, you often have to put debugging in like this, to say, oh, something quit, I've got to figure out why. The next thing we're going to do is call json.dumps, which is the opposite of loads. It takes this dictionary that includes arrays, and we're going to pretty-print it with an indent of four, and then print that out. So if you look at my code, you'll see that the first thing we do once we've parsed it is print it back out so we can see it. And then we're going to dig into it.
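The defensive parsing and the pretty-printing described above can be sketched without any network access. The `parse_reply` helper and both sample strings are assumptions written for illustration, and returning None on failure is one possible design rather than what the course code necessarily does; `json.loads`, `json.dumps(..., indent=4)`, and `json.JSONDecodeError` are standard library.

```python
import json

def parse_reply(data):
    """Parse reply text; return the dictionary, or None if anything is off."""
    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None  # our little trigger that parsing failed
    # Three ways to fail: nothing parsed, no status key, status not OK.
    if not js or 'status' not in js or js['status'] != 'OK':
        return None
    return js

# Made-up sample replies standing in for text read from the network.
good = '{"status": "OK", "results": [{"formatted_address": "Ann Arbor, MI, USA"}]}'
bad = '{"status": "ZERO_RESULTS", "results": []}'

js = parse_reply(good)

# dumps is the opposite of loads; indent=4 pretty-prints the structure.
pretty = json.dumps(js, indent=4)
print(pretty)

print(parse_reply(bad))
print(parse_reply('not json at all'))
```

Printing the pretty version first makes it much easier to see which keys and list positions to index when writing the extraction lines.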
So let's go ahead and run this code: python geojson.py. One of these days I will always type python3. And: Ann Arbor, Michigan. Okay, so it ran, and you see that it retrieved this URL; this URL was constructed, and it retrieved 1736 characters. This is the JSON pretty-printed with an indent of four; this is that json.dumps output all the way down to here. And then it starts extracting, pulling things out. Now, when you write this code, it's really easy to look at this and say, oh great, it's easy. I tend to have to print this stuff out over and over as I construct this expression. But if we look at it, the outer dictionary sub results leads to this array. If you look at this array carefully, you find there is only one thing in it. So results is an array, and sub zero gets us this dictionary. I keep wanting to say object, because that's what it's called in JSON. That goes all the way down to here. So that's what we get there, and then within that object we look for geometry. Where is geometry? Right there. Geometry goes from there to there. You've got to get used to it.
That's why it's nice to have this stuff indented. Then we go to location within geometry, and then within location we have lat and lng. So this is pulling out this 42 and 83, and then we print that out. Take a look: that pulls it right out of the JSON. These are tricky to write, but after a while you win and you get it right, and it's just fine. We do the same thing for results sub zero formatted address, which gets us this, and that's how we print the location out. So that's a real quick look at how we would do that with JSON, talking to the Google Maps API. Okay, hope this helps.
Now we're going to talk about API rate limiting and security. The key thing is that the Google API and the Google data are super valuable. You could build a website that did nothing but ask the person for something and then show them that place, and make it be a map searcher, and you would have added so little value while Google did all the hard work. So they protect these. Sometimes they'll say you can only do 50 of these a day, or 500 a day, or whatever; that's called rate limiting. And sometimes they say you've got to log in: you've got to create an account, get a key from us, and then present your key. That means your account only gets so many, and they keep track of who's using their service and how much they're using it. Google even gives you sort of a dashboard that tells you some of this stuff.
It's kind of nice. The other thing is, sometimes an API is free, and then it becomes popular and they decide they're going to put a key on it, or a rate limit on it. So you've got to kind of play this game with them, and the rules kind of change as things progress. So that geocoding API that we're talking about had, at one point in time, 2,500 requests a day. You can get more requests if you get a key.

Now another API that we can talk about is the Twitter API. The Twitter API started out as a free public API, but then Twitter realized that people were making more money off of Twitter's data than Twitter was making off of Twitter's data. And so Twitter makes it so that you have to have an account. You can only request data from their APIs if you use your account key to sign the request. And so there's a whole series of getting and issuing keys, and then using those keys. And I'll just give you a short summary of the kind of code that it takes to build those requests up that have to be signed. So you'll look through the Twitter documentation and it'll say, oh, this URL to get the tweets, et cetera, et cetera. And it says do a GET request to this URL and that URL, and maybe substitute a little bit of things here for the screen name you're looking for, or how many tweets you want. And they tell you how to carefully construct these URLs. And so here's an example bit of code that talks to Twitter. For now I'll ignore the security bit. That's all hidden in this twurl. So it looks a lot like the last one: we're going to use json and urllib. And we have found that this is the API name, blah blah blah, list.json, getting a friend list for a particular person. And so that is the base URL that we're going to use. And we're going to ask the person for a Twitter account; if we hit enter, we're going to break out. And with twurl.augment we're going to say, give me the first five friends of this particular screen name, the one we just read in from input. And this
twurl, you'll see in a second, adds a bunch of stuff to prove that you are who you are. It's signing that URL. So you're sending a signed URL, which is nothing more than a whole bunch of crazy characters. We'll see that in a second. We retrieve it, and this is pretty straightforward. We can just open the URL, read it, and decode it. Decode solves the UTF-8 thing; it makes it so that data is a real string, and it's in Unicode internally. Now we can actually get the headers. Remember I told you earlier that urlopen bypasses the headers, but it stores them for later, and we can say, hey, give me back those headers. And it gives us back a dictionary of headers. The headers, if you go all the way back, are a bunch of key-value pairs, key colon value. And in Twitter, if you read the documentation, there's this x-rate-limit-remaining header. Each time it returns the response to the API call that you made, it says, look, you've got 12 left, you've got 11 left, you've got 10. So you can print that out. So this prints out how many you've got left. Then we parse the JSON data. We're going to print it so we can debug it. This is dumps, dump to string, and then we print it with indent equals four. This is called pretty printing, and it's indenting things really nicely so that you can make more sense of it, whereas when programs are talking to each other, they don't really make the output look particularly pretty. And then we're going to go through. We have the outer thing of users, and for each user in users, we're going to print their screen name. We're going to grab their status text and print that out. And so this is what that data looks like, kind of chopped a bit.
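The steps just described (check the rate-limit header, parse the JSON, pretty print it, loop through the users) can be sketched without touching the network. The headers dictionary and the JSON text below are hypothetical stand-ins for what the real call would return; only the key names `users`, `screen_name`, `status`, `text`, and the `x-rate-limit-remaining` header come from the lecture.

```python
import json

# Hypothetical stand-ins for the headers and body of a real response
headers = {
    "content-type": "application/json",
    "x-rate-limit-remaining": "14",
}
data = '''{"users": [
  {"screen_name": "friend_one", "status": {"text": "First sample status"}},
  {"screen_name": "friend_two", "status": {"text": "Second sample status"}}
]}'''

# How many calls are left, according to the header
print("Remaining", headers["x-rate-limit-remaining"])

# Parse, then pretty print for debugging
js = json.loads(data)
print(json.dumps(js, indent=4))

# Outer dictionary -> "users" list -> one dictionary per user
seen = []
for u in js["users"]:
    print(u["screen_name"])
    s = u["status"]["text"]
    print("  ", s[:50])
    seen.append((u["screen_name"], s[:50]))
```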
So the thing we get is an outer layer. We get users, and then we get a list. And here's the first user. Now if you look at the actual data, it's much larger than this. Here's a second user, and then we have status, text, and the screen name, right? And so those are the bits that we're extracting from that. If you look, we're going to grab the screen name, we're going to grab the status text, and away you go. So you can start with this, but you realize that once you're looking at this and you're printing this out with pretty printing, you can sort of work your way in, knowing that it's either a dictionary or a list. If it's a dictionary, you look up the key. If it's a list, you say which position it is. And then you get more dictionaries within dictionaries within dictionaries, and away you go. And so this code, when it runs, prints out the screen name and then that status, and then the next person. So in that case it's my first five friends and their most recent status.

Now let's talk a little bit about how this security works. You have to go to the website, and you have to have a Twitter account. You can't talk to the Twitter API without a Twitter account. And then you go to this website and you set up a key. You say, I'm going to build an application that is going to consume the Twitter API. And then you go in and you have to work through it; there's documentation on how all this stuff works. You set up an API key. You set up the application. So I made a key called Python on my laptop. And it gives us some values: a consumer key, a consumer secret, a token key, and a token secret. And you get to regenerate these. And there's this file called hidden.py, and you edit it and copy and paste all the stuff from those pages, those four values, into these strings. Now if you download my code, I don't have my keys in there.
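As a rough sketch of what a file like hidden.py could look like, here is one function that returns the four values in a dictionary. The function name, the dictionary key names, and the placeholder strings are all assumptions for illustration; the lecture only says that the file holds a consumer key, a consumer secret, a token key, and a token secret as strings you paste in.

```python
# hidden.py (sketch): keep the four values in one place, away from the
# main script. Every value below is a placeholder, not a real key.

def oauth():
    # Hypothetical layout: a dictionary holding the four pasted strings
    return {
        "consumer_key": "PASTE-CONSUMER-KEY-HERE",
        "consumer_secret": "PASTE-CONSUMER-SECRET-HERE",
        "token_key": "PASTE-TOKEN-KEY-HERE",
        "token_secret": "PASTE-TOKEN-SECRET-HERE",
    }

secrets = oauth()
# Print only the names, never the values
print(sorted(secrets))
```

Keeping these in a separate file is what lets the sample code be shared without the real keys in it.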
I've got some placeholders for this stuff. So you've got to get to this web page that's on Twitter and copy these things in, and then the twurl code will start to work. It uses a technology called OAuth, which is a way to sign a URL in a way that proves that you have the key and the secret and the tokens, and it can't be modified in the middle. So once you send this URL, they can check the key and the secret to make sure that you truly signed it, without actually sending the key and the secret. That's actually kind of cool and fascinating, but we won't go into it in great detail here. And so if you look at the code in twurl.py, this is the code that does it. It actually pulls in an oauth library and that hidden.py, that is, that code that you've got. And it's got the consumer key and the consumer secret; this is pulling that from hidden.py. This is a lot of stuff that's using this oauth library. Don't worry too much about that. Eventually it produces a URL that looks like this. What happens is, this was the base URL you were told to use. Then you have count equals two and screen_name equals drchuck; those parts are your parameters to that web service call. And then all this OAuth stuff is produced by this oauth code and the consumer key and the secret. What happens is the key gets sent, and the secret does not get sent, but they send this signature, which is based on the secret. The signature is a long string. And then what it does is recheck the signature on the far end by regenerating the signature, because the secret is available both to you, to generate the signature, and to them, to check the signature.
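The idea of sending a signature instead of the secret can be sketched with a keyed hash from the standard library. This is a simplified illustration under stated assumptions, not the real OAuth signing procedure (which builds a specific base string and also includes a nonce and a timestamp); the function names, URL, and secret are hypothetical.

```python
import hashlib
import hmac

def sign(url, secret):
    # Keyed hash of the URL: only someone holding the secret can produce it
    return hmac.new(secret.encode(), url.encode(), hashlib.sha1).hexdigest()

def verify(url, signature, secret):
    # The far end regenerates the signature with its own copy of the secret
    expected = sign(url, secret)
    return hmac.compare_digest(expected, signature)

shared_secret = "not-a-real-secret"   # known to both sides, never sent
url = "https://api.example.com/friends/list.json?count=2&screen_name=drchuck"

signature = sign(url, shared_secret)
ok = verify(url, signature, shared_secret)

# Changing the URL in the middle makes the check fail
tampered = verify(url.replace("count=2", "count=200"), signature, shared_secret)

print(signature)
print(ok, tampered)
```

Only the URL and the signature travel over the wire, which is why seeing a signed URL go by does not reveal the secret.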
So it's kind of like a hash, et cetera, et cetera. You don't have to worry about all this. These URLs get really long. And the values that you need are in the URL, and you call this routine that's called augment, which takes a URL and then parameters, and then augments it by adding all this OAuth stuff. And so that's why it's called augment: to augment the URL. And once you've got this set up and hidden working, then you sort of just augment the URL and then hit it. Now, if you don't have the right keys or secrets, or you don't have an account on Twitter, then it's going to blow up. But if you get it set up, you will be able to talk to the Twitter API with this.

So, this whole web services section: we've done quite a bit of stuff, right? We've looked at how, instead of reading HTML or flat text, we are creating structured data according to contracts. Whether it be XML or JSON, we can retrieve and parse that information in a deterministic way. We talked about schemas that define the contract, so that if the data you're getting is wrong, you can know who to blame because the schema gets violated. And we've played with APIs where you're talking to someone else who's defining what the rules are, and how to read their documentation. And even if they have an API key or need signed URLs, we showed a little bit about how to do that.

We're playing through some sample code, and you can get this by downloading it.
I've got this whole thing downloaded, and I've got all the files here, and these are the files we're going to play with today. Today what we're going to do is talk about the Twitter API. And the one thing we've got to learn about the Twitter API is that we have to authorize ourselves. So we have to make sure that we have a Twitter account, and then we get some keys. And so in this particular application, if you want to duplicate what I'm doing, you have to go to apps.twitter.com, click this Create New Application button, and then get some codes. Okay? And the codes show up as soon as you hit this button and then one more button, which I'm not going to do on screen. And so what happens is there are four codes that you've got to put in this file, hidden.py: the consumer key, the consumer secret, the token key, and the token secret. These are just messed up. So I'll show you how this works and blows up first, and then I'll put my keys in here without showing you. But basically this is a little file you've got to edit, or these Twitter ones don't work. You'll see what happens. So the first one I'm going to do is the simplest one of all. I call this thing Twitter test, and it's just going to go ask for the user timeline. We can take a look at this. We're going to take the URL, and we're going to augment the URL. This is the base; we found this by looking at the Twitter API documentation. We're going to pass a parameter of screen_name drchuck and a count of two. So this is just a Python dictionary. And augment comes from this little bit of code called twurl, and this uses a bit of code called oauth, which is built into Python as well. Right. Yeah, that's built into Python as well. And it augments the URL. It takes the key, the secret, and the token key, and does a thing and signs it, and then makes this big long ugly URL, which you will soon see. It's a signature of the URL.
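Before any signing happens, the parameters in that Python dictionary simply get encoded onto the end of the base URL. A minimal sketch of just that step, using the standard library, is below. The base URL here is a hypothetical placeholder, not the real endpoint, and this does not add any of the OAuth parameters that augment would add.

```python
from urllib.parse import urlencode

# Hypothetical base URL standing in for the one from the API documentation
base = "https://api.example.com/statuses/user_timeline.json"

# The parameters are just a Python dictionary
params = {"screen_name": "drchuck", "count": "2"}

# urlencode turns the dictionary into key=value pairs joined by &
url = base + "?" + urlencode(params)
print(url)
```

The signed version is this same URL with a long run of extra oauth parameters appended.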
So we pass this data back and forth to Twitter with a signature, and then they recheck the signature. It's a digital signature that shows that this URL came from a program that knows the key, secret, token, and token secret. And so twurl.augment is something I wrote to make it easier to add all these OAuth parameters, and you feed this code by putting your data into hidden.py. Lots of people get this to work, so don't worry. It's kind of cool when you finally get it to work. So let's take a look at what it does, just to know that this makes an awesome URL that does all the security, and we'll see one of those URLs. So ignore the certificate errors. This has to do with the fact that we're using HTTPS, and Python doesn't have enough certificates put into it by default, for a lot of reasons, but our quick and dirty way is to turn them off. Thank you, Python, for reducing security by teaching us that this is the best way to do it. That's a grumpy moment on my part. So what we're going to do is a urlopen. This bit here is to shut off the security checking for the SSL certificate. And then we're going to read all the data, and we're just going to print it out. And we're also going to ask the connection for something. Remember I told you a long time ago that urllib eats the headers, but you can get them back? Now we're going to ask to get a dictionary of the headers back, and so we'll print those out. Okay, so this is really just testing the body and the headers and printing them out in as raw a way as we can. So let's go run this. Now, this is going to fail the first time we do it, because we haven't put the hidden variables in there.
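Here is a sketch of the two pieces just described: a context that shuts off SSL certificate checking, and a helper that reads the body and gets the headers back as a dictionary. The helper name `fetch` is hypothetical, and it is defined but not called here so the sketch runs without a network connection. Turning off verification is the quick and dirty workaround from the lecture, not something to rely on in real applications.

```python
import ssl
import urllib.request

# Build a context that ignores SSL certificate errors.
# check_hostname has to be turned off before verify_mode is relaxed.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def fetch(url):
    # Hypothetical helper: returns the raw body (bytes) and the headers
    connection = urllib.request.urlopen(url, context=ctx)
    data = connection.read()
    headers = dict(connection.getheaders())
    return data, headers

print(ctx.verify_mode)
```

With a real signed URL, `data, headers = fetch(url)` would give the raw bytes and the header dictionary that the test program prints.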
So if I say python3 twtest.py, it's going to run and blow up, and it's going to give you this 401 Authorization Required. That's a good sign, because that means that you haven't yet updated your values in hidden.py. And so this is the URL. This is that augmented URL. And you can see the consumer key and the consumer secret and the oauth token and whatever. Okay, so these tokens are wrong. These aren't... oops, Control-C. They aren't real. But you'll notice it doesn't show the secret and the token secret, and that's all actually encoded in this signature. It turns out that you need to have the key and the token, I mean the secret and the token secret, to generate the signature. And where is the signature? Oh, there's the signature, right? There's the signature. And so this signature is combined with the nonce, and the signature has a time and includes all kinds of things. So even if you type this in... well, you'll see these go by, and it's not really breaking my security too much when you see these afterwards. So don't get all excited and say, oh, you revealed your token and your key. Well, I can reveal my token and key, but I'm not going to reveal the secret. Okay, so this adds all this OAuth stuff: oauth_nonce, oauth_timestamp. And these timestamps and nonces make it so that you can't replay my URL. Even if you see the exact URL, once I hit it, you can't hit it again, and so that's what the nonce does. So I'm going to close hidden.py here, and I'm going to update hidden.py in another window. Okay, so in another window I updated hidden.py. I'm not going to show you that. But now I'm going to run python twtest.py, so twurl is going to read hidden, and now these keys and secrets are my real ones that I haven't shown you. So this should work. Fingers crossed. Yay, it worked. Okay, so it worked. So I'm calling Twitter.
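To make the replay idea concrete, here is a small sketch of how a server could use a nonce and a timestamp to refuse a URL it has already seen or one that is too old. This is an illustration of the general technique, not Twitter's actual implementation; the function name, the nonce values, and the five-minute window are assumptions.

```python
# Assumed freshness window, in seconds (illustrative only)
MAX_AGE_SECONDS = 300

# Nonces the server has already accepted
seen_nonces = set()

def accept(nonce, timestamp, now):
    # Reject requests whose timestamp is too far from the current time
    if abs(now - timestamp) > MAX_AGE_SECONDS:
        return False
    # Reject a nonce that has been used before (a replay)
    if nonce in seen_nonces:
        return False
    seen_nonces.add(nonce)
    return True

first = accept("a1b2c3", timestamp=1000, now=1010)    # fresh and unseen
replay = accept("a1b2c3", timestamp=1000, now=1020)   # same nonce again
stale = accept("z9y8x7", timestamp=1000, now=2000)    # new nonce, too old

print(first, replay, stale)
```

Because the nonce and timestamp are covered by the signature, someone who copies the URL cannot change them without the secret.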
Here's the URL. Now don't worry, the token and the consumer key are not enough to break into my account, and neither is the signature, because in about five minutes you can't replay this anymore. Okay, so you can't generate the signature. I've done this one signature. The signature includes the time and date, so you can't... trust me, go read up on OAuth. Don't worry, I haven't really revealed anything. So the first thing we see is this, and we should put like a line of dashes here. This is the JSON. It ain't very pretty. It's not very pretty. Okay, and so that's the JSON from there to there. It's just what most APIs give us back. It's really dense JSON, right? And this is a byte array. Remember how you have to do a .decode()? I didn't do a .decode() here. And so Python is telling us this is a byte array, which is a raw set of bytes that came from the internet, which probably are UTF-8. If I put a decode here, if I say data.decode() there, then it would be fine, but we don't care. This was just a dump: do we get anything? And so then, okay, let's do this print. I'll just make this code different. Put some equal signs here, a lot of equal signs, so we can easily see where the thing starts and stops. So we'll run that again. If you look at those URLs... so that was all of that stuff, and then this is the headers. And the headers again are not pretty. You get the headers. It's a dictionary. You've got cache-control, no-cache, comma... this is the string key, value. You've got to find your commas: key, value. But the one that's really interesting here is, which one is it? x-rate-limit-remaining, right there. x-rate-limit-remaining. So for this particular API, this header tells me that I've got 898 calls left, and this is when I will get more calls. Yeah, so let's see. Yeah.
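The difference between the raw byte array and a decoded string, and the lookup of the rate-limit header, can be sketched like this. The bytes and the header values are hypothetical stand-ins for a real response.

```python
# Hypothetical raw body, as it would come back from read(): bytes, not str
raw = b'[{"text": "caf\xc3\xa9 time"}]'
print(raw)          # the b'...' prefix shows this is a byte array

# decode() turns the bytes into a real string (UTF-8 is the default)
text = raw.decode()
print(text)

# Hypothetical headers dictionary with the interesting key
headers = {
    "cache-control": "no-cache, no-store",
    "x-rate-limit-remaining": "898",
}

# Header values are strings, so convert before doing arithmetic
remaining = int(headers["x-rate-limit-remaining"])
print("Calls left:", remaining)
```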
So watch, I want to do this again, and you will see that I can only do this 897 more times now. Run it. I can only do this 897. So I am being tracked at this point. I am being tracked by Twitter. Twitter knows that it's drchuck that's doing this, and drchuck has done 900, he's done 899, 897, and if I keep running this, eventually Twitter will tell me, you've got to wait for a while. And that's because Twitter doesn't want me, under my drchuck account, pulling lots and lots of stuff out of Twitter and making my own website. I do actually have my own Twitter website using some cool software, dr-chuck.com slash twitter, and this I have to run, and it rate limits and causes all kinds of, you know, whatever. Okay, so, rate limit. So I'll save that. So that's twtest. This is just to test it, okay? Because I want to do something interesting. So we're not parsing the JSON that comes back. We're not doing anything tricky with this. And away we go. So let's take a look at some more code. I think I don't need this anymore. So now I am going to parse this. Most of this looks the same. I've got that same user timeline JSON. I'm going to ignore the SSL certificates. I'm going to write a loop. I'm going to get a Twitter account, and quit if it's a blank line, if I hit enter. I'm going to use the twurl augment the same way; that's going to do all the signing using hidden.py. I'm going to retrieve it, ignoring the SSL errors. And then I'm going to decode. This time I'm going to decode it so that I get a real Unicode string, and I'm going to print the first 250 characters of it. I'm going to grab the headers, and I'm going to print the remaining rate limit. So this is sort of a very simple version of the same thing. It really is decoding the data and only printing the first 250 characters. So let's run that: drchuck. Boom, and it's got 896.
So that's just a little simpler version, with a little less brutal debugging. Okay, so now let's do something even more fun. Let's go to twitter2.py and tear it apart. So again, we're going to look at my friends list, or someone else's, anybody's friends list. We're going to ask for the friends, ask for the screen name, ask for the first five friends, and then look at their statuses. Open it, decode it, get the headers, print the rate limit remaining: all this stuff is the same as in twitter1. But now we're going to parse the JSON. I'm not even putting this in a try and except because, hey, I'm talking to Twitter. I'm going to guess that Twitter is going to give me the right stuff. You'd probably want to put a try and except here. Then I'm going to do a debug print. I'm going to do a JSON pretty print. Let's make that be two, so it looks a little better. And then, well, I'm going to run it, and then you're going to see how we have to parse this, and we're going to see that it's a list. So we're done with that, and now we're running twitter2.py. So I'm going to go to drchuck, and this is going to ask the question, who are drchuck's friends? Okay, let's go to the top. So it hit this API, and it has our screen_name drchuck, count equals five, and all this OAuth stuff. Again, this is not a security breach by showing you all of this, because the secrets aren't there. Okay, so if we look at it, it's an outer object, or dictionary, and then the outer has a users, which is a list. And then each user has some stuff in it. So this one's Stephanie Teesley. It's got her screen name. It's got some descriptions. Keep on going. It's got her status, her latest status. For my friend: her status, her source, where she's at. I don't know, man. She's got a lot of stuff here. Okay, there we go. That was the first one. Okay, and then the next one that I'm following is live edu, et cetera. And so you'll see that this is an array.
So that outer thing is an array of users. Now js here is a dictionary, so I can say for u in js sub users. Well, js sub users is a list, so the first u is going to be this Stephanie Teesley u, and the second u is going to be live edu. So that's all it took to get through all that stuff and figure that out. And then I'm going to say, get me the screen name of my person. So let's go in here. So that's going to pull Steph Teesley's screen name out. Then I'm going to go find her status. Let's find it somewhere in here: u sub status sub text. Come on. Okay, there's status. Status is all this stuff, more, more, more, more, right there. That's status. That's u sub status, and then u sub status sub text is this stuff. So it's going to extract this bit right here. Okay, and so I use status text, and I print out the first 50 characters of the status. And I do that for the first five, because I told it I only wanted five. And then of course I get to see the rate limit. So let's go down to the bottom. So all of this is the debug print of the JSON I got back. Here is the program starting to print. Here is the screen name of my first friend, and here's the first 50 characters of her most recent status. Here is the screen name of my... and these are in reverse order of who I've been following. So I've been playing with this live coding stuff, so I'm following them. What? KeyError: status. That didn't work. Oh, that's because live coding tv somehow doesn't have a status. So most of these work. So now you'll get to see me fix something, and when you download it, it'll be fixed. And so it says KeyError: status. So that means that I've got to do a thing that says: if status not in u, print no status found, continue. Yeah, since sometimes there's no status. Who would have thought? I did not know that. Okay, so let's run this again. Did I even get to see my remaining?
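The fix described here, checking for the key before using it, can be sketched with a small hypothetical data set in which the second user has no `status` entry. The screen names and text are made up; the key names and the 50-character slice follow the walkthrough.

```python
# Hypothetical parsed JSON: the second user is missing "status"
js = {
    "users": [
        {"screen_name": "friend_one", "status": {"text": "x" * 80}},
        {"screen_name": "friend_two"},
    ]
}

printed = []
for u in js["users"]:
    print(u["screen_name"])
    # Without this check, u["status"] raises KeyError for friend_two
    if "status" not in u:
        print("   * No status found")
        printed.append((u["screen_name"], None))
        continue
    s = u["status"]["text"]
    print("  ", s[:50])        # only the first 50 characters
    printed.append((u["screen_name"], s[:50]))
```

Wrapping the lookup in a try and except for `KeyError` would be another way to handle the same situation.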
Oh, actually, let me change the order of this. Let me put this down here. That'll be different from the slides, but it'll be prettier. Let's put the headers after the dump of the data. Okay, so let's run it again. Did I save it? Yeah. drchuck. Blah, whole bunch of stuff. So I've got 13 remaining calls on this one, so it's not the same as the other one. I don't get to call this too many more times, so hopefully I'll get the debugging to work. I've got a bad space here. No status found... I need to put three spaces there. No status found. I'm making an asterisk. So let's run it again. And see, I've got 13 remaining. So it's important you write code that's aware of your remaining. That's why I made it so obvious. Go retrieve all that. I've got 12 remaining. But my code starts to look... oh dang it, another space here. Hang on, got to fix that. I need yet another space. Hopefully I can make this as pretty as I want it. I didn't even do drchuck. I did that wrong. Typed my name wrong. Okay. So now it works. Oh, well. So now my five most recent friends are Steph Teesley, live edu official, live coding tv, Nancy Gilby, and Greggy Cruder. And so there are their statuses. And I tore all this JSON apart using twitter2.py, of course after fixing hidden.py, which I'm not going to show you because it actually contains my real consumer key and consumer secret. You're seeing the consumer key and the token key go by on each of these URLs, but what you're not seeing is these two things, which are the thing I'm protecting, so that it's not a problem. Okay, so I will send that up. But there you go. I hope you found this useful. The code will be fixed when you take a look at it and download it here from samplecode.zip.

Hello, and welcome to Python objects. I'm Charles Severance, and we're well on our way to get through all this material in the Python.
So this lecture is in a weird place. I even debated where to put it in the book. I don't really want to teach you how to write a lot of object-oriented programming, but we're going to start using objects, and I want to be able to use the terminology. And so as much as anything, this lecture is about terminology and understanding the words, things like methods and method signatures and variables and inheritance. So think of this as a terminology lecture rather than a learn how to program or learn how to use this lecture. It's not something you're going to figure out right away. And there'll come a time when you as a programmer really want to start using object-oriented programming. It's really a powerful and wonderful technique. But I think it's too early, as a beginning programmer, to really say, oh, let's write a bunch of objects. So just relax and enjoy and learn this material, and think of it as sort of a theoretical thing rather than a how to program thing. Part of this is that we're going to start reading documentation on how to use all these libraries, et cetera, and we're going to see the word objects, right? And then we're going to start hearing it, and I want you to be able to read the Python documentation so that you understand what's going on. So the word objects should make sense to you even though you're not going to write a lot of object-oriented programming. And so page upon page upon page of database stuff, which we're going to talk about soon, uses objects all over the place, and BeautifulSoup uses objects. We've kind of been using them, and I've been waving my hands, and I use the word method without defining it. But now it's really time to define it and go to it.
So I want to review from the very beginning what we think of as a program. The classic program, my favorite little minimum program, is our little elevator floor converter, which converts from European elevator floors to United States elevator floors. And the key to this is that it's input, processing, and output, and this is a good way to model any program. And in that process we've got variables, and we've got logic, we've got algorithms, we've got loops that we write. We've got all kinds of things, and we construct a series of steps to achieve some goal. In object-oriented, and frankly you've been using object-oriented all along, the program has lots of objects, and we're sort of putting stuff into these objects, taking stuff out of one object and putting it into another object. And you've actually been doing this all along. As soon as you're looking at dictionaries and lists, you're doing objects. And so an object is quite a little thing. It's sort of its own little space inside of a program that contains code and data. And so all these objects are now working together. It's a bit of self-contained code and data. And it is one way to take a very complex problem and make it easier, by breaking it into separate things that can be engineered and developed separately. So you've been using string objects, or maybe you've used BeautifulSoup or something. These are powerful capabilities, and you don't have to look at all of them. It's just, hey, here's a thing, use this object. It'll do these things for you, and there's lots of details inside of it. Just don't look at it.
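As a small illustration of the point that strings, dictionaries, and lists have been objects all along, here is a sketch that uses a few of their built-in methods. The variable names and values are made up for the example.

```python
# A string object carries its data (the characters) and code (its methods)
greeting = "Hello world   "
print(type(greeting))
cleaned = greeting.rstrip().upper()   # strip right whitespace, then uppercase
print(cleaned)

# A dictionary object: we put data in and ask it questions through methods
counts = dict()
counts["chuck"] = counts.get("chuck", 0) + 1
print(counts)

# A list object: append is code that lives inside the list
stuff = list()
stuff.append("book")
print(stuff)
```

In each case we use the object through its methods without looking at how it works inside.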
Don't worry about it. And so there are boundaries: things that you can use, things that you can look at, and things that really you don't bother looking at. You go read the documentation and use it, and away it goes. But then someone had to write that, and so they built an object. So we're going to look a little bit under the covers of what it takes to build some of these objects. And so if we think of this program that originally just sort of did processing, we can think of it as having some kind of an input coming into our program, and we have a string object, a dictionary object, maybe eventually some objects like a database object or an object that we eventually define. And you can think of it as, we're receiving data. It comes in an object, which is a string object, or we start putting the strings in dictionaries and do whatever. We pull out a list of them, and so you can think of data as moving between these objects. And like I say, even strings: in the first week, first lecture, first everything, we were using objects, and we've been using them all along. And so you can think of every string and every dictionary as a little program all by itself that has a bit of code and a bit of data. And so a string has the data, which includes all the characters that make up the string, but then there is a method called upper that does uppercase, or rstrip that strips off the whitespace from the right. And so it's like they're almost little programs that have inputs and outputs themselves. And we can make lots of them, and there's lots of cooperating objects that make up an application. And one of the nice things about the object-oriented pattern is that they form boundaries. If you're inside the object, you can say, look, I'm going to build you a string object or a database object or a BeautifulSoup object. I'm going to build this capability, and I'm going to give it to you in the form of an interface. And I'm not really going to
care how you use it. And so we have this sort of visibility wall where I'm going to make an object and I'm going to let you use it, and the maker of the object doesn't necessarily have to know every single thing about the use of that object. So just like inside the object they don't have to worry about what you're doing with the object outside of it, when you're outside the object you don't have to worry about what's going on inside of it. We as the user of the object talk to its interface, and we get things from it and give things to it and use functionality within that object, but we don't have to look inside of it. We can just say, oh, it's a nice little magical thing. We read the documentation. We read a web page and it told us to do this, this, and this, and away you go. And so it is sort of this isolation boundary that works both for the programmer who's writing the object and the programmer who's using the object. It's a very nice pattern. And so you'll see how we're going to build code and group it together, and then we're going to be using it sort of as a big blob of stuff.

So, some definitions in this space, words that I want you to understand. When we're going to create one of these things, one of these objects, instances, that has some data in it and some code in it, we have to be able to define the shape of this object: what code will each object have in it, and what data will each object have in it. And that's called a class. The key to a class, and to this little picture that I've got up here in all these slides, is that the class is a template. It's not the thing itself.
So it's a cookie cutter. It knows a lot about how cookies are made, and if you have cookie dough and you hit the thing, then you make as many cookies as you want. And so this nice little cookie picture is a great mental model of how it works. The class is the template, and then the objects are all of the cookies that are made from that template. But the template defines the shape and the nature of each of the objects. So the code we write is the class code, and then later we say, oh, let's take that template and make ourselves an object, or an instance. Now as we're defining a class, we have two basic things that we put in the class, and there's a couple of different terminologies for this. One is method, which is code. It's like a function that lives inside of a class, not a function that was inside your program, but one that lives inside of a class. And so this is a scoping thing: a method is really just a function, but it lives inside the class. And then fields or attributes are data items that are in the class. And so they are variables that are defined in the class. You can define variables outside the class that you use in your program. You've been doing that all along. But if you're saying, I'm going to build this capability, and it's going to have data inside of it and code inside of it, the code is the method, or message, and the data is the field, or attribute. There are just two different sets of terminology. Method is what I'll probably use. If you look in some object-oriented patterns, like Smalltalk or Apple, they often call these messages. So you can either access a method inside of a class or an object, or you can send a message to the object. The same is true for field and attribute.
There's just a chunk of data in the object that you may or may not have the right to access. So, like I said, a class is a template. It defines the characteristics of the objects that we're going to make from it. It is the cookie cutter. So Dog is sort of the exemplar, and Lassie is a particular dog. A dog has fur, a dog barks, and dogs do all these things, and so we know something about dogs, but it doesn't mean we have a dog, right? The class is a more abstract concept, so that when it's time to get a dog, we know certain things about dogs. Instances, or objects, come once we say, oh, time to make a cookie from the template, time to get a dog. That's the creation of an object, and we call them instances: an instance of a class. So the class is a thing that doesn't really exist on its own, but we say, make me a new object using this class as its template. Oh, and make me another one. And so we can have many, many objects from one class, just like many cookies from one cookie cutter. A method is a bit of code that lives inside of an object. It's like a function, but it's scoped to within the object or within the class. Okay, so that kind of gets us started on some of the terminology, and we'll come back and take a look at how we write code that's object-oriented. Okay, so now that we've gotten through the definitions, let's work through some sample code. But hey, look at this: we've got ourselves a cookie cutter and some cookies. So remember that a class is a template. It's not the actual thing. An object is an instance of a class, so you have to take the class and do something to make the object. And actually, you can see here some other classes. There's clearly sort of a snowflake class and a gingerbread man class. That's an object, object, object, and somewhere out there is a snowflake class and a gingerbread class. But here we've got a snowman object, and a snowman object, and a snowman class. So: class is the template, object is the instance.
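The dog analogy can be sketched in a few lines. This is an illustrative example, not code from the lecture: the Dog class, its has_fur attribute, its bark method, and the second dog named rex are all assumptions.

```python
class Dog:
    # Things we know about every dog, even before we have one.
    has_fur = True

    def bark(self):
        return "Woof"


# Defining the class did not create any dogs. Constructing does.
lassie = Dog()
rex = Dog()  # hypothetical second dog, made from the same template

print(type(lassie))   # both objects are of type Dog
print(lassie is rex)  # two separate objects from one class
print(lassie.bark(), rex.has_fur)
```

One cookie cutter, two cookies: lassie and rex share the same template but are distinct objects.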
So here's a bit of Python code. Let's take a look at what we've got here. class is a new reserved word, kind of like def. Then we have the name of the class, which is a name that we choose; that's the name by which we'll refer to this class for the rest of this program. It has a colon at the end of it, which means it starts an indented block, which ends when we de-indent. Inside the class there are generally two things. There is some data, and this just looks like an assignment statement in the class: x = 0. And then there is a def. This looks just like a function: it starts with def, has a colon, indents, and the function finishes right there. The difference is that this is a method, because it lives inside of a class. And so there is no function called party; there's a function called party within the PartyAnimal class. We'll talk in a second about this self thing. It is the way that, inside this code, we refer back to that variable. So this is not actually executing any code. It's sort of remembering the template, defining the class PartyAnimal. This next line is what we call constructing. Using the PartyAnimal template, or class, we are making a PartyAnimal, and once we make that, we stick it in the variable an. Then we're going to call this party method three times: one, two, three. Now, this self thing: self ends up being an alias of an. You can look at this syntax, an.party(), as just kind of an equivalent of this syntax, PartyAnimal.party(an). It's calling the party method within the PartyAnimal class and passing the instance in as the first parameter. And so self ends up being an alias of an each time these are called. Now, if we make a different variable and a second object, which we will eventually, you will see that that works a little bit differently.
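Reconstructed from the narration, the code being described looks roughly like the following. The exact wording of the printed message is an assumption based on what is read out loud, and the final line is added here only to show the long form of the call.

```python
class PartyAnimal:
    x = 0  # data: an attribute that starts at zero

    def party(self):
        # self refers to whichever object the method was called on
        self.x = self.x + 1
        print("So far", self.x)


an = PartyAnimal()  # construct an object from the template

an.party()
an.party()
an.party()

# The long form of the same call: look the method up in the class
# and pass the instance in explicitly as the first parameter.
PartyAnimal.party(an)
```

Because the long form is a fourth call on the same object, it bumps the count one more time after the three short-form calls.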
And so this syntax is a short version of that syntax. So if we watch how this executes, it starts up here. It just defines the class, and then we construct an object. We know how to construct it because we look at the class: we make a variable x, we make some code party, and we construct that. That's what PartyAnimal() does. Then we assign that into an, and so an is now pointing at that object. Then, when we call the party method, that basically takes this an and passes it in as the first parameter, which is used as self. And so self.x, which is what we're using in this line right here, is the variable x. x starts out as zero because when the object was constructed, it was set to zero. So we're in here, self is an alias of an, and now it looks up self.x, which is zero, adds one to it, and so this becomes one. Then we print "So far 1", the code returns, and it goes down and does it again. x becomes two, it prints out "So far 2", and it comes back down and calls it the last time: self.x is two, add one to it, and stick it back in. So this becomes three, we print out three, and then the program finishes. You can think of this as constructing the object and then associating it with this an variable. Now that we've created this object, we can play around with things. We've played around before with dir and type; we use dir and type to kind of inspect variables and types and objects. So we've been using objects all along. This code here says, hey, make me an empty list. Well, it turns out that what we're saying is that there is already a list class inside of Python, and we're constructing an empty list. When we get back this empty list, we're assigning that into x, so x in a sense contains, or points to, an empty list. So then we say, hey, what kind of thing is x? Well, it's a list. It's a list type.
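A short sketch of that inspection, using only built-in Python. The variable name public is an assumption introduced here to hold the filtered result.

```python
x = list()      # construct an empty list from the built-in list class
print(type(x))  # reports the list type

# dir() lists the capabilities; filter out the double-underscore internals
public = [name for name in dir(x) if not name.startswith("__")]
print(public)   # includes append and sort, among others
```

The double-underscore names that were filtered out are still there in dir(x); they are the internal capabilities that implement things like the bracket operator.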
Lists have a list of things in them, and you use append and all the things we've been doing before. They're just objects. And then there's dir, if you remember dir: dir shows the capabilities. There are all these internal capabilities that do things like implement the bracket operator, etc. Those double-underscore ones we can ignore, although you can look them up and figure out what they mean if you feel like it. But the methods that we tend to call are in this class. So with things like x.sort(), I've always told you that is the sort method within the x thing, and the dot operator is the operator that we use to look something up within an object. And so you've been using this syntax all along: x.sort(), dictionary.items(), all of those are methods within the corresponding class. Take a look at this line of code that we've been writing for a very long time, which says, oh, stick 'Hello there' into y. If I reword that in more OO, or object-oriented, terms, what this single quote says is: make me a string object and put some text in it, and then when that is done being constructed, stick that into y. And so y now points to a string object that's been pre-initialized to the string 'Hello there'. Now, that's a long way of saying 'Hello there' ends up in y, but in OO terms we can talk about it that way. If we do a dir of that, we see a whole bunch of internal methods, which have double underscores, and then we see all kinds of methods that we've been using. We've been using methods like upper, methods like find, methods like rstrip. So when we write something like y.rstrip(), again, that's a method, y is an object, not a class, and the dot is the object lookup operator. Now let's do the same thing to a class that we've built. So now we have a PartyAnimal class. Remember, this up to here is just definition. Now we construct it and we store it in an. So an is a variable that contains an
object of type PartyAnimal. We ask it what type it is, and it prints out that this is a class, and it's __main__.PartyAnimal. The __main__ part is the scope it was defined in, but you can see that you have made a new type. You built a type by using this class keyword. And then we use dir; remember, dir looks for capabilities. Again, you'll see a whole bunch of underscore things. They have meaning, and you can look them up. But eventually you'll see the two things that you've put in it: one is the method party, and the other is the attribute, or field, x. These are the things for which you can say an.x or an.party(), because this dot is the object lookup operator. It says: look up in the object an the thing x, or look up in the object an the thing party. Okay, so up next we'll talk a little bit about how objects are created and destroyed. We also call that the object life cycle. Now I'm going to talk a little bit about the object life cycle, and what we mean by object life cycle is the act of creating and destroying these objects. I've been using this term constructor already. When we create a variable, whether it's a string or a dictionary or a PartyAnimal, we create it and then it's discarded, and there's all this dynamic memory that comes and goes. We, as the writers of objects, have the ability to insert ourselves at the moment of object creation and at the moment of object destruction. We make special functions that we call the constructor, the object constructor or the class constructor, and the destructor. And we don't actually explicitly call them.
They're called automatically by Python on our behalf. The constructor is the more commonly used one. It's used to set up any initial values of variables if necessary, etc. Destructors we'll cover, but they're used very rarely. So here's a bit of code that we've got. It's our PartyAnimal, and a lot of it is the same as what we've been doing so far. We have this variable x, and the constructor has a special name: __init__, with two underscores on each side. Again, we pass in the instance of the object, self, and in this one all we're going to do is print out that we're constructed. Here's the party code that we've had before, and now we have __del__. We pass in self, and we'll just print out that we're being destructed and what the current value of x is for that particular instance. So let's go ahead and run this. Again, this doesn't really run any code up to here; that just defines PartyAnimal. But this line is the constructing of it. It really kind of creates these variables, and then it also runs the constructor. So in this case, this line right here is causing the "I am constructed" message to come out. Then we do an.party(), an.party(), and that says, you know, one and two. And here's an interesting thing: we're actually going to destroy this object by throwing it away. an no longer points at that object; an is going to point to 42. So we're going to sort of overwrite an and put 42 in it, and at that point Python's like, oh, this whole little object that I just created somewhere.
It's out here, and Python is vaporizing it and throwing it away. And so before this line completes, it actually calls our destructor on our behalf, and that message comes out. So we are allowed, as the builder of these objects, to add these little chunks of code that say: I want to be involved at the moment this object is created, and I want to be involved at the moment this object is destroyed. Now, in this last line, an is no longer a PartyAnimal; an is now an integer. It's got a 42 in it. The object is gone: it was created, it was used, and then it was destroyed. Okay, so you've got to be careful: if you overwrite something, you can sort of throw the object away. So the constructor is a special block of code that's called when the object is created, to set the object up. We can create lots of instances. Everything we've done so far is: we make a class, and then we create one instance, one object. Each of these objects ends up being stored in its own variable. We have a variable an, and we've been using it. But the more interesting thing begins to happen when we have multiple instances of the same class sitting in different variables, and each has its own copy of the instance variables. So let's take a look at this. In this code here, I've taken out the destructor, and it shows a little bit more information.
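The constructor and destructor walkthrough above can be sketched as follows. The message wording is an assumption based on the narration. The destructor running at the moment an is overwritten is how the standard CPython interpreter behaves, because it reclaims an object as soon as nothing refers to it; other Python implementations may call __del__ later.

```python
class PartyAnimal:
    x = 0

    def __init__(self):
        # Called automatically when the object is constructed.
        print("I am constructed")

    def party(self):
        self.x = self.x + 1
        print("So far", self.x)

    def __del__(self):
        # Called automatically when the object is being discarded.
        print("I am destructed", self.x)


an = PartyAnimal()  # runs __init__ on our behalf
an.party()
an.party()
an = 42             # nothing refers to the PartyAnimal object any more
print("an contains", an)
```

Neither __init__ nor __del__ is called by name anywhere in the program; Python calls both on our behalf.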
So now we're going to put two variables in here. We're going to have x, a current score or whatever, and a name, and we're going to start the name out as blank. This time we're going to add a parameter onto the constructor. The self comes in sort of automatically as the object is being constructed. But if we put a parameter on the constructor call, which is this PartyAnimal() call, then it comes in as the z variable. So self is the object itself, and z, the first parameter after self, is whatever parameter we put here. Everything we've done so far has had no parameter here, but now we have a parameter here, and that means that when we call this constructor, this line of code runs and name is no longer blank: name is going to be Sally in this particular object. Then it'll print out self.name, which will be Sally, and say it has been constructed. And so that object is now constructed, and we put it in the variable s. Then we call the party method on that, and we construct a different one, and this time z is Jim, and we basically have another copy of this. So this is how it's going to look as it runs down here. When this is called, it makes one instance and stores that in the variable s, and there's a variable x in there.
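A sketch of the two-instance example, reconstructed from the narration. The parameter name z and the variables s and j come from the lecture; the wording of the printed messages and the order of the calls are assumptions.

```python
class PartyAnimal:
    x = 0
    name = ""

    def __init__(self, z):
        # z is whatever was passed in the constructor call
        self.name = z
        print(self.name, "constructed")

    def party(self):
        self.x = self.x + 1
        print(self.name, "party count", self.x)


s = PartyAnimal("Sally")
j = PartyAnimal("Jim")

s.party()
j.party()
s.party()
```

After these calls, s.x and j.x hold different values, which is the point: each object keeps its own count and its own name.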
There's a name in there, and there's an init method and a party method, and all that stuff is in here. This one is going to have Sally in there. Then we're going to do another constructor, so it's going to make a whole new thing and store that in j, and this one's going to have Jim in it. On s.party(), this x turns into a one. Then we're going to call j.party(), and that turns that x into a one. Then s.party() will cause this one to be a two. Okay, so what happens is that we now have two objects, one in the variable s and one in the variable j, and they have separate copies of their instance variables. These are the instance variables, or the object fields, or whatever, but they're the variables. The key is that every time we do a new construction, it duplicates this, and there's another copy of it. So there's an x within s: s.x is this variable, and j.x is that variable. Okay, so the next thing we'll talk about is inheritance, and that's the idea of taking one class and extending it to make something new. So the last topic we'll talk about here in object orientation is the notion of inheritance. This is a form of code reuse, and it's one of the more advanced aspects of object-oriented programming. So just kind of understand what it is at a high level, and then you'll know where to come back to when you need to learn a bit more about inheritance. The idea is that instead of making a new class from scratch, we make a new class by starting with an existing class. We are extending it; another word for this is subclassing. It's sort of a situation where you're like, I've got this code and I've got this data, and I just need to add a few things to it, and then I'll have a whole new thing. As you design objects and what we call object hierarchies, you often do this, and it's a form of real clever code reuse. But again, don't necessarily think that you're supposed to know
when to use this or why to use this right now. It's just terminology. Okay, just terminology. We call these parent-child relationships. The original class is called the parent class, and the new class is called the child class. Subclass is another word for this: you have a class, and then you subclass it. I think extending, inheriting, and parent-child are probably better ways of expressing it than subclassing. So here's a bit of code. Let's take a look at this. This code is unchanged. It's the PartyAnimal code that we've been seeing all along, the one where we construct it and put a name in. Now what we're going to do is extend it, and you'll notice that this code down here is the part that's doing the extending. We're making a new class, FootballFan, and putting PartyAnimal in parentheses before the colon says that FootballFan inherits everything that is in PartyAnimal, meaning the x, the name, the init, the party. All those methods and data are sitting there. And now we're going to add a new variable. So FootballFan has, in addition to all those other variables, points, and it has a touchdown method. In touchdown, we add seven to self.points, and then we call party. That is calling this method up here, because FootballFan includes x, name, party, init, everything, including this constructor. So FootballFan is really an amalgamation of all these things together. PartyAnimal is just this stuff, right? And we still have two classes. We don't just have one; we didn't erase the PartyAnimal class. So let's take a look at the code that we can run here. We can say, oh, okay.
Let's make a PartyAnimal, Sally. That constructs an object like this and stores it in s, with an x starting out at zero. Then we call s.party(), and that changes x to one. For this bit of code, it's as if this FootballFan part doesn't matter at all, because s is a PartyAnimal; it's not a FootballFan. But now take a look at this code down here. We're going to construct a FootballFan and pass in Jim. FootballFan has no __init__ of its own, so that actually uses the __init__ from PartyAnimal, because we extended PartyAnimal to make FootballFan. We inherited all of the good that was in there. So it's going to make a variable x, which is going to start at zero, a variable name that's going to have Jim in it, and a variable points that's going to have a zero in it. So this j variable has more things in it than the s variable has. We can call j.party(), and that goes up here and adds one to x. Then we call j.touchdown(). That comes down in here and adds seven to the points, and then calls party. With self.party(), self is the current object, i.e., self and j are the same thing. So it goes up here, passes self in, and adds one to the x, in this case the x of this j object. So this becomes two, and that's where it prints out, you know, seven and two, and away you go. So it's a way for you to take all this stuff and reuse it by making a new class and just adding the extending bits, the bits that are in addition to the other stuff. So, like I said, inheritance is a powerful and wonderful concept.
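The inheritance walkthrough above corresponds to code along these lines. The class, method, and variable names follow the narration; the constructor parameter name nam and the wording of the printed messages are assumptions.

```python
class PartyAnimal:
    x = 0
    name = ""

    def __init__(self, nam):
        self.name = nam
        print(self.name, "constructed")

    def party(self):
        self.x = self.x + 1
        print(self.name, "party count", self.x)


class FootballFan(PartyAnimal):  # extends PartyAnimal
    points = 0

    def touchdown(self):
        self.points = self.points + 7
        self.party()  # inherited method, called on the same object
        print(self.name, "points", self.points)


s = PartyAnimal("Sally")
s.party()

j = FootballFan("Jim")  # uses the __init__ inherited from PartyAnimal
j.party()
j.touchdown()
```

Note that s has no touchdown method at all, while j has everything a PartyAnimal has plus the extra bits.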
It's an excellent form of reuse. But basically, the whole purpose of this lecture was so that I could, in the future, just use these words and you would understand them. I want to be able to say method, and I've been saying method all along, and it's high time that I defined it. So let's just review one last time. A class is a template. It is not actually a thing; it is the shape of a thing. We define it and say that when we make one of these things, it's going to have these variables in it and these methods in it. Attributes are variables within a class. A method is a function that's inside of a class. An object is what we get back once we construct from a class. So the objects here are the snowman cookies, and the class is this snowman cookie cutter. A constructor is a bit of code that sets up our object, our instance, when it is first created. And inheritance is this ability to create a new class but take, and in effect import, all the capabilities of an existing class. So object-oriented is awesome. For the rest of this class, we're not going to write any object code. We're not going to use class at all. But we are going to use objects, and literally you've been using objects from the beginning of this course. As soon as you said x = 'hi', that's an object, and as soon as you said x.upper(), you were calling a method. You've been calling methods all along. When you do something like fh = open(...), this thing you're getting back is an object, and when you do fh.read() or whatever, you're calling a method with the dot operator. So you've been using objects all along. I'm just now finally explaining what I mean when I say call the read method or call the upper method, or what this little dot is and why it's there.
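To make that concrete, here is a small sketch of objects you have already been using. The scratch file, its name demo.txt, and its contents are placeholders created only so that open has something to read; they are not from the lecture.

```python
import os
import tempfile

x = "hi"          # a string object
print(type(x))
print(x.upper())  # calling the upper method with the dot

# Write a small scratch file so that open() has something to read.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as out:
    out.write("Hello there\n")

fh = open(path)   # open() hands back a file object
text = fh.read()  # read is a method of that object
fh.close()
print(type(fh).__name__, repr(text))
```

In both cases the pattern is the same: something hands back an object, and the dot looks up a method within it.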
So again, it's time for us to understand that, but it will take you a long time before you encounter a problem that's large enough that, as part of your solution, you're going to make a new object. When you do, it's really a powerful thing. But it's a really bad idea for me as a teacher to say, oh, write a bunch of objects. It's premature for that. It's later that you will actually learn how to use objects, and you'll be like, oh, thank heaven that these objects are here. Okay, so that's all for now. Thanks for listening. See you on the net. Hello, and welcome to our chapter on databases. We're going to learn a lot in this chapter. We'll learn a whole new programming language, SQL, and learn how to use it. You're going to need a new piece of software to run all of the exercises that I'm going to do, called SQLite Browser. We're using a database called SQLite. Go ahead and download this; you might have to pause and come back if you like. Go to sqlitebrowser.org and download it and install it. While you're doing that, we'll talk a little bit about the history. So in the old days, the 1960s and 1970s (I started doing computing in 1975), we didn't have a lot of storage. I mean, this is, you know, 16 gigabytes right here, and we didn't even have megabytes. I mean, the computer I had had a few megabytes of stuff.
So we didn't have a lot of disk drives, and permanent storage was often sequential, on these tapes and tape drives that we had. Tapes and tape drives were the scalable part of storage, because you could just make more tapes and rack them up. That was our way of greatly increasing the storage of the computer. The problem they had was that they were sequential: you read, it advances; read, advance; read, advance. Now, interestingly, we've been writing programs that do this. Everything we've written so far pretty much reads the whole file or reads the whole web page. We either read in a loop or read the whole thing, and that's because we have plenty of memory, but we're still reading sequentially. And so the way you would do this when you didn't have enough spinning storage, or online storage, is that you'd use offline storage, but the trick would be that you would sort it. So let's imagine that you're a bank and you have a bunch of accounts, only a few of which are active on any day. You have a tape that has, in account number order from low to high, the prior balance, last night's balance, of every one of your bank accounts. Then you do all the transactions, and you record how much money was put in or taken out for each account number, and then you sort those transactions. Then what you do is what we call a sequential master update: you would write a program that would read the first transaction and hold on to it.
So, okay, this is account 45. Then it would read the first account, like 1, and it would copy 1, and then it would read 2, and read 7, 8, 42, 43, then we'd read 44, and then we'd read 45. But now it would change that and write the new 45, and read the next transaction. This might be 60, so it would read a bunch of stuff and copy a bunch of stuff, and then it would finally get to 60 and it would merge, add or subtract. So the old balances were on one tape and the new balances ended up on another, and you only had to make one pass through the data, so it was super efficient. We had all these mechanisms to sort. We used to use punch cards and have sorters and all these things, and these things would run for hours. If you watch old TV shows, these tapes are spinning and these things are running back and forth. They are simply reading and writing tapes. That's how we did a lot of data processing, because we could store far more on tape than we could on a disk, and with racks of tape drives we could scale the storage that our computers had. But it meant that the only balance you knew was the old balance, the balance as of this morning before your bank opened. You didn't know what the balance was during the day, and that led to rules like: you can never withdraw more than a hundred dollars a day, or something like that, because they don't know what your balance is right now, and you might go withdraw a hundred dollars at a couple of different branches. They weren't able to look your stuff up right away. Now, it didn't take long until the disk drives got better and better and better, and you could store all the accounts and their current balances on computers. And then the problem becomes: what happens if, sort of in the middle of the afternoon, you want to update a balance?
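The sequential master update described above can be sketched in Python, with lists standing in for the tapes. This is an illustration of the idea, not historical code: the function name, the tuple layout, and the sample balances are all assumptions, and the sketch assumes every transaction refers to an account that exists on the master tape.

```python
def sequential_master_update(master, transactions):
    """Merge sorted transactions into a sorted master list in one pass.

    master:       list of (account, balance), sorted by account
    transactions: list of (account, amount), sorted by account
    Returns the new master list. Neither input is modified.
    """
    new_master = []
    t = 0  # position on the transaction "tape"
    for account, balance in master:
        # Apply every transaction held for this account, if any.
        while t < len(transactions) and transactions[t][0] == account:
            balance = balance + transactions[t][1]
            t = t + 1
        # Copy the (possibly updated) record to the new tape.
        new_master.append((account, balance))
    return new_master


old = [(1, 100), (2, 50), (44, 10), (45, 200), (60, 75)]
todays = [(45, -20), (60, 5), (60, 10)]
new = sequential_master_update(old, todays)
print(new)
```

Both inputs are read strictly front to back and never rewound, which is why sorting the transactions first is what makes the single pass possible.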
Well, do you want to read all your data and then write a brand new copy? Say that takes like 10 minutes. That means that for those 10 minutes, only one person can be updating their bank balance. And so, because we could randomly access this data, we didn't have to read it all sequentially. The trick was: how do you spread the data out, and then how do you make it so you can change a balance? This is of course second nature today, but how do you make it so you change the balance here without changing the balance there, have multiple people going simultaneously to these things, and make sure that you can't, say, withdraw money at two different locations simultaneously and somehow have your bank balance get corrupted by that? There was a lot of debate on how to do that. In the early days, we just did sequential master update, but increasingly we wanted to make better use of the random-access nature of our computers and our storage, and that's what led to databases. Databases are the science of how you make use of rotating, random-access, permanent storage in a way that allows you to read, modify, and update that data simultaneously from many different locations and yet keep the data completely consistent. This led to the study of a thing called relational databases. Relational databases are not the only databases that came about. We had many other kinds of databases, and there was a debate. I remember in the '70s and the '80s, there were folks who said, oh no, no, you can do indexed sequential, that's the way to do it. Relational databases weren't all that popular the first time that I saw them.
I didn't like relational databases. But relational databases had an inherent advantage, because they were based on some really powerful mathematics. The interesting thing is that early on, the relational databases were slower, but eventually people figured out how to bring all the cleverness to bear to make relational databases fast. And so relational databases are a pretty advanced technology, and there are companies like Oracle that are very, very wealthy, whose primary product for many, many years was nothing more than a clever database product, a clever piece of software that was really good at solving this problem. That's how important this problem was to computing. If you read about databases, you're going to see two sets of terminology. One set of terminology comes from the mathematical background and has to do with the underlying math, things like relations, tuples, and attributes. That's kind of like the fancy math version of it, and programmers kind of think of them as rows and columns inside of a table. So if you look at sort of fancy theory, you'll see words that look like this, and the texts are just full of them. Now, all this is important and true, and if you really want to get good, you begin to understand that we model data as connections, at sort of intersection points, rather than just modeling data as a flat file the way we have been doing. But for now, as programmers, we're going to think of this as just, oh, it's like a super fast spreadsheet. The super fast part is the math. For us, the rows, columns, and tables are spreadsheets. So think of a spreadsheet with sheets: sheet, sheet, sheet. Each sheet is like a table, a named thing like Tracks or Albums or Artists or Genres. Then there are rows, and there are columns, and each column has a different kind of data. We sort of specialize the first row in many spreadsheets to say what's in each column. This is not really the data.
This is like metadata. It's like the titles in this first row. That's not really the data, and the data starts here. We have different kinds of data, like strings and numbers, etc., for each of the columns. And literally, you can get away with thinking that about 80% of databases is just a really super cool spreadsheet, but under the covers it is far more powerful than that. One of the early arguments that happened was, again, about what the programming model for this should be. A lot of folks wanted a programming model that reflected how the data was actually stored. The notion of Structured Query Language came about as a way to express what you wanted to happen, and to allow that to be a very abstract expression: select all records that meet these criteria, not read, read, read, read, read. And so Structured Query Language is not a procedural language. It is a declarative language, where you're simply saying what you want, and then somebody else writes the loop; the database actually does the loop. It's a way for you to avoid actually writing the loop. Now, that turns out to be the power of databases, because the cleverness in how to write the loop is something you would probably never figure out how to make supremely optimal, as you'll see toward the end, when we're joining many tables together, selecting, throwing rows away, and getting down to a count or whatever. Someone has figured out how to do that really, really well. So the idea was that you would express, you know, we're going to create some data, we're going to retrieve some data, we're going to insert and delete it. Create, read, update, and delete: CRUD. And so that's what this does.
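As a rough sketch of the four CRUD operations expressed in SQL, here they are run from Python's built-in sqlite3 module against a throwaway in-memory database. The Users table, its columns, and the sample rows are invented for illustration and are not from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Create
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.execute("INSERT INTO Users (name, email) VALUES (?, ?)",
            ("Sally", "sally@example.com"))
cur.execute("INSERT INTO Users (name, email) VALUES (?, ?)",
            ("Jim", "jim@example.com"))

# Read: we say what we want, and the database does the loop
cur.execute("SELECT name FROM Users WHERE email = ?",
            ("jim@example.com",))
found = cur.fetchall()

# Update
cur.execute("UPDATE Users SET name = ? WHERE email = ?",
            ("James", "jim@example.com"))

# Delete
cur.execute("DELETE FROM Users WHERE email = ?", ("sally@example.com",))

cur.execute("SELECT name, email FROM Users")
remaining = cur.fetchall()
print(found, remaining)
conn.close()
```

Notice that nothing in the SELECT says how to find the row. There is no loop over records in our code; the statement only describes which records we want.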
It's a language that does this very simply. Now, the applications that we're going to use this for are more data analysis applications. We've been doing data analysis through the whole course. The kind of thing that we'll see in the remaining chapters is that we'll take some raw data files, which might actually come across the network, and we'll write some Python programs to play with that data, parse it, clean it up, make sense of it, and then write it into a database. This might be a slow process, and the raw data might be really nasty, and this is a way to end up with very clean data. Then we'll write another Python program to read through it, and it's all efficient and pretty, and then we can produce files, and maybe we'll visualize the data or do further analysis in R or Excel or a JavaScript visualization framework. In this situation you will be both the person writing the programs and the database administrator, and using SQLite Browser you can play with and look at the database kind of in a raw way. In the first part of this we are mostly going to be using SQLite Browser just to talk straight to a database. Later, we'll write Python programs that read and write data and visualize the data. So this is what we're going to do first, and then second we're going to do this part right here. Now, another really common kind of application, and something you'll run into if you continue learning more about programming, is that you will want to write an online application, like Amazon or Twitter or some company that's got a website and stores dynamic data in databases. The picture for that is similar to, but different from, the picture we're going to start out with. The way this usually works is that you, the end user, use a web browser that talks to the application. The developer writes the application software, and that application software stores its data in a database. And inside that database we talk to the
database using sql And all the data is actually stored here and the magic happens the data server is that database software that's so precious and valuable And then there is another person often called the database administrator Who has access to the direct access to the data and these roles in medium and large projects are kept separate mostly because the Mostly because the um the production While it's running and live the developer leaves the data alone and works on say the next version of the software And then the developer has a test version of the application that they run on their computer Where they're doing all that stuff and so this database administrator is a is a role in a large Project where we have to run production and keep production careful Keep production in good shape So the database administrator has this responsibility for the production aspects of the data And you may be working in a situation where that you're not actually controlling the data the database servers on different computers You have a little special access and you write programs to sort of read the data um And so the database administrator is the person who is Asked by the organization to administer that data The data that we develop and we'll do this in the second part of these lectures Um conforms to a data model. That's the metadata. Is this an integer? Is this a string? You know, how many columns is this and the data model turns out to be very very important There's a lot of science to building an effective data model that leads to really good performance And it's a it's a collaborative activity between the the application developers And the database administrator to make it so it's efficient runs in production Etc etc etc. 
There are a lot of products out there that you may encounter. We're going to be using SQLite. SQLite is a tiny little database, and it's built into so many things, and that's why we like it. But if you're going to work at a large organization, you can easily run into Oracle, which is the number one commercial product. Microsoft has a thing called SQL Server, which is a commercial product, and it's also very popular and very effective. On the more popular open source side, there are things called Postgres and MySQL. MySQL recently was sort of bought by Oracle, and there is a copy of it called MariaDB that doesn't belong to Oracle. Most of the SQL that we're going to learn is common across these database systems, because SQL is a standard. But then there are parts that weren't part of the original standard, where each database vendor has done things a little bit differently. There is a core common subset that does the basic create, read, update, and delete operations.

SQLite is very popular. You probably have it in your cell phone 10 or 12 times, your web browser has a database engine in it, and your car has a few databases in it. SQLite is what's called an embedded database system. Python comes with it built in: you just import sqlite3 and away you go. It's very popular because it's free, it's open source, and it is such a tiny little piece of software that you just include it in other pieces of software and use it to solve the data management problems of those pieces of software. Your browser might use SQLite to store your bookmarks. Now you think, how many bookmarks can you have? But what if you need it to be fast, and what if there are people that have 10,000 bookmarks? There probably are. Do you still want it fast? Do you want to be able to search?
And so you get all that by using a database like SQLite. So again, we're going to encourage you to download SQLite Browser so you can follow along with what we're going to do coming up next. Here is SQLite Browser. Here's what it looks like; it's just a desktop application. Coming up next we'll start playing with this desktop application and see how it works.

So now we're going to make a database. We're going to use SQLite Browser; hopefully you've downloaded it so you can follow along. I've got this handout, this basic database handout, that saves you from having to type all these things, so bring that up in your web browser. That gives you all of the commands that I'm going to type now. You can pull them out of the web page, or you can pull them out of the slides. So I'm going to bring up the database browser here. Now the thing that's going to happen, and you'll see this happen on my desktop, is I'm going to make a new database. You have to store it somewhere, so I'm going to put it on my desktop and I'm going to call it py4efund. And so we should see a new file on my desktop right there: py4efund. Now that's a file that you don't want to edit with a text editor or anything like that. This is a file that's to be read by SQLite Browser and nothing else. Okay, so we're going to create a table. I'm going to make a table called Users with a column called name that's text and a column called email. It's already asking me to make a table.
I'm going to call this Users, and I'm going to add a field that is called name, and I'm going to make it text. And I'm going to add another field called email, and I'm going to make that text too. Now the key thing here is that we are in effect making columns and rendering an opinion as to exactly what each column is supposed to be used for, and we're not allowed to violate that. It's not like, oh, we'll do whatever you want, because the database is optimizing its storage based on that opinion. In effect we're making a contract with ourselves. We could make these columns anything we wanted, but once we decide, we have a contract with ourselves. You can see, and it's kind of small here, that there's a CREATE TABLE, and that's on the slide, and that's the SQL way of doing it. This user interface is just helping us write SQL. So now I'm going to just say OK, and if you take a look you can see that I now have a table Users, and I can look at my database structure and the table Users, and away we go. So that is creating it, and like I said, here in the slides or on the web page there's the CREATE statement that could have done it.

Now we can insert some data. Let's add a new record to this table Users, and we'll call this guy Charles, csev. So now we have a record, so it's kind of like a database spreadsheet. Now, that's not the SQL way to do it. There's SQL going on in the background, but if we really want to do this using SQL, we're going to use the INSERT statement. The INSERT statement looks like this. The SQL syntax sometimes has extra words: INSERT INTO is actually the SQL keywords, then the name of the table, then the columns in parentheses, then the word VALUES, and then the values in parentheses, in one-to-one correspondence with the columns. It looks kind of like a tuple in Python, but we're nowhere near Python right now.
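The lecture runs these statements by hand in SQLite Browser, but the same SQL can also be sent from Python. Here is a minimal sketch of the CREATE TABLE and INSERT steps, assuming an in-memory database rather than the py4efund file and assuming the email value shown (the lecture only types "csev" at this point):

```python
import sqlite3

# Assumption: an in-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each column is declared with the kind of data it is meant to hold.
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")

# INSERT INTO <table> (<columns>) VALUES (<values>): columns and values
# are matched up one-to-one, in order.
cur.execute("INSERT INTO Users (name, email) VALUES ('Charles', 'csev@umich.edu')")
conn.commit()

rows = cur.execute("SELECT name, email FROM Users").fetchall()
print(rows)
conn.close()
```

Each row comes back as a tuple with one entry per selected column.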
Okay, so that's what we're going to do. I'm going to grab this Kristin one, and I'm going to go over here to SQLite Browser and go to Execute SQL. Now I can paste that in and then hit this little run button, and that's going to submit the SQL to SQLite and then update that file. It says query executed successfully, and away we go. So if I go back now and look at the data, I see that there are two things in here. Now I can actually insert all the rest of these. Let's go back to my little bit of stuff here and put all these other rows in. It turns out that if I go into Execute SQL and I want to do more than one command at a time, I can put a semicolon at the end of each one of these things, and then I can run them all at the same time; I mean, one after another is actually what's going on here. So boom, boom, boom, and I'll take a look at the data, and look, I've got all those things in there now. Eventually the thing that's going to generate that SQL is a program, not us. Right now we're being the database administrator, so we're sort of doing things manually. Once things get going, you write programs that do that insert over and over and over again, in Python or a web language like PHP or something like that.
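As a rough Python equivalent of pasting several semicolon-separated statements into Execute SQL, the sketch below uses `executescript` from the standard `sqlite3` module. The database is in-memory and the names and addresses are illustrative, not the exact rows from the handout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")

# Several statements separated by semicolons; they run one after another.
# The rows below are illustrative sample data.
cur.executescript("""
INSERT INTO Users (name, email) VALUES ('Kristin', 'kf@example.com');
INSERT INTO Users (name, email) VALUES ('Ted', 'ted@umich.edu');
INSERT INTO Users (name, email) VALUES ('Sally', 'sally@example.com');
""")
conn.commit()

count = cur.execute("SELECT COUNT(*) FROM Users").fetchone()[0]
print(count)
conn.close()
```

The plain `execute` method only accepts a single statement, which is why a script of several statements goes through `executescript` here.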
And so that is the insert. Now we can get rid of data. I'm going to say DELETE FROM, that's the keyword, and Users is the name of a table. WHERE is a where clause; we'll have lots of where clauses in SQL. It's not like an if; in effect the delete is going at the whole table and being turned on and off for each row by this where clause. So DELETE FROM Users, if you didn't put the where clause on, will actually delete all the rows. But WHERE email = 'ted@umich.edu' makes it so it only applies to the rows where that is true. So I'm going to go over here into Execute SQL and I'll say DELETE FROM Users WHERE email = 'ted@umich.edu', and then I'm going to run it. Because it's only one statement, I don't need a semicolon at the end of it. And now if I go back and look at the data, Ted is gone.

Okay, update. UPDATE is a keyword, Users is the name of the table, SET is a keyword, and then this is column equals new value, and then a where clause again. This update, if we didn't have a where clause, would change every row in the table. And so, WHERE email = 'csev@umich.edu'. Oh, I've got to change that, because I already set the name to Charles. You see the name is already Charles, so I'll make this be Chuck so we can see it change. Then I run it, and you take a look at the data, and it's changed. That's it. That's an update statement.
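A small sketch of the same DELETE and UPDATE statements issued from Python may help show how the where clause narrows each operation. It assumes an in-memory database seeded with two rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    [("Charles", "csev@umich.edu"), ("Ted", "ted@umich.edu")],
)

# The WHERE clause limits the delete to matching rows;
# without it, every row in the table would be deleted.
cur.execute("DELETE FROM Users WHERE email = 'ted@umich.edu'")

# Same idea for UPDATE: only rows matching the WHERE clause change.
cur.execute("UPDATE Users SET name = 'Chuck' WHERE email = 'csev@umich.edu'")
conn.commit()

remaining = cur.execute("SELECT name, email FROM Users").fetchall()
print(remaining)
conn.close()
```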
You're doing great, and so the next thing we're going to do is take a look at how we retrieve data. This is the SELECT statement. SELECT star: you have a list of columns, and star means all columns. FROM is a keyword, and then the name of a table. This SELECT * FROM Users is the kind of thing you type all the time. As a matter of fact, it's what SQLite Browser is doing internally to cause this data view to happen. But we can do it by hand by saying SELECT * FROM Users and then running it, and then we get a little record set, which is those four records that are sitting there. We can also throw a where clause on the end of it. So we say SELECT * FROM Users WHERE email = 'csev@umich.edu'. Again, the SELECT * FROM Users goes at the whole table, and the where clause then filters out all of the rows except one record. So the where clause is: send it to the whole table, but then filter based on whatever, and so it only shows us that one.

Okay, we're cruising right along here. You can also put an ORDER BY clause on there. So we can say SELECT * FROM Users ORDER BY email, where email is a column, and that orders by email. Or we can change it to name, and we can say descending, so that's name in descending order. Sorting and selecting are things that databases are really good at.

So this is the summary of what I've told you. I said that databases do create, read, update, and delete: CRUD. We've done all those things, except we did them in the order create, delete, update, read. That's the summary of SQL. You might be asking, why did I take so long to learn such a simple and elegant and beautiful language? Because it's not really exciting. It's an extremely simple language.
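The three forms of SELECT described above can be tried from Python as well. This is a minimal sketch, assuming an in-memory database and three illustrative rows (not the exact contents of the lecture's table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Users (name TEXT, email TEXT)")
cur.executemany(
    "INSERT INTO Users (name, email) VALUES (?, ?)",
    [
        ("Chuck", "csev@umich.edu"),
        ("Sally", "sally@example.com"),   # illustrative row
        ("Kristin", "kf@example.com"),    # illustrative row
    ],
)

# Star means all columns; with no WHERE clause, every row comes back.
all_rows = cur.execute("SELECT * FROM Users").fetchall()

# The WHERE clause filters the rows down to the ones that match.
one_row = cur.execute(
    "SELECT * FROM Users WHERE email = 'csev@umich.edu'"
).fetchall()

# ORDER BY sorts the record set; DESC reverses the order.
by_email = [r[1] for r in cur.execute("SELECT * FROM Users ORDER BY email")]
by_name_desc = [r[0] for r in cur.execute("SELECT * FROM Users ORDER BY name DESC")]

print(len(all_rows), one_row, by_email, by_name_desc)
conn.close()
```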
It's very predictable, and you're like, well, that's pretty easy. It turns out that some of you may have been using SQL in situations, maybe with Microsoft Access or something, where you actually typed in this stuff, and you just kind of typed it and never realized that you were learning a programming language. That's why I like SQL. It's a very declarative language and it's very straightforward. It's much easier to learn SQL than it is to learn Python, because in Python you have to figure out how loops work and how iteration variables work, and you'll notice there's none of that here. But the key is that we've only started to understand the power that comes from the simple ability to move around and update data and read data randomly using these simple sets of commands. Up next we're going to look at how you do this with data models and relationships and multiple tables.

Hello, and welcome to a code walkthrough. In this bit of code we're talking about emaildb.py. This is a beautiful little example in that it reduces talking to the database to its pure essence. We'll start out this code by importing sqlite3, just to get the library there. We make a connection, and in databases we end up with an open that's two steps. There's the connection to the database, which checks access to the file, and then the cursor is kind of like our handle. It's not as simple as just opening it and reading it; you open it, and then you send SQL commands through the cursor, and then you get your responses through that same cursor. So cur here is the variable that we're interested in. The first thing to note is that we've got this file name. It will create this file if it needs to, and right now this file doesn't exist. It's going to be in the same directory. Yeah, there's no emaildb file.
So this is actually going to create the file when it runs. Then the first thing we're going to do is drop the table if it exists. DROP TABLE is a bit of SQL, and IF EXISTS just keeps this from blowing up if we start with a fresh database. In this case there is no file there, so we are starting with a fresh database, and this will accomplish absolutely nothing, which is just fine. Now, when you see triple quotes here, I'm just using them to make this a little bit easier to read. I probably could pull those lines up a bit. This one's actually small enough, so maybe I'll just do that. Let's bring that baby right up and turn this into a single-quoted string. That's short enough, right? But this one here is a little longer, so I use triple quotes. So we're going to drop the table, which is going to do nothing the first time through, and then we're going to do a CREATE TABLE. Sometimes your application will have a README or something that says go run these commands to set the database up, but we're able to just set the database up ourselves in this particular application. We'll see later ones where we're going to leave the database in place and not start it fresh. In this one we could do the same, but we're just going to start fresh by dropping the table. So we'll create it, and we're going to have an email and a count. What we're doing here is we're really going to pretend that this is a dictionary. If you recall, I said a dictionary is like an in-memory database. Well, now we're using a database to do a database, but the first thing we're going to do here is pretend it's a dictionary. So that's a little crazy. These next lines of code hopefully are pretty familiar to you, right? You get a file name, and it uses mbox-short.txt by default so we can just press the enter key, and then you loop through it, right?
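The setup steps described so far (connect, get a cursor, drop the table if it exists, create it again) might look roughly like the sketch below. It is not the course's emaildb.py itself; it assumes a temporary folder so that nothing is left on disk, and it assumes the column types shown:

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Assumption: a temporary path stands in for emaildb.sqlite in the
    # current directory.
    path = os.path.join(folder, "emaildb.sqlite")

    # Two-step "open": the connection reaches the file, the cursor is the
    # handle we send SQL through and read results from.
    conn = sqlite3.connect(path)
    cur = conn.cursor()

    # IF EXISTS keeps this from failing on a fresh database.
    cur.execute("DROP TABLE IF EXISTS Counts")

    # Triple quotes only make longer SQL easier to read.
    cur.execute("""
        CREATE TABLE Counts (email TEXT, count INTEGER)""")
    conn.commit()

    file_created = os.path.exists(path)
    tables = [
        r[0]
        for r in cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    print(file_created, tables)
    conn.close()
```

Because the table is dropped and recreated on every run, each run of a program built this way starts with empty counts.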
And so this little part right here is our basic loop that we're doing, and that is pretty normal. That line right there makes sure that we only get the From lines. We've done that a bunch of times. Then we're going to split it. We're not going to rstrip, because the split's going to take care of that. And then we're going to grab the email address, which of course in the From line is the second piece. So now we're going to do some database work. This bit right here is kind of like the dictionary part. The first thing that we're going to do is select count from our database, which is an integer, where email equals, and this part right here bears some explaining. This is going to be csev@umich.edu or whatever. Now, it is dangerous to put strings, especially from user-entered data, directly into your SQL. You technically could; I could make this be email equals single-quote csev@umich.edu single-quote, and I'd have to escape the quotes and stuff. But this question mark is a placeholder, and this is a way to make sure that we don't allow SQL injection. Go Google SQL injection to get a sense of what that is. It's more of an issue in online applications, but in this application we're just being good. The way this works is that this is a placeholder in the SQL that will ultimately be replaced by this value. You could have several question marks; we only have one in here. And so you give a tuple, and if we just put email in parentheses, it won't turn into a tuple.
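To make the placeholder and the one-element tuple concrete, here is a small sketch. It assumes an in-memory Counts table with one pre-loaded row, and the "hostile" string is a made-up example of the kind of input that placeholders keep from being interpreted as SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")
cur.execute("INSERT INTO Counts (email, count) VALUES (?, ?)", ("csev@umich.edu", 4))

email = "csev@umich.edu"
not_a_tuple = (email)   # parentheses alone: still just a string
one_tuple = (email,)    # the trailing comma is what makes it a tuple

# The ? is filled in from the tuple; the value never becomes part of the SQL text.
cur.execute("SELECT count FROM Counts WHERE email = ?", one_tuple)
row = cur.fetchone()

# A made-up hostile value is compared as plain data, so it matches nothing.
hostile = "x' OR '1'='1"
cur.execute("SELECT count FROM Counts WHERE email = ?", (hostile,))
hostile_row = cur.fetchone()

print(type(not_a_tuple).__name__, type(one_tuple).__name__, row, hostile_row)
conn.close()
```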
This is a one-tuple, basically. This little weird syntax, parenthesis, email, comma, parenthesis, is a tuple with only one thing in it, and that's just the weird Python syntax. It's rare that I apologize for Python syntax, but that's a little bit less than pretty. But it's okay; it's a tuple, and normally, if there were two of these, then there would be email, name, dot dot dot. Okay, so this cur.execute is actually not really retrieving the data. In a way, it's looking at the SQL, and it might verify that the table name is right or whether there are any syntax errors, etc. So this is not really reading the data, but we have prepared this cursor. This is kind of like the opening of a file, but what we're opening is a record set. We're opening a set of records, the ones where this is true. So it's like we're going to read this like a file. Later things will loop through this, but here we're only going to say, hey, grab that first one. We could have even put a LIMIT clause on there or something. Grab the first one and give it back in row. So row is going to be the information that we get from the database, and if there are no records that meet this condition, then row is going to be None. So here's, again, something like the dictionary get, for the case where the row wasn't there. The way we're doing this, we're going to end up with these rows in the database. Here is this database, and there are going to be two columns and a bunch of rows, and here's going to be csev 4 and cwen 3 and so on, right? These are the counts. So we're grabbing this value out, if it's csev that we're looking up, and that's going to show up in here, in row. It turns out that the row is a tuple, but we're only getting one thing. And if we searched through and there was nothing, then row is None, which means that we're seeing, say, cwen for the first time and we have to insert it. So if row is None, we're going to run an insert statement: INSERT INTO Counts (email, count). Now, we've got to set count to one because it's the first time we've seen it. So VALUES, and then again the question mark. The question mark basically says, hey, I'm going to have a value in this tuple, and there's an ordering to the tuple. There's only one question mark placeholder here, and then 1 is the initial count. So email is a question mark, count is 1, and away we go. Then again we have a tuple that gives this execute statement the corresponding strings or integers that are to be substituted for each of the question marks. So when this runs, there's going to be a new record, and there's going to be a 1 put into that new record.

If on the other hand we pull back a row that exists, we're going to get this number 4. You might think we want to take this number 4 and add one to it in Python. But in databases it's always better to do it in an update, because there might be multiple applications talking to this database at the same time. What the update does is, in a single atomic operation, turn whatever this number is into one higher, and we don't have to worry about other pieces of code potentially modifying it. Now, in this case we don't have to worry about that, because we're the only piece of code. But using UPDATE to increment something is way better than reading the value, adding one inside of Python, and then writing back the new value, which is two SQL statements and is also not atomic. Okay. So if the row is not None, if the row exists, we just know that it exists and we just want to add one to the number. We do have the number sitting here in the row variable, but we don't need it.
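The check-then-insert-or-update logic can be sketched as a small helper. The function name `count_email` is hypothetical (the lecture's code does this inline in its loop), the database is in-memory, and the second address is illustrative:

```python
import sqlite3


def count_email(cur, email):
    """Record one more sighting of email (hypothetical helper)."""
    cur.execute("SELECT count FROM Counts WHERE email = ?", (email,))
    row = cur.fetchone()
    if row is None:
        # First time we have seen this address: start its count at 1.
        cur.execute("INSERT INTO Counts (email, count) VALUES (?, 1)", (email,))
    else:
        # Let the database do the increment in one statement instead of
        # reading the number into Python and writing it back.
        cur.execute("UPDATE Counts SET count = count + 1 WHERE email = ?", (email,))


conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")

for address in ["csev@umich.edu", "someone@example.com", "csev@umich.edu"]:
    count_email(cur, address)
conn.commit()

counts = dict(cur.execute("SELECT email, count FROM Counts"))
print(counts)
conn.close()
```

Note that `row` is only used to decide which branch to take; the old count it holds is never needed in Python.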
And so we're going to say UPDATE Counts SET count = count + 1, where count is the column name, WHERE email equals, and then another placeholder, and then another tuple for the question mark. Okay, so that's what this little bit of code does. That is kind of the read it, parse it, check to see if it's there, if it's not insert it, if it is update it. And then we see this conn.commit(). Basically the way it works is that the database is keeping some of the information in memory for efficiency, and at some point it has to write all that stuff out to disk. So you can choose when you do this commit. Right now we're going to commit every time through this loop, but you might commit every tenth time through the loop, because the commit takes some time: it forces everything to be written to disk. These statements can run really fast, and the commit is the slowest part here. So sometimes we do things like commit every tenth record or every hundredth record. If it's an online system, which is not what this is, you have to commit at the end of every screen paint. But for this kind of a system, because we're putting so much in, this is kind of a bulk insert, so we might come up with a thing where every tenth time we do a commit. Ultimately, what this will do when it's running is build up slowly but surely, adding new records with a one, and another one, and then some will become a two and a three, and it will add another one that'll be a one. It'll do this thing, right? And at the end of the day that is what's going to be in the database.

So now let's take a look at what's in the database; now we can actually read the database. We're going to run a SELECT, and we're going to select the email and count from Counts ordered by count descending. So look at that, isn't that cool? We're getting the top 10, because databases are good at sorting and they're good at all these other things. We're going to then execute this, and then we're going to ask for the rows one at a time. Each row is going to be a tuple, and row sub zero will be email and row sub one will be count. So we run all this stuff, and then we close the connection, and away we go.

Okay, so let's go ahead and run this: python3 emaildb.py. It asks for a file name. I could type mbox-short, or I can just hit enter and get mbox-short. And that's it. It looks just like that, and it counts it, and away we go. Now the difference is that at this point we have a file, emaildb.sqlite, and we can run SQLite Browser and open this database and see what's in there. So here we go. It has made a SQLite database. We have a table Counts, and we can take a look at the data, and there we go. We've got the data. And so let me close this. It's important at times; you don't necessarily want to have... well, let's see if we can cause it to lock up. Let me run this again, and it's going to drop this table.
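Two ideas from this part of the walkthrough, committing every so many rows and reading back the top ten, can be sketched together. The batch size, the in-memory database, and the pre-counted sample data are all assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Counts (email TEXT, count INTEGER)")

# Hypothetical, already-counted sample data: user1 has count 1, user25 has 25.
sample = [(f"user{i}@example.com", i) for i in range(1, 26)]

COMMIT_EVERY = 10  # assumed batch size; the lecture's code commits every time
for n, (email, count) in enumerate(sample, start=1):
    cur.execute("INSERT INTO Counts (email, count) VALUES (?, ?)", (email, count))
    if n % COMMIT_EVERY == 0:
        conn.commit()
conn.commit()  # make sure rows after the last full batch are written too

# Let the database sort and cut the list down to ten rows.
top = []
for row in cur.execute(
    "SELECT email, count FROM Counts ORDER BY count DESC LIMIT 10"
):
    # Each row is a tuple: row[0] is the email, row[1] is the count.
    top.append((row[0], row[1]))
    print(row[0], row[1])

conn.close()
```

The final commit after the loop matters whenever the number of rows is not an exact multiple of the batch size.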
So I'm going to run the code again, but this time I am going to do the full one, mbox.txt. Now we'll see what happens here. It ran. What we see here is the data from the previous run, so if we want to get the most recent data, we hit refresh, and away we go, and we can see this stuff. So this is just a real simple start to see how you can connect some of the stuff that we've been doing but store the data in a database. The nice thing about the database is that it can store this stuff from run to run, even though in this case we're dropping the table every time. In later examples we will see how we can store data from run to run to give ourselves more restartable processes. Cheers.

We're going to do some code walkthrough, and if you want to follow along with the code, you can download the sample code from Python for Everybody. The code that we're going to play with is the Twitter spider, code that is both talking to the Twitter API and talking to the database. What we're going to be doing is running code that's going to hit the Twitter API, much like we did in a previous chapter, and retrieve the data, but we're going to remember the data so we don't have to retrieve it again. So we're going to keep track of people's friends, and what we're doing here is sort of illicitly pulling down, slowly but surely, subject to our rate limit, who our friends are. So let's take a look. We're going to use urllib and urllib.error, and twurl, which is code that augments my URL to do all the OAuth calculation. We're going to get JSON data back.
We're going to make a database, and we have to import ssl because of the way Python doesn't trust any certificates, no matter how good they are. And so this is our URL to talk to the Twitter API. We're going to make a database, and again, the way SQLite works is that if this spider.sqlite doesn't exist, it creates it. We get ourselves a cursor, and we're going to do a CREATE TABLE, this time with IF NOT EXISTS. SQLite 3 supports this: it creates the table only if it doesn't exist. We want to start this over and over. Unlike the tracks example, I want to start this over and over and not lose data. This is a spidering process, and we'll see a lot of these; we want a restartable process where we use a database. So if we're starting with nothing and there's no file spider.sqlite, it creates this table, and it's the name of the person, whether we've retrieved it or not, and how many friends this person has that we know of in our database. Now, this little bit is to deal with the SSL certificate errors. The certificates are totally fine, but Python doesn't trust any certificates by default, which is frustrating, but whatever. So here we're going to have a loop. We're going to ask for a Twitter account. We have to type quit to quit. If we hit enter, then we're going to read from the database an unretrieved Twitter person and then grab all that person's friends. So we're going to do a fetchone, get one row, and that's going to get the name of the first person, the sub zero. If we had more things than name here, sub zero is the first of those. fetchone means get one row from the database, and sub zero means the first column of that row. And if this fails, then we've retrieved all the Twitter accounts. We're going to augment this Twitter URL; you can look at the twurl.py code. This basically requires the hidden.py file, which has your keys and secrets in it.
You've got to get hidden.py updated. I've got it updated, but I'm not going to show you, because it has my keys and secrets in it. We're only going to take the first five, which means we're probably not going to find friends of friends of friends; it's only the five most recent ones. We could run this with a much higher number so that we get to the point where we have more than one friend. We'll show the URL while we retrieve it. We will do our urlopen, we'll do a read, and then we'll do a decode. The read will give us data in UTF-8, and the decode will give us data in Unicode, which is what we need inside of Python. We will ask for the headers from the connection: we'll say give me the headers, give me a dictionary of the headers. The x-rate-limit-remaining header from the Twitter API tells us when we're going to be told we can't use this API anymore, because this is one of those rate-limited things. Then we're going to parse and load the data that we got from Twitter and get a, I think it's a list. Yeah, it's a list. Then we could dump this if you want; you can uncomment that. What we've just retrieved is this person's screen name and their friends. So the first thing we want to do is update the database and change retrieved from zero to one, because we're going to use this to know which accounts are unretrieved. retrieved being one means we've already retrieved it, and we did retrieve it, so for that account we mark it retrieved. Then we're going to parse that, and this is similar to the Twitter code we did previously in the web services chapter. We're going to go through all the users. We're going to find their screen name.
We're going to print the screen name out. Okay, and then what we're going to do is, let's see, we're going through all the users who are the friends of this person, and we're going to say, okay, let's select the friends from Twitter where the name is this friend. Then we're going to do a cur.fetchone() on this; this is the friend's screen name, right? So if we get a row, we're going to get how many friends this particular screen name has. If we find it in there, we're going to do an update statement and add one to their friend count, how many friends they have, and then keep track of a count. This count here is not in the database; it's just so I can print it out at the end. If there is no record for this particular friend, we're going to insert them as new. We're going to say, here's the new person that we just saw, here's their name, we're going to set retrieved to zero, and we're going to say that they have one friend. And then we're going to commit the transaction, and we're going to close this at the end.

Okay, so let's go ahead and run this. The first time it's going to create an empty database. So I'm going to say ls *.sqlite; nothing there. That's because I removed it. python3 twspider.py. Okay, so I'm going to start with a Twitter account, drchuck. It's doing this retrieval, and don't worry: showing the token and the signature is not dangerous, because you don't have the secrets, I mean the key secret and the token secret. So don't get too worried. I have 11 calls left, so I hope this all works. One of my friends is Stephanie Teasley, and these are in reverse order.
So let's grab Stephanie and ask for Stephanie's friends. So now we've just retrieved Stephanie's friends, and here are Stephanie's most recent friends. And then I can just hit enter and it'll randomly pick one. Let's look in the database. Let's open this up: File, Open Database. I hope I don't lock myself out; sometimes it's a little scary when you look at the database, and you're just checking. So this is what my database looks like. We retrieved Stephanie, and these are the friends of Stephanie and me, and this is how many friends each has. I'm not in there. So we retrieved Stephanie, who was a friend. Let's go grab, oh, I don't know, let's grab Tim McKay. Remaining 10; I don't have too many of these. Tim McKay, right. So there we go, remaining nine. And if I do a refresh on this, then you see I've got some more folks. If I hit enter here, it will pick one randomly based on retrieved being zero. So it won't pick Stephanie or Tim, because their retrieved is one, but we have lots of other folks to pick randomly, and we'll hit enter. Who did it pick? It picked screen name liveedutv, which is ironic because I'm recording this on LiveEdu.tv right now. And so we can keep hitting refresh and away we go. I'm going to stop now because I only have eight remaining. So I'm going to type quit, and we'll see how that works. So that's how it works. Now remember that you've got to edit the hidden.py file to make this work, because we are talking to the Twitter API. If you don't edit that file, it won't work for you.
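The database side of the spider can be sketched without touching the network. This is not the course's twspider.py: the helper names `next_unretrieved` and `record_friends`, the `fake_api` dictionary that stands in for the Twitter API, the account names, and the exact column types are all assumptions; only the idea (a retrieved flag plus a friend count in a table that is created if it does not exist) comes from the walkthrough.

```python
import sqlite3

# Assumption: in-memory for the sketch; a file such as spider.sqlite is what
# lets a real spider pick up where it left off on the next run.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# IF NOT EXISTS means restarting the program does not wipe earlier work.
cur.execute("""CREATE TABLE IF NOT EXISTS Twitter
               (name TEXT, retrieved INTEGER, friends INTEGER)""")


def next_unretrieved(cur):
    """Return the name of some account not yet retrieved, or None."""
    cur.execute("SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1")
    row = cur.fetchone()
    return None if row is None else row[0]  # row[0] is the first column


def record_friends(cur, account, friend_names):
    """Mark account as retrieved and count each of its friends."""
    cur.execute("UPDATE Twitter SET retrieved = 1 WHERE name = ?", (account,))
    for friend in friend_names:
        cur.execute("SELECT friends FROM Twitter WHERE name = ? LIMIT 1", (friend,))
        if cur.fetchone() is None:
            # New person: not retrieved yet, seen as a friend once.
            cur.execute(
                "INSERT INTO Twitter (name, retrieved, friends) VALUES (?, 0, 1)",
                (friend,),
            )
        else:
            cur.execute(
                "UPDATE Twitter SET friends = friends + 1 WHERE name = ?",
                (friend,),
            )


# Hypothetical friend lists standing in for API responses.
fake_api = {"alice": ["bob", "carol"], "bob": ["carol"]}

cur.execute("INSERT INTO Twitter (name, retrieved, friends) VALUES ('alice', 0, 0)")
first = next_unretrieved(cur)
record_friends(cur, first, fake_api[first])
record_friends(cur, "bob", fake_api["bob"])
conn.commit()

state = {name: (retrieved, friends) for name, retrieved, friends in
         cur.execute("SELECT name, retrieved, friends FROM Twitter")}
print(state)
conn.close()
```

After these two rounds, the only account still marked unretrieved is the one that would be picked if the user just pressed enter.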
Okay, so I hope you find this useful. Cheers.

So now we're going to take a look at how we deal with more than one table, multiple tables, because the real power of SQL and of database performance comes when you start connecting tables together. If you go back to that original mathematics, it models data at the intersections between the rows and the columns, and these intersections are the magical bits. Breaking an application up to use multiple tables is an art form. It takes a while. There are some simple, basic things that you can learn, and we'll teach you those here. So it's not too hard to learn the basics, but it's much more complex to become super skilled at it. In my mind it's hard to teach advanced databases because they're always so contextually grounded. At something like a Twitter or a Google, the databases are so specialized. Everyone can do small to medium-sized databases using the basic techniques, but once you escape medium-sized databases, you end up in these narrow specialties and you optimize each database very separately. So I just tell people: learn the basics really, really well, write programs, and then go do real work.

Database design is the act of figuring out the data that your application is going to want to store and spreading that across multiple tables. But we don't just do it randomly; we do it very cleverly. If you look at a data model, this is what it looks like. What we're showing in this data model is five tables, for a kind of calendar system. We're seeing the columns that are in each of the tables, and then we're seeing the relationships between the tables. Even in these relationships there's a little bit of code: when you have an arrow that looks like that, there are many of those to one of these, and this is a many-to-one relationship. Many-to-one relationship.
We'll talk all about that stuff. But if you go into an organization that has a really large and complex data application, they might have something printed out on the wall that looks about like this, which shows the database tables and connections, et cetera. And they might say: your job is to go down into this little corner and add one column there, then connect it with this thing over there, and then make a screen that shows all these things, pulling from this table, this table, this table, and that table. That's your job if you're a programmer on a large software development project. These database models become sort of the core backbone of the knowledge that applications are managing and using.

So the idea is that you take your application, and we're going to start really simple, and you have to draw a picture. Literally, you could spend course upon course learning about database normalization, but I'm going to distill it into one basic rule: never put the same string data in twice. Take my name, Charles Severance. If I have built a database well, you should be able to go into that database and find that the words Charles Severance, which is the name of a person, me, show up only once. What we do instead is connect things together, and model my name as a connection to the record that has my actual name in it, rather than putting my name in all these other places. So the idea is to pull duplicate data out and make only one copy of it. There is the users table, and in there is the user's name, and the user's name shows up only here. Everything else points to that particular user entry. That's the idea.

And so here is our first application. We are working as a startup. We just quit all of our jobs and we are going to build a music management application. I mean, what a great idea. Don't you think that'll be quite successful?
And so we have mocked up our music management application and figured out that this is what we want: we want to track people's tracks, know what artist, album, and genre they belong to, and have ratings, how many times we've played them, and how long they are. That's the data that our application needs to represent. We've done testing on this, and wireframes, and everyone loves this as a great user interface, so this is how it's got to look. But we're going to have billions and billions of tracks, so we want to come up with an efficient database to handle this. So we're going to look at each of the columns and ask ourselves: is this column part of one of our existing objects, our existing tables, or does this object have to go in a new table? Then, once we've defined those different objects, we connect the tables together and model the connections.

A little trick to make it easier on ourselves is to look for columns that have duplicate information vertically, specifically string information. Rating is just a number, like zero through five, so we don't worry too much about integers and numbers and that kind of stuff. But we do look for strings, and the problem here is that we've got these strings that occur many times. Those are the problems. Wherever there is replication of string data in the vertical dimension, we have to put those things in different tables. So we'll start out.

Now the first question that you have to ask yourself, when you're going to draw this picture of how the data is spread across multiple tables and connected together, is: what is the first table that you're going to write down? This is an interesting debate. Often people are sitting in a conference room, and people who have experience kind of know what to do.
Usually, if it's a multi-user system like a learning management system, the users might be the central concept, or perhaps the courses might be the central concept. This is a single-user system, so you can think: what is this application really about? It's not about people; it's one person. But it is about tracks. So we can say that the track is probably the most foundational notion of this application. Now that we've decided that tracks are the foundational notion, which of these columns are simply attributes of the track? The cheating way, the easy way, in this particular case is to look at the numbers. All these numbers, like this number and these numbers, not that one, just go along with the track, so we'll put those in. We've got the track title, rating, length, and count, and we put those in.

Then the question is what to do with the remaining things: we've got the artist, the album, and the genre. We've got some vertical duplication, so we're going to say: this track probably belongs to an album, so let's pull the album out into its own table. What would be the next thing that we pull out? We've got the track, with this taken care of and that taken care of, and now we've got the album. Well, albums belong to artists, so let's take out the artist. Then we'll pick where the genre belongs, and we'll just say that the genre belongs to the track, because there might be albums with more than one genre. Each album is not necessarily a rock album; it could have a rock track and a country track, et cetera. And so now what we've got is four tables, right?
We've got a track table, an album table, an artist table, and a genre table. And if we double-check, all of the columns that had vertical duplication in them now have their own little table. The next thing we'll do is show how we eliminate this vertical data replication, by showing how you represent the relationships that we just created inside of the database.

Now we're going to represent these relationships in the database. Again, what we're trying to solve here is this notion of database normalization, third normal form. There is so much theory, right? But in this lecture I'm just going to condense it down to: don't replicate string data, and use what are called keys, integer keys, to point at those things. So we assign each row an integer, and then we point from one row to another using those integers. We're going to add these special key columns to each of the tables, and the database will even give us help managing them. We still need to keep track of who the creator of an album is and which album a track belongs to. We've got to create these relationships, and we have to come up with ways to store them.

So the idea is that we're going to have a column in one table which is the key column, and we're going to call this the id column. This is a row; it might have many bits of data in it, but in this case it's just the name of an artist. This album is going to belong to an artist, and we're going to assign a number inside the database, so that Led Zeppelin is one and AC/DC is two. This key is called a primary key. Later, when we want to say that the Who Made Who album really was done by AC/DC, we put the number two in. The difference here is that instead of saying AC/DC in this record, we just put the number two, once we've established this number. So we
assign keys, and then we have these pointers that point back. That's how we model a relationship, with these small integer numbers.

There are three basic kinds of keys that we use. One is the primary key, and that is that little id column that is just a number. Once we give Led Zeppelin the number one, Led Zeppelin has the key one for the rest of that database. The logical key is the text that we use to look things up, so the name of the band or the title of the album. That's a logical key. And the foreign key is one of these keys that is really pointing to the primary key of another row. That's called a foreign key.

You might think that you want to use something like an email address as the primary key for a user table. But the logical key should always be separate, and there should always be a primary key that is an integer number, because things like logical keys do change. People do get new email addresses, and if you've got that email address as a foreign key pointing all over the place, it doesn't work out so well. That's why you use these small integer numbers that have no meaning outside the database. Sometimes if you're on a system and you see a URL with some number in it like 422 2016, you think: that probably turns out to be my primary key in their database. So sometimes you can see these primary keys in the URL, but they don't mean anything outside of that particular system.

Like I said, a foreign key is a key that is really pointing at a row in a different table. The album has a primary key of its own, but the artist_id points to a row in the artist table, as we will soon see. I have a naming convention in this lecture: I use id for the primary key, I use uppercase for the table names, and then artist_id says this is a key that points to the id key of the artist
table. And so that's what I do. You'll see that in all my stuff I use that. It's a convention; it's not something SQL forces you to do. But you will find, when you go to organizations and work on their databases, that these conventions are very important, so that I can do something and you can understand the rules by which I created it. You'll find this one used by some people, and you'll find completely different conventions, and that'll be okay. Whatever convention your organization uses, learn that convention. So now we're going to talk about how we put these keys in, and then how we actually make the connections from one row to another row.

So now that we know what a primary key, a logical key, and a foreign key are, we are going to actually start putting these together and creating tables that have these kinds of values in them. When we were done, we drew this picture that was a logical model of how our data would be spread across four tables and how those tables are connected. Now we have to take this and map it onto the columns that are needed in each of our database tables. Here's what we do. When we're going to build a track table, we add a primary key. We just add an id field to every one of these things, so that we have a place to store the sequence number of each particular row. We have logical keys; we've just marked those, and those are strings. Then things like rating, length, and count just go in here. And now we have to model a relationship. What we do is, in the table that the relationship starts from, we put one more column in, and this is the one I will name album_id. That is just an integer column that's going to record the album's id. So this might be 16, and then 16 goes in here. So there's one of these columns that's a foreign key that points to this, and that's why it's foreign. This is a
key that's not in the track table. This is a key in the album table that we're pointing to, and so it's a foreign key. That's what we have to do, and we just do it over and over again. We quickly convert that logical picture into one where every table has a primary key, and every place an arrow starts we have a foreign key: foreign key, foreign key, and foreign key. Then we mark these things as logical keys, and we'll see how we do that. So now we have a picture of exactly how we're going to lay these tables out and the fields that we need in them.

So we're going to do a CREATE TABLE statement, and I've got this CREATE TABLE statement sitting here. This one's going to be a little bit different. We're going to say CREATE TABLE Artist. The id field is INTEGER, and then we add all of this stuff to the column to tell the database additional things. It's a primary key, which means we're going to use it to look things up a lot. It's automatically incremented, which means the database is going to provide this number for us as we insert records. It's not allowed to be null, so it's not allowed to be empty, and it's supposed to be unique. Then the artist is going to have a name column, and that's just text. So let's do that.

We already have our Users table, and now we're going to do a CREATE TABLE in this Execute SQL tab. You can do that; that's totally fine. We have to get this right, and away we go. Now if I take a look at Database Structure, I've got the Users table that we were playing with before, as well as this Artist table. Let me go ahead and delete this Users table, just to say goodbye. Okay, so now we have the Artist table, and we take a look, and it's got an id and it knows all about this stuff.
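The Artist table described above could be created from Python like this. Treat it as a sketch rather than the exact statement on the slide: it assumes the column modifiers are written in the order shown, and it uses an in-memory database instead of the file opened in the database browser.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# id is the primary key; the database assigns it, and it may not be
# null and must be unique. name is the logical key, plain text.
cur.execute("""
CREATE TABLE Artist (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT
)
""")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = cur.execute("PRAGMA table_info(Artist)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(name, col_type, "notnull" if notnull else "", "pk" if pk else "")

conn.close()
```

The PRAGMA output plays the role of the Database Structure tab: it shows that the table knows id is an integer primary key that cannot be null.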
Okay, so that created the table, and we're going to keep doing this. The next thing that we're going to show is the foreign key. artist_id is just an integer. In some database systems, like MySQL and Oracle, you would put more stuff here to say this is a foreign key, blah blah blah, but in SQLite we keep it simple and just say that it is an integer column that's a foreign key. The Album table has a primary key, a foreign key, and then the title. So we'll go back and grab that text out of my little page, go back to Execute SQL, and run that CREATE TABLE. We'll continue with the Genre table, which just has an id and a name. You'll just copy and paste these, the whole PRIMARY KEY AUTOINCREMENT thing; you do that over and over again. So we'll go in here and run that one.

The last one we're going to do is the Track table. The only thing that's kind of weird about the Track table is that it's got two foreign keys: an album_id and a genre_id. Once you draw the picture, you just literally translate these things. It's got two foreign keys, and a primary key that's pretty much like all the other primary keys. Count is an integer, length is an integer, all that stuff. And now we've got it. If we take a look at our database structure, we see that Album, Genre, and Track are all set up, and these are the new columns that we just made with those CREATE statements.

So now let's insert some data. This first INSERT statement is kind of important to take a look at. It's INSERT INTO (by the way, the keywords can be upper or lower case), then the table name, then the columns. Now this table has two columns, id and name, but we told the database that id was auto-increment, so it's going to assign the number rather than make us assign it. We could make it be one, two, three ourselves, but we say: hey database, you're good at this. Why don't you make it one, two, three?
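Put together, the four CREATE TABLE statements and that first INSERT might look like the following sketch. The column name len for the track length and the exact column order are assumptions made for illustration; the key point is that the INSERT names only the name column and leaves id to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the lecture's database
cur = conn.cursor()

# The same primary key declaration is copied into every table.
KEY = "INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE"

cur.executescript(f"""
CREATE TABLE Artist (id {KEY}, name TEXT);
CREATE TABLE Genre  (id {KEY}, name TEXT);
CREATE TABLE Album  (id {KEY}, artist_id INTEGER, title TEXT);
CREATE TABLE Track  (id {KEY}, title TEXT,
                     album_id INTEGER, genre_id INTEGER,
                     len INTEGER, rating INTEGER, count INTEGER);
""")

# Only the name is supplied; the database picks the id.
cur.execute("INSERT INTO Artist (name) VALUES ('Led Zeppelin')")
conn.commit()

# List our tables, skipping SQLite's internal bookkeeping tables.
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
)]
artists = cur.execute("SELECT id, name FROM Artist").fetchall()
print(tables)
print(artists)
```

In this sketch artist_id, album_id, and genre_id are plain INTEGER columns, matching the keep-it-simple approach described for SQLite above.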
And so there is going to be a record that it adds for Led Zeppelin. Let's take a look at that. We'll go over to Execute SQL, insert Led Zeppelin, and run it. Now if I go to Browse Data and look at the Artist table, you will see that I put Led Zeppelin in, but this id field was auto-incremented, so it was put there by the database. And when we do the next insert, which is AC/DC, and take a look at the data, we'll see that AC/DC is two. If you're writing this in a program, you can get these numbers back from the database in your program. But I'm not writing this in a program, so I have to remember that one is Zeppelin and two is AC/DC. I'm going to keep myself a little cheat sheet to remember that, because everywhere else that we want to say Led Zeppelin, I've got to say one now; an artist_id of one means Led Zeppelin in those rows.

Now we're going to go back and take a look at the next one, and put the genres in. If you think about it, we're working from the leaves out, where Track will be the last table that we update, because I have to define the keys for things like Rock and Metal and Led Zeppelin and all those other things first. Again, even though the Genre table has two columns, id and name, we're only going to specify the name and let the database assign the id. I'm going to insert both of these using the semicolon trick: put a semicolon here and a semicolon there, and run that. If I look at Browse Data and look at Genre, it has assigned one to Rock and two to Metal, and I'm going to write that down: one Rock, two Metal. I should have done something like rock and country, because I can't even tell the difference between rock and metal. But whatever; my musical skill is not what's at issue in this class. So now we're going to
put an album in. The album is the first thing that has a foreign key. If you remember, the album points to the artist, which means it has a foreign key of artist_id. We have to specify this explicitly, because the system doesn't know which artist made Who Made Who. But we know that Who Made Who is AC/DC, and that's two, so we know what to put in artist_id. So we'll say INSERT INTO Album (title, artist_id). We have to know what this number two is, and of course, because we have our handy little cheat sheet, we do. We go over to Execute SQL, I'll put a semicolon here and a semicolon there, and run it. So now in the Album table we have this, and these ids were assigned. I still have to write down that Who Made Who is album one and album two is Led Zeppelin IV. That makes it even more complex, because the name of the album is the Roman numeral four. I'm sure I can figure that out.

Okay, so the next thing that we're going to do is insert the track records. If you think about it, the track has two foreign keys, and it's got a lot of other stuff: it's got the title, the rating, the length, and the count. But then we've got the two foreign keys, and so we have to know these numbers. This two and this one, and this one and this two, are the album and the genre. We're specifying the album that this track is from and its genre by those numbers. Again, we have to use this cheat sheet, but if this was a program, the program would know that album one was Who Made Who and album two was Led Zeppelin IV. This kind of stuff is easier for a program to keep track of than for us, but this is just so we can get through these few records, and that's why I rely so heavily on my cheat sheet. So here we are with all these numbers. The foreign keys are the tricky part here.
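The remark that a program would know these numbers can be sketched in Python, where the cursor's lastrowid attribute takes the place of the cheat sheet. This is a sketch under assumptions: the helper name insert is hypothetical, the column name len is assumed, and the track titles, ratings, lengths, and counts are illustrative values rather than data confirmed by the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
KEY = "INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE"
cur.executescript(f"""
CREATE TABLE Artist (id {KEY}, name TEXT);
CREATE TABLE Genre  (id {KEY}, name TEXT);
CREATE TABLE Album  (id {KEY}, artist_id INTEGER, title TEXT);
CREATE TABLE Track  (id {KEY}, title TEXT, album_id INTEGER, genre_id INTEGER,
                     len INTEGER, rating INTEGER, count INTEGER);
""")

def insert(sql, values):
    """Run one INSERT and return the primary key the database assigned."""
    cur.execute(sql, values)
    return cur.lastrowid

# Leaves first: artists and genres have no foreign keys.
led_zeppelin = insert("INSERT INTO Artist (name) VALUES (?)", ("Led Zeppelin",))
ac_dc = insert("INSERT INTO Artist (name) VALUES (?)", ("AC/DC",))
rock = insert("INSERT INTO Genre (name) VALUES (?)", ("Rock",))
metal = insert("INSERT INTO Genre (name) VALUES (?)", ("Metal",))

# Albums point at artists through artist_id.
who_made_who = insert(
    "INSERT INTO Album (title, artist_id) VALUES (?, ?)", ("Who Made Who", ac_dc))
led_zeppelin_iv = insert(
    "INSERT INTO Album (title, artist_id) VALUES (?, ?)", ("IV", led_zeppelin))

# Tracks carry two foreign keys: album_id and genre_id (illustrative rows).
tracks = [
    ("Black Dog", 5, 297, 0, led_zeppelin_iv, rock),
    ("Stairway", 5, 482, 0, led_zeppelin_iv, rock),
    ("About to Rock", 5, 313, 0, who_made_who, metal),
    ("Who Made Who", 5, 207, 0, who_made_who, metal),
]
cur.executemany(
    "INSERT INTO Track (title, rating, len, count, album_id, genre_id) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    tracks,
)
conn.commit()

keys = cur.execute(
    "SELECT title, album_id, genre_id FROM Track ORDER BY id"
).fetchall()
print(keys)
```

The variables led_zeppelin_iv, rock, and so on hold the same small integers that would otherwise be written on the cheat sheet.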
Everything else is really quite straightforward. So now I'm going to insert four records into my Track table and run that. If I go to Browse Data and look at my Track table, this column here, id, is the primary key of the Track table, and here are the two foreign keys. Now the interesting thing is that there is replication in these columns, but the numbers are what's being replicated, and that's okay. We went a long way just to avoid putting Led Zeppelin IV in twice. We could have made this a string, but making it an integer saves tons of storage and makes it super fast. Using these integers turns out to be one of the key things that makes databases super fast.

So we take a look at all this stuff, and we see that by using these little numbers we are in a sense pointing to rows in other tables. The foreign keys always point to an id. These foreign keys are out here, the primary key is up here, and they always point to a row in another table. So we have modeled all those relationships, and you will notice that in this entire database, Who Made Who appears only once, the word Rock appears only once, and the word AC/DC appears only once. We do have duplication in our data, but we are duplicating the relationships, i.e. these little integer numbers, not the data itself. In something this small it seems irrelevant, but if you have hundreds of millions or billions of records, it's very, very relevant. So the next thing we're going to do is take a look at how you reconnect all this stuff once we've blown it out using these foreign keys and hand-constructed all these relationships: how we bring it back together to show the data to the user.

So now that we've carefully constructed our relationships and the tables, we need to reconstruct the data to show our users. You can kind of see how you would go pull this stuff
together. But there's a wonderful capability in relational databases called JOIN that brings this all back together. We have done this for efficiency of storage, efficiency of scanning, et cetera, but we do need to traverse these foreign keys at times, and the database software will do this for us automatically. The JOIN operation is basically a way to specify in a SELECT statement that you want to pull data out of more than one table, and then to specify, using what's called the ON clause, exactly how you want that data connected.

So here we go. We have an Album table connected to the Artist table by the foreign key, and we want in effect to pull data from both: the album title and the artist name. So we're going to say SELECT, which is the same SELECT statement with a little different syntax. This is the list of fields, written as table dot field, so it's Album.title, comma, Artist.name, FROM Album. I always start where the little arrow starts, so it's Album joined with Artist; that is going to walk down this connection from Album to Artist. You don't say WITH; you just say JOIN and then ON, and then come the conditions under which that join is going to happen: when the album's artist_id, which is this column here, matches the artist's id. Think of matches as is equal to. So it only connects the rows when there is a match between these two tables. If we look at this, we see that this one matches this one and this one matches that one. So the join connects conditionally; it connects when the ON clause is satisfied. When this whole join runs, this is what we get. So you select all this stuff. Now, this is an abstraction. Are you writing a loop? Are you doing two nested loops? How exactly are you bringing all this data together again?
We don't care about that, because that's the beauty of SQL. That's the beauty of how we do this in a database. We can just run this command. So let's grab this command: SELECT Track.title, Genre.name FROM Track JOIN Genre, that exact query. The case of the keywords doesn't matter. We go over here and run this as SQL, and we get... oops, I went too far. Let's do this one instead, the one that selects the album title and the artist name. I have to add that one to my little cheat sheet; the next time you see the cheat sheet, it'll be right. So the title is coming from one table and the name is coming from another table. Okay, so that's one.

Here is something we can do that gives us a little more detail on that. You can think of the join as sort of spreading one table out and connecting it to the other table. What we're going to show here is exactly the same thing, but we're going to add these two columns so you can see where the match happens. This is one table, this is the other table, and these are the columns that match. This is where the ON clause is happening, right? We have taken this table, joined with this table, on these two things connecting with each other. In some variants of SQL this would even be a WHERE clause. So you connect these two rows, but only when those two numbers match. If we run this, again you can see this is where it connects.

Now, interestingly, we can see what the purpose of the ON clause is if we omit it. This is exactly the same as the previous query, except there's no ON clause. So it selects all four of those fields from Track joined with Genre. It has basically taken the Track table and the Genre table with a join but no ON clause.
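The two queries being compared can be run side by side in a small sketch. The Track table here is deliberately simplified to three columns, the ids are inserted by hand to keep the setup short, and the track titles are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified tables with hand-assigned ids (assumption for brevity).
cur.executescript("""
CREATE TABLE Genre (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Track (id INTEGER PRIMARY KEY, title TEXT, genre_id INTEGER);
INSERT INTO Genre (id, name) VALUES (1, 'Rock'), (2, 'Metal');
INSERT INTO Track (id, title, genre_id) VALUES
    (1, 'Black Dog', 1), (2, 'Stairway', 1),
    (3, 'About to Rock', 2), (4, 'Who Made Who', 2);
""")

fields = "Track.title, Track.genre_id, Genre.id, Genre.name"

# With ON: a track row is connected to a genre row only when the numbers match.
matched = cur.execute(
    f"SELECT {fields} FROM Track JOIN Genre "
    "ON Track.genre_id = Genre.id ORDER BY Track.id"
).fetchall()

# Without ON: every track row is paired with every genre row.
every_pair = cur.execute(f"SELECT {fields} FROM Track JOIN Genre").fetchall()

for row in matched:
    print(row)
print(len(matched), "rows with ON;", len(every_pair), "rows without ON")
```

With four tracks and two genres, the join without ON produces eight rows, and the ON clause keeps only the four where Track.genre_id equals Genre.id.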
So it's not filtering for matches. This is a match, this is a match, that's a match, and that's a match, but we don't have an ON clause, so the matchiness doesn't matter. You're going to get all possible combinations. Literally, if there were 10 rows on one side and 30 on the other, you would get 300 rows in that join. It would be all combinations, except that the ON clause reduces the combinations. You might think: whoa, this is really inefficient. I will say that was my first reaction when I first saw this. But it's not inefficient. That's the beauty of abstraction, that's the beauty of SQL: you say do it, and it just figures that out. Let me grab this, and you will see that we can run this one as well. That kind of shows you why the ON clause is important, because now we have a whole bunch of these rows, and the ON clause just filters them down. If we added the ON clause back in, it would show only the ones we showed on the previous slide. So that's why the ON clause is important. The JOIN is like all possible combinations of all pairs of rows between these two tables; ON is: oh, but only where these two things match. You might think that it's inefficient, but the ON clause turns out to be the way it becomes efficient.

Okay, so now we're going to do the same thing, where we take the track title and the genre and connect them together. We select from one table, join to the Genre table with an ON clause, and make those connections, and the only things we're going to look at are the title and the genre name. Then we run that, and we get this title and genre name. Now, the thing you'll notice is that for the first time we have replication of string data in the vertical dimension. That's okay, because the data is not replicated in the database.
The data is now replicated as a result of the JOIN. We are reconstructing what the user wants to see. The user originally, all the way back at the beginning, wanted to see the duplicate information on the vertical axis, and now we're reconstructing it. We didn't waste the space or performance in our database, but we still have to show it to them.

The next thing we're going to do is a monster: we are going to reconstruct across all four tables. You might think this is really hard, and sure, it's going to be a little tricky, but as long as you follow the naming convention and the naming convention makes sense, it's manageable. We're going to do a SELECT of the track's title, the artist's name, the album's title, and the genre's name, FROM Track JOIN Genre JOIN Album JOIN Artist. The joins follow the little arrows, right? Then the ON clause qualifies each of those arrows, saying when to follow the arrow, and then this becomes pretty easy. Track.genre_id, that's a foreign key, equals Genre.id, that's a primary key. I know it's a foreign key because I name it that way, and I know that it goes to the Genre table because I name it that way. Then Track.album_id is equal to Album.id, foreign key to primary key, and Album.artist_id is equal to Artist.id. After a while you can type these pretty fast, as long as you follow a naming convention and you know the naming convention. It looks like it's really hard to do, but after a while it's really just a pattern. So let's go ahead and run that one, and it will, assuming we've done everything right, replicate all the data.
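The four-table query can be tried out with a sketch like the one below. The tables are simplified, the ids are assigned by hand, and the track titles are illustrative assumptions; the SELECT itself follows the pattern just described, with one ON clause that qualifies all three arrows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified schema with hand-assigned ids (assumption for brevity).
cur.executescript("""
CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Genre  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Album  (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
CREATE TABLE Track  (id INTEGER PRIMARY KEY, title TEXT,
                     album_id INTEGER, genre_id INTEGER);
INSERT INTO Artist (id, name) VALUES (1, 'Led Zeppelin'), (2, 'AC/DC');
INSERT INTO Genre  (id, name) VALUES (1, 'Rock'), (2, 'Metal');
INSERT INTO Album  (id, artist_id, title) VALUES (1, 2, 'Who Made Who'), (2, 1, 'IV');
INSERT INTO Track  (id, title, album_id, genre_id) VALUES
    (1, 'Black Dog', 2, 1), (2, 'Stairway', 2, 1),
    (3, 'About to Rock', 1, 2), (4, 'Who Made Who', 1, 2);
""")

# Each condition pairs a foreign key with the primary key it points to.
rows = cur.execute("""
SELECT Track.title, Artist.name, Album.title, Genre.name
FROM Track JOIN Genre JOIN Album JOIN Artist
ON Track.genre_id = Genre.id
   AND Track.album_id = Album.id
   AND Album.artist_id = Artist.id
ORDER BY Track.id
""").fetchall()

for row in rows:
    print(row)
```

Every string that repeats down a column in the output is stored only once in its own table; the repetition exists only in the result of the join.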
So there's all kinds of vertical data being replicated now; every column has vertical replication. Again, it's not in the database. The SELECT and the JOIN are reconstructing the vertical data as it needs to be shown to the user. If you've been following along, probably a couple of hours later now: we started with a picture that was our mock-up of what we wanted our user interface to look like, and it had vertical replication. We said: oh, we can't put that in a database model. Then we carefully built a database model that didn't have the duplicated data, and then we said: ah, we've got to reconstruct it. So we use JOIN to reconstruct it. After all that, we ended up with a clean little model with four tables, all beautifully connected together, and then we had to join it all back together. So JOIN reconstructs it, and again the key is that the storage is efficient, the scanning is efficient, and we still use the JOIN to produce the output that we ultimately want, with all the vertical replication that our users really want to see. So, one more kind of relationship. That was called a one-to-many relationship; actually, that was three one-to-many relationships. The other major relationship is what's called a many-to-many relationship.

We're going to do some code walkthroughs, actually running some code. If you want to follow along, the sample code is in the materials on my Python for Everybody website, so you can take a look at that. The code we're going to look at is from the database chapter, and we're going to look at tracks.py. A lot of the lectures that I give in this database chapter are just about SQL; this one is really about SQL and Python, so I'll go through it in some detail. So the code that I'm going through is in tracks.
There's also tracks.zip that you can grab, which has these two things. It's got this Library.xml file, which you can export from iTunes if you have it, or you can just play with my iTunes library. This is also going to review how to read XML, so we're going to actually pull all this data. The XML that Apple produces out of iTunes is a little weird in that it's kind of key-value: you see key-value pairs, and it even uses the word dictionary. So it's like, I'm going to make a dictionary that has this, then a dictionary within a dictionary. To me this would be so nice if it were JSON, because it's really a list of dictionaries. This is a dictionary, then another dictionary, then another dictionary, and then the key for that dictionary. It's a weird, weird format, but we'll write some Python to be able to read it. So you export that from iTunes; you can use my file or you can use your file. It might be more fun to use your file. And so here's tracks.py. We're going to do some XML, so we import that. We're going to import sqlite3 because we want to talk to the database. Then we make a database connection, and once we run this you'll see that the file will exist. Right now, if I look in my tracks directory, that file doesn't exist, but what we'll see is that this is going to actually create it. Now remember that we have a cursor, which is sort of like a file handle.
It's really a database handle, as it were. In order to bootstrap this nicely, because this code is going to run and read all of Library.xml every time, we wipe out the tables each time; in later examples we won't wipe out the database every time. So here I'm executing a script, which is a series of SQL commands separated by semicolons. I'm going to throw away the artist table, album table, and track table, very similar to the stuff we covered in lecture, and then I'm going to do the CREATE TABLE statements, and I'm doing this all automatically. You'll notice this is a triple-quoted string, so this is just one big long string here. My editor happens to know that it's SQL; thank you, Atom, for that. So it creates all these things. It's not quite as rich as the data model we built, because there are no genres in here. It's artist, album, track, and then there's a foreign key for album_id and a foreign key for artist_id, so it's sort of a subset of what we were doing. When that's done, that actually creates all the tables, and we'll see those in a moment once we run the code. Then we ask for a file name for the XML; that's what that is. And I wrote a function that does a lookup. It's really weird, because if you look at these files, in this dictionary there is a key, and this really should have been a key-value pair, but instead there's this weird thing where the key for an object is inside of the object. So we're going to loop through all the children in this outer dictionary and find a child tag that has a particular key. You'll see how this works.
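To make that "key inside the object" layout concrete, here is a small sketch of one way such a lookup could be written with xml.etree.ElementTree. The SAMPLE string is a hand-made, heavily trimmed stand-in for an iTunes export, and the function name lookup and its exact behavior are assumptions for illustration, not a copy of the course file.

```python
import xml.etree.ElementTree as ET

# Hand-made, trimmed stand-in for an iTunes Library.xml export (assumption).
SAMPLE = """<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>369</key>
    <dict>
      <key>Track ID</key><integer>369</integer>
      <key>Name</key><string>Another One Bites The Dust</string>
      <key>Artist</key><string>Queen</string>
    </dict>
  </dict>
</dict>
</plist>"""


def lookup(d, key):
    """Return the text of the element that follows <key>key</key> in d."""
    found = False
    for child in d:
        if found:
            # The value is the sibling right after the matching <key> tag.
            return child.text
        if child.tag == 'key' and child.text == key:
            found = True
    return None


root = ET.fromstring(SAMPLE)
# Walk down three levels of <dict> tags to reach the per-track dictionaries.
all_tracks = root.findall('dict/dict/dict')
entry = all_tracks[0]

print(lookup(entry, 'Name'), '/', lookup(entry, 'Artist'))
print(lookup(entry, 'Album'))  # this key is absent from the sample
```

Returning None for a missing key lets the calling loop skip entries that lack a field it needs.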
And this was something I was going to use over and over again. The first thing we're going to do is just parse the string; this is the string, and this of course is an XML ElementTree object. Then we're going to do a findall, and this shows how findall works. We're going to go to the third-level dictionaries, because we want to see all of the tracks. We have a dictionary and a dictionary and a dictionary, and what we want is all of these guys right there, starting at Track ID. So we're going to get a list of all those: that'll be the first one, this will be the second one, because the findall says find a dict tag, then a dict tag within that, and then a dict tag within that. Then we'll print how many things we got. Then we loop: entry is going to iterate through each of these, and we'll get our name and our artist, "Another One Bites the Dust", Queen, and away we go, and then the next time through the loop we'll hit this one. Okay. So then we go through all those entries and check for a Track ID. That's this Track ID field; where are you hiding? Track ID. Say we don't have that.
We're going to continue. Then we're going to look up the name, artist, album, play count, rating, and total time. Here they are: play count and a lot of those things that we had in the sample lecture that I did, and we're going to look those things up. We're going to do some sanity checking: if we didn't get a name or an artist or an album, we're going to continue. We print them out, and then, remember how you have to get the primary key of a row so you can use it? The way we're going to do this is with an INSERT OR IGNORE. If you go up here, I said the artist name is UNIQUE, which means if I attempt to insert the same artist twice it will blow up, because I put this constraint on it. Except when I say INSERT OR IGNORE, that basically says: hey, if it's already there, don't insert it again. So what I'm doing here is INSERT OR IGNORE INTO Artist. This puts a new row into the artist table unless there's already a row for that name in the artist table. In this syntax right here, the question mark is the placeholder where this artist variable goes, and this is a tuple, but I have to put this comma in to force it to be a tuple; this is the way you write a one-tuple. Then what I need to know is the primary key of this particular artist row. This line may or may not have actually done the insert, so I need to know what the id for that particular artist is, and I do a SELECT id FROM Artist WHERE name equals the name. It either was already there or I'm getting it fresh and brand new. So for artist_id, I fetch one row and take the first thing in it, given that I only selected id. So this artist_id is going to be the id. Now I have the foreign key for the album row, right?
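The insert-then-select pattern described above can be sketched in a few lines. This is a minimal illustration, assuming an Artist table with a UNIQUE name column as in the lecture; the helper name artist_id_for and the sample artist names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE Artist (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
)""")


def artist_id_for(name):
    """Hypothetical helper: return the id for name, inserting it if needed."""
    # Does nothing (and raises nothing) when the name is already present.
    cur.execute("INSERT OR IGNORE INTO Artist (name) VALUES (?)", (name,))
    # Whether the row is new or old, this finds its primary key.
    cur.execute("SELECT id FROM Artist WHERE name = ?", (name,))
    return cur.fetchone()[0]  # only one column was selected, so take [0]


first = artist_id_for("Queen")
second = artist_id_for("AC/DC")
again = artist_id_for("Queen")   # same artist a second time
count = cur.execute("SELECT COUNT(*) FROM Artist").fetchone()[0]

print(first, second, again, count)
```

Note the trailing comma in (name,): without it the parentheses would just group a string rather than build a one-item tuple.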
So now I'm going to insert into Album the title and the artist_id. This is the foreign key to the artist table; I've got this value that I retrieved just moments ago, and I've got the album title. This also is INSERT OR IGNORE, because if you look, I have UNIQUE on the album title. Yep, UNIQUE on the album title. So if the album is already there, that will do nothing. It doesn't blow up; OR IGNORE says don't blow up, just do nothing, because this next line is going to select it, and I grab the album's id, which becomes the foreign key, for either the existing row or the new row. Then I'm going to INSERT OR REPLACE, and what this basically says is: if the unique constraint would be violated, this turns into an update. Not all SQL dialects have this, but SQLite has INSERT OR REPLACE. Some things we do are totally standard: this SELECT statement is a totally standard part of SQL, and a plain INSERT is totally standard, but INSERT OR REPLACE and INSERT OR IGNORE are not totally standard. That's okay; it works for SQLite, which is what we're using. So we have the title, album_id, length, rating, and count, and then we have a tuple that supplies all that stuff. And of course the title is unique in the track table as well. So we've inserted that. The clever bit here is dealing with both new and existing names in these three lines, and we see that pattern twice here. Okay, so there's not much left to do except run this code. Hopefully it runs: python3 tracks.py, and Library.xml. Whoosh. Okay, so we found 404 of those dictionaries, the three-deep dictionaries, and now it's starting to insert them, insert them, insert them. And we can take a look: we do an ls -l, or dir on Windows.
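The "turns into an update" behavior of INSERT OR REPLACE is easy to check in a small sketch. The Track table below is a simplified assumption based on the columns named in the walkthrough, and the values are made up. One detail worth knowing: SQLite implements REPLACE by deleting the conflicting row and inserting the new one, so the effect resembles an update but the row's id can change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Simplified Track table (assumption); title carries the UNIQUE constraint.
cur.execute("""
CREATE TABLE Track (
    id       INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title    TEXT UNIQUE,
    album_id INTEGER,
    len      INTEGER,
    rating   INTEGER,
    count    INTEGER
)""")

sql = """INSERT OR REPLACE INTO Track (title, album_id, len, rating, count)
         VALUES (?, ?, ?, ?, ?)"""

# First insert creates the row.
cur.execute(sql, ("Black Dog", 1, 297, 80, 10))
# Same title again: the UNIQUE conflict is resolved by replacing the old row.
cur.execute(sql, ("Black Dog", 1, 297, 100, 11))

rows = cur.execute("SELECT title, rating FROM Track").fetchall()
print(rows)
```

There is still exactly one row for the title, and it carries the values from the second statement.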
We'll see that we made a track database. We extracted the data from this library and made a track database, and we have all these foreign keys. So let's go and take a look in the SQLite browser: File, Open Database, trackdb.sqlite. Come on up; where'd you hide? I've got it minimized. There you go. Let's look at the database structure. We have Album, this is the structure, Artist, and Track. We don't have Genre. And this is all just like when we did it by hand, except Python did all this work for us. If we take a look at the data and we start from the outside in, we have the artist names and their primary keys. Then we have the albums, and we have the artist IDs; see the artist IDs, how nice those are. So we have the primary key here and the foreign key there, and then we have the title. And if we get to the track, we have the album_id, and away we go. So if I were clever, I'd be able to type some SQL. Oh great; if I were smart, I'd have had this in a paste buffer. So: SELECT Track.title, Album.title, Artist.name, I think. Artists have names and albums have titles, yes. Okay, so I can do that FROM Track JOIN Album (oops, Album; let me make that a little bigger) JOIN Artist. I need an ON clause, and I can say Track.album_id equals Album.id (notice how I know the names because of how I named these things) AND Album.artist_id equals Artist.id. This is so great when you use a naming convention. Really, I think that might work. So let's just see what we get when we type that into the SQL box here: Execute SQL, run. Yay, I got it right the first time! So that's basically my nice little joined-up track list. Oh, I'm so happy that I got that right the first time. Okay, so you can play with this yourself. Play with this tracks code, maybe make an export of your own iTunes library and run it with that. I hope that you found this particular bit of code useful.
Okay. Cheers.

So our last major topic is called many-to-many relationships. Up till now everything that we've done is what's called a one-to-many relationship. That is, there are many tracks associated with one album, there are many albums associated with one artist, and there are many tracks associated with one genre. As you look at data models, they put little labels on each arrow that tell you which end of the arrow is the many and which end of the arrow is the one. In this case the foreign key is pointing: there are many rows over here that point to one row over here, so it's a many-to-one relationship. There are various ways to draw it; sometimes I'll put two arrows at this end and one arrow at that end. But whatever it is, the kind of thing we've been showing is a many-to-one relationship, and that's probably the most common thing. But there are times when you just can't model things with a one-to-many relationship. If you have a mother and children, well, that's a many-to-one relationship, and that works fine. But sometimes you have a many-to-many relationship. Take books: one book has many authors, and each author has many books, so you don't have a one side. There's no one. So you end up building a table that I call a connector table; they call it a junction table on Wikipedia. We need a little table that allows us to break a many-to-many relationship into, in effect, two many-to-one relationships and a connector table. And so this is a connector table. You could think of it this way: there are many, many links here, but we don't have a way to model the many over here to the many over there. So what you do is basically say, there are the many that go to the one here and the many that go to the one there, and in the middle you create that many-ness that you want. So it's probably just as
easy to look at a sample of this. Let's imagine a learning management system where you're taking a class; some people are teachers and some people are students, and many students are members of many classes. A student can be part of many classes, and a class has many students in it, so you can't really find the one end. So what we do is make a table called membership, and in that membership table we often don't put a primary key of its own at all; we simply put in two foreign keys. And if we're going to put in a uniqueness constraint, we put the combination of the two foreign keys as the uniqueness constraint. So we say there could be duplicate user IDs and duplicate course IDs, but each user ID and course ID combination has to be unique. So you can make UNIQUE cover more than one column. If you imagine a course table and a user table, the user has an ID, a name, and an email, and the course has a title and an ID, and then we have this little table that is just the connector table that points out to each of them. We can expand this membership, so let's take a look at how that works. We're going to create some tables. These are very classic tables, because these are the one end of it. The user has a primary key, a name, and a logical key, the email. There's a primary key for course, and then there's the title, which is text. We add this UNIQUE to indicate that it's a logical key; we're not going to allow ourselves to put any duplicates in here. Now the connector table here is the table Member, and it has two foreign keys, user_id and course_id. And you can also model some data here.
So I'm going to model role, where zero means student and one means instructor. Then I'm going to indicate that the primary key, or uniqueness constraint, is the combination of the user_id and the course_id. When we declare the primary key, it limits our ability to insert duplicates, but it also allows the database to optimize its scanning, because it knows that that combination is always unique. It can organize its disk structure and storage structure to look things up more efficiently, knowing that once it has found a user_id and course_id combination, it doesn't have to look any farther, because they're unique. So all of these constraints that we add speed things up, save storage, and make things more efficient, but in ways where we don't always know exactly how it happens. So let's go ahead and make these guys. I think I will start with a new database; I'm going to call it LMS, for learning management system. No, I don't really want to do that one. I'm not going to create the table here; I'm going to do everything in SQL. Let me see if it's in my cheat sheet. Nope, that's not in my cheat sheet, so I'll have to fix the cheat sheet again for you; by the time you see the cheat sheet, all these things will be in there. So I'm going to go in here and grab CREATE TABLE User. Actually, I'm going to grab them all, watch this: highlight all of these, go over to the SQLite browser, and blast them all in. Then I'll put a semicolon at the end of each one of the statements, and I want to run them. So does it look good? Yep. Yep. Yep.
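Here is a minimal sketch of those three tables, created from Python rather than pasted into the browser. The column names follow the lecture's description (user_id, course_id, role); the exact column types are assumptions. The last few lines show what the composite primary key buys us: the same user can join a second course, but the same user and course pair cannot be inserted twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE User (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name  TEXT,
    email TEXT UNIQUE          -- logical key
);
CREATE TABLE Course (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title TEXT UNIQUE          -- logical key
);
CREATE TABLE Member (
    user_id   INTEGER,         -- foreign key to User.id
    course_id INTEGER,         -- foreign key to Course.id
    role      INTEGER,         -- 0 = student, 1 = instructor
    PRIMARY KEY (user_id, course_id)
);
""")

insert = "INSERT INTO Member (user_id, course_id, role) VALUES (?, ?, ?)"
cur.execute(insert, (1, 1, 1))   # user 1 teaches course 1
cur.execute(insert, (1, 2, 0))   # same user, different course: allowed

try:
    cur.execute(insert, (1, 1, 0))   # same (user, course) pair again
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

member_count = cur.execute("SELECT COUNT(*) FROM Member").fetchone()[0]
print(duplicate_rejected, member_count)
```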
So I've got Course, I've got Member with two foreign keys, and I've got User, so that all looks good. Okay, so now we have to insert some data, and we're going to insert from the outside in. We just put in the name and email, and the id will be automatically assigned for the users. We do the same thing for the courses, and their ids will be automatically assigned too. So let me just grab all this stuff and go into SQL; it has the semicolons at the end already for me, thank you very much. Now I'm going to run it. If I take a look at my data now, I've got primary keys for the courses, I've got primary keys for the users, and I've got nothing in the membership table. And of course I have to remember what these values are: Jane is one, Ed is two, and Sue is three, right? And Python is one, SQL is two, and PHP is three. So when I go into membership, I've got two foreign keys here and a role, and they just have to be right for each course and person combination. It's a little tricky to figure all this stuff out, but again, these are just numbers. If you look at these numbers, user_id, course_id, role: user_id one is in course one as the teacher, user_id two is in course one as the student, et cetera, et cetera. So I'm making these connections by just putting these little numbers in, and once again, conveniently, I have all my semicolons perfectly in place. So I go to SQL, and I run that, and then I look at my membership data, and there it is: two foreign keys and a bit of data modeled at the connection. That's the way we say it: the role is modeled at the connection. So now that we've built all this stuff up, we can write some queries that take a look at it. What we're going to do is look at who's in what course and what role they have, and we're going to sort this in a nice way.
So let's just take a quick look at the code we're writing. We do a SELECT across three tables: the user name, the member role, and the course title. In effect, we're not showing any of the foreign keys or the primary keys. We go from the User table joined to the Member table joined to the Course table. This is pretty easy to write once you know there are three tables you want to go across. The ON clause is also very easy to write: it models each of these connections, where the member's user_id is equal to the user's id, and where the member's course_id is equal to the course's id. So we combine all three of these tables together, but we only keep the rows where the keys match. Now, role doesn't participate in the join, but we're going to print it out. And we order by the course title first, then the member role second, and the name third. So let's run that. So we've reconnected it: Ed is the teacher of the PHP class, and Sue is a student in the PHP class. Jane is the teacher in the Python class, and Ed and Sue are students in the Python class. Ed is the teacher in the SQL class, and Jane is a student in the SQL class. So we have many students in many classes, and we have modeled that, but we modeled it with this sort of connector table. If you look at a piece of software that I've written called Tsugi, which is a standalone learning management system that's built with learning tools, you will see membership in there. We have a user table, we have a context table, which is in effect the course table, and then we have a membership table. You look, and here are these foreign keys. It's kind of like: that's the many side.
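Going back to the three-table query over User, Member, and Course, here is a minimal sketch of it in Python. The names and memberships match the results read out in the lecture; the email addresses are placeholders, and the DESC on role is an assumption inferred from the output, where the teacher is listed before the students in each course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE User   (id INTEGER PRIMARY KEY, name TEXT, email TEXT UNIQUE);
CREATE TABLE Course (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE Member (user_id INTEGER, course_id INTEGER, role INTEGER,
                     PRIMARY KEY (user_id, course_id));

-- Placeholder email addresses; ids are assigned 1, 2, 3 in order.
INSERT INTO User (name, email) VALUES
    ('Jane', 'jane@example.com'), ('Ed', 'ed@example.com'),
    ('Sue', 'sue@example.com');
INSERT INTO Course (title) VALUES ('Python'), ('SQL'), ('PHP');

-- (user_id, course_id, role) with role 1 = instructor, 0 = student.
INSERT INTO Member (user_id, course_id, role) VALUES
    (1, 1, 1), (2, 1, 0), (3, 1, 0),
    (1, 2, 0), (2, 2, 1),
    (2, 3, 1), (3, 3, 0);
""")

rows = cur.execute("""
SELECT User.name, Member.role, Course.title
FROM User JOIN Member JOIN Course
    ON Member.user_id = User.id AND Member.course_id = Course.id
ORDER BY Course.title, Member.role DESC, User.name
""").fetchall()

for name, role, title in rows:
    print(title, 'instructor' if role == 1 else 'student', name)
```

None of the primary or foreign keys appear in the output; they are only used to decide which rows belong together.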
That's the one side: many to one. And so this is, in effect, a many-to-many relationship between these two, but it's modeled as a series of many-to-one relationships. You see this all the time in all kinds of things, wherever membership or other kinds of many-to-many connections are necessary. So with all that, there's so much to learn. It's both easy and complex at the same time. It's easy when someone shows you how to do it, and at some point you will learn how to build database models and you'll realize, oh, it wasn't so bad. It takes a while to get used to them. This really was just a quick walk. The bottom line is that what we just did might seem like, well, that's nice, but do you really have to do that? And the answer is: if you're going to scale at all, you absolutely have to, because you simply can't read and write data sequentially. You can't update one little piece of data in a file by reading all the way through and then writing a new copy of the file. That could take seconds, and in an online system you get a hundredth of a second to do something like that, and databases make it so that it happens in a thousandth of a second. So ultimately you simply have to take advantage of this. If you're going to modify data, you just can't use flat files. You can read data from flat files, but even if you're only going to read, if the data is big it slows down terribly. So it might seem like there's a trade-off, and you could debate whether this is worth it, but if you're going to deal with a lot of data, you've got no choice. It's really not as much of a trade-off as you think. So this has been a quick romp through databases. We talked a little bit about indexes. There are constraints: we talked a little bit about the NOT NULL stuff, and we've talked about uniqueness.
That's a constraint. Another whole area is what's called transactions, and that's the locking of little areas. You can read an area, then lock it, and then update it, to make sure no one else reads it in between. So everyone else either gets the version from before you changed it or the version from after you changed it. That's how you make sure that you can't do things with bank account balances that get you in trouble. So there's a lot to SQL. It's really fascinating; SQL is a fascinating thing to use and learn and performance-tune and enjoy. So relational databases are cool, and this gets us started. The big thing is: don't allow vertical replication of string data. Pull that out into a separate table, establish a primary key, and then have foreign keys that point to that primary key. It's not just about how much data you store, although it is sort of a way of compressing data. You might think strings take no space, but they do; numbers take a lot less. It's both how much data is stored and also how much data has to be scanned. And that's the way joins work.
That's part of the magic of why Oracle is such a successful company. It's a bit of an art form, and it's something that you can work on your whole life and always get better at.

Hello and welcome to our code walkthrough on the roster code. The learning objective of this is to build a many-to-many table. The idea is that, just like we talked about in lecture, we're going to have a set of users, we're going to have a set of courses, and then we're going to have a connector table, or many-to-many table, that basically has two foreign keys. We are going to use INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE as the way to get automatic assignment of the primary keys in the user table and the course table. Then the name, which is like a logical key, and the course title we're going to mark as UNIQUE, and we're going to take advantage of that in a moment. What UNIQUE means is that if you try to insert the same string into this column twice, like "chuck", then it's going to fail the second time, because it's going to refuse to create a new record. If we take a look, we're going to get our roster data from this sample JSON, which is just an array of arrays: this is the person's name, the class that they're in, and whether they are a teacher or a student. We're going to read that, so we need the json library and the sqlite3 library. We make a database connection and we get a cursor. The cursor is kind of like the file handle: you send SQL commands to the cursor, and then you read the cursor to get the data back. The connection can create more than one cursor, too, so you can have more than one set of commands in progress, but the cursor is generally like the file handle to the database server. And we are going to execute a big script, and you'll notice this is a triple-quoted string that goes all the way down to here. And so
this, some people would just give to you in a text file and have you cut and paste it and then go run it in your SQLite browser to create the tables. But that's okay, because we're going to set it up right here. It will either create or reconnect to the existing file named rosterdb.sqlite, and if I look where I'm at and do an ls, we find that that file is not there. So the first time I run it, it's going to create it. But I want this to start fresh every time, so I'm going to wipe out the tables if they exist. That way you can run it over and over and over again in case you make a mistake. Now, hopefully I don't have a mistake in this. So we're going to drop three tables and we're going to create three tables. Here we create the table that has two foreign keys, user_id and course_id, that are sort of going outwards from the member table, and then we model a little bit of data at the connection, the role. Again, this is straight from the lecture. And the primary key is actually a composite primary key: it forces the combination of user_id and course_id to be unique. There can be many user_ids and many course_ids, but only one row for any particular combination of values for user_id and course_id. That's what we're basically saying: you can be a member of a course, but you can only do that once; you can't be a member of the same course a bunch of times. So we're going to... oh, that should be roster_data_sample. That's okay; oops, fix a bug, save that. roster_data_sample. That's just this file, and it's really just an array, and then each row is an array, and it's a way for us to get this roster data in. Once we do json.loads we're parsing it, and then this is going to be an array of arrays. And so: for entry in json_data. So entry is going to be one of these things.
So entry itself is a row. Entry sub zero is the name and entry sub one is the title: name, that's the sub zero, and title, that's the sub one, of the particular entry that we're looking at. We print it out, just for yucks, as a tuple; that's what the two sets of parentheses are, this inner thing is a two-tuple. Then we take the person and do an insert, and this is INSERT OR IGNORE. What the OR IGNORE means is: if this insert would cause an error, please don't blow up; just ignore that I tried to insert it. This is our trick, and it's a beautiful trick. It's like a gorgeously beautiful trick. Here, if we insert the name "chuck" twice, OR IGNORE just means that nothing happens the second time, because it's already there. So if it's not there, it'll put it in, and the UNIQUE will guarantee that it only goes in once. In effect, we always attempt to insert it, and if it has been put there once, then it's all set. So this INSERT OR IGNORE is a super powerful mechanism; I use it all the time. We have a placeholder in the form of a question mark. One of these days we'll have two things that we're asking for; as a matter of fact, here it is.
There's a tuple down here, but this one is a tuple with one item in it, name. That name is then substituted in for the question mark, while avoiding SQL injection. So this runs, and it may or may not insert a new record: if "chuck", or whomever the name is, is not there, it will give us a new record. Then we are going to get back the id. The name is the logical key and the id is the primary key, and that primary key is going to be automatically constructed for us, so we need to know what it is. So we say SELECT id FROM User WHERE name equals that same name. That's "chuck", and so that gives us one. Then we fetch one record from the cursor, because that's a SELECT and it gives us back rows through the cursor. Hopefully there's only one record there, because the name is unique. I could put a LIMIT 1 in there, but that would be redundant, because the name is a unique key. The sub zero just means that, if there were more than one thing that I was selecting, which we'll see in a bit, sub zero is the first thing. So this is going to give us the integer user id that was assigned. Or if we're coming through later for "chuck" again, it will be the old one. So this line is "insert it if it doesn't exist", and this line is "get the newly created id field or the original id field". Part of why this works is having both a logical key and a primary key: the primary key is auto-generated, but the name is a logical key and it's unique. That's our trick to get that assigned value. Before, we just looked at it in the user interface of the SQLite browser and wrote it down, but this is how we do it in code. We need to know what that key is, whether it was new or not. And then we do the exact same pattern for the course, except we're inserting the course title.
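Putting the pieces so far together, here is a hedged sketch of the loop: parse an array of arrays with json.loads, then run the insert-or-ignore and select pair once for the user and once for the course. The JSON string, the names in it, and the ids list are made up for illustration; the real walkthrough reads its data from a file and goes on to fill the member table.

```python
import json
import sqlite3

# Made-up roster in the same shape: [name, course title, role].
ROSTER_JSON = """[
  ["Charley", "si110", 1],
  ["Mea",     "si110", 0],
  ["Charley", "si106", 0]
]"""

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE User (
    id   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name TEXT UNIQUE
);
CREATE TABLE Course (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title TEXT UNIQUE
);
""")

ids = []  # collected only so the result is easy to inspect
for entry in json.loads(ROSTER_JSON):
    name, title = entry[0], entry[1]

    # Logical key -> primary key, for the user.
    cur.execute("INSERT OR IGNORE INTO User (name) VALUES (?)", (name,))
    cur.execute("SELECT id FROM User WHERE name = ?", (name,))
    user_id = cur.fetchone()[0]

    # The exact same pattern, for the course.
    cur.execute("INSERT OR IGNORE INTO Course (title) VALUES (?)", (title,))
    cur.execute("SELECT id FROM Course WHERE title = ?", (title,))
    course_id = cur.fetchone()[0]

    ids.append((name, title, user_id, course_id))

conn.commit()
user_count = cur.execute("SELECT COUNT(*) FROM User").fetchone()[0]
for row in ids:
    print(row)
```

Charley appears twice in the input but gets a single row and a single id, which is exactly what the membership rows need to point at.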
So that's no big deal And uh, and so we're going to get the user id course id And then what we're going to do is we are going to insert a replace So this is it basically if they're remember that this user id course id combination Is the primary key for this member table If there is a duplicate If this is this combination is already there this becomes effectively an update statement And we have these two number values now. What's missing here is the role is not there Um, and so user id course id This is the sql bit And now we have a tuple with two items in it and that's because we have two question marks And then we commit it and as I mentioned before Sometimes you want to commit every time through the commit is Turns out that that these things are less costly, but that's because it's not always writing all the way to disk Whereas when you enter the commit it's going to go and write everything to disk Pause until it's complete and then your program doesn't continue. So sometimes we don't run this every single time through Okay, so let's just go ahead and run this the only thing we're going to see is the output of the name and the title as it's running so If I do python three roster dot pi Hopefully I can hit enter. So you'll notice by the way that this sql light now exists, right? And it has no data in it. So let me see if I can open this database and see it So you see that there's no data So so we're the code we've run this code In effect up to this point. So we've done all the create tables and all that stuff And so the create tables are there so all this data is here It did it We haven't started putting any data into it yet because so we look at browse data We're not finding anything in here. Okay There's no data to browse Now hopefully we won't have locked ourselves because we are sitting Right here and when I hit enter over here, then it's going to go It's just going to run it really fast. 
So I'll hit enter. It reads it, and so it inserted all of those things, and now the database has been changed. If I hit refresh over here, we will see in the User table that it assigned user IDs, right? That column is auto-assigned. We will find in Course that those course IDs are all auto-assigned. There are the courses, and there are no duplicates, because the title is unique, right? And so these are the newly created things. But then Member is user_id, course_id, and again the primary key, as it were, the unique constraint slash primary key, is the combination of these things. And I haven't put anything in role. If you scroll through these, you see all of the users who are members of the courses that they're part of. Okay, so there you go. I'll leave it up to you to come up with the join, and I'll leave it up to you to figure out how to put the role in, but I just wanted to give you a bit of a walkthrough of this code base, and in particular the tricks: the uniqueness keys, the auto-increment keys, the logical key uniqueness, and then this composite primary key. And then the trick of INSERT OR IGNORE, and the quick SELECT that comes right afterwards to get the newly generated id or to get the old id. And then INSERT OR REPLACE, which is a combination of an insert and an update. So I hope you found this example useful and can apply it and basically create many-to-many tables.

We are doing some code walkthroughs. If you want to follow along with the code, you can download the source code from the Python for Everybody website. Okay, so the code we're playing with today is twfriends.py, and this is a step beyond the simple twspider.
It is a restartable spider, but we're going to data model things a little bit differently We're going to have two tables and we're going to have a Many to many relationship except that it's sort of a many to many relationship between the same table, which is okay um friends is a twitter friends are a directional relationship and so uh So we start out here in twfriends dot py remember that the file hidden dot py I'll show it to you, but i'm not going to open it because i've got my keys and secrets in it So this hidden dot py file you got to edit that and you got to go to apps dot twitter dot com and get your keys and put them in there Otherwise these things won't work But if you have twitter and you set your api keys up and you put them in hidden dot py Then all these things will work. It's kind of fun actually and impressive Not hard to do actually So The twitter url. That's my library that reads hidden dot py and augments the url and does all the oauth stuff json and ssl because twitter doesn't i mean because python doesn't accept any certificates even if they're good certificates, so we kind of crush that Here's our friends list that we're going to hit. 
We're going to make a database, friends.sqlite. Now here we're doing CREATE TABLE IF NOT EXISTS. What this is really saying is: I want this to be a restartable process and I don't want to lose the data. We're starting out, and we do not have any SQLite files, so this is going to create the database and create these tables. But the second time we run it, we're not going to recreate the tables. We're going to be able to restart this, because we're going to run out of rate limit before we finish. So we just have to wait however long the rate limit takes to reset, and we'll watch the rate limit go down. And so we're going to have a People table, and we're going to have an id as a primary key, and the name. The name is going to be unique, and then there's whether or not we've retrieved it, and that's kind of from a previous one. But then there's the who-follows-who: the from_id and the to_id, and so this is directional. And we're going to put a uniqueness constraint in, just like we do in many-to-manys, that basically says the combination of from_id and to_id has got to be unique.
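A small sketch of the two tables just described, assuming the column names mentioned in the walkthrough (id, name, retrieved, from_id, to_id). It uses an in-memory database rather than friends.sqlite, and the create_tables helper and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the lecture uses a file so data survives restarts
cur = conn.cursor()

def create_tables():
    # IF NOT EXISTS makes this safe to run on every start of the program.
    cur.execute("""CREATE TABLE IF NOT EXISTS People
        (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)""")
    cur.execute("""CREATE TABLE IF NOT EXISTS Follows
        (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))""")

create_tables()
cur.execute("INSERT INTO People (name, retrieved) VALUES ('drchuck', 0)")
create_tables()  # simulated restart: no error, and the existing row is kept
people_count = cur.execute("SELECT COUNT(*) FROM People").fetchone()[0]

# from_id may repeat and to_id may repeat, but a given pair is allowed only once.
cur.execute("INSERT INTO Follows (from_id, to_id) VALUES (1, 2)")
cur.execute("INSERT INTO Follows (from_id, to_id) VALUES (1, 3)")
try:
    cur.execute("INSERT INTO Follows (from_id, to_id) VALUES (1, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(people_count, duplicate_rejected)
```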
We don't allow ourselves to Put duplicates of the combination So from id can be one in many records and two id can be one in many records But one one is only allowed once And this is the crud we have to do to convince python to accept the twitter certificate And so this is similar to some of the other stuff that we've done We're going to enter a twitter account or quit and if we hit enter by itself Then we will actually go and retrieve a record that was not yet retrieved And now we're actually pulling out two values id and name And so we will we will grab fetch one is going to give us a two tuple basically And we're going to store that in id and account of course That's like this is this is coming back with a two tuple First of which is the id from the database limit one means we're only going to get one of these Or zero of these if there are zero of these that means there are no un retrieved twitter accounts retrieved equals zero Well, you'll see in a second that all the new Accounts we put in are the ones for which we haven't retrieved and again given that our rate limit We want to know which ones we've retrieved Okay, and And so what we're going to do next is we're going to check to see if the person that We just checked which means the length of the account is greater that we just were entered We're going to check to see if they're already there, okay And we're going to select id from people where name equals. 
So that's the one we just entered And we're going to fetch one and grab the first thing because we only we only got one thing in the select statement here um And if this person that we just asked to see Uh is not in the table that means this is going to fail We're going to do an insert or ignore this or ignore is kind of redundant because we just checked to see if it was there But we'll put that in just to be safe And we're going to put the name in for as the new the new account that we're looking at And we're indicating that retrieved is zero so that we will we will know that we haven't retrieved it yet You'll see that we'll update that in a second. We commit it so that later selects will see this so that So you got to do the commit This later select wouldn't see the one we just inserted And we're going to ask how many rows were affected and if it's not equal to one Then we're going to complain about we inserted it and we are going to do this thing. We're going to ask. Hey Remember there was an id up there Do do do Right here id integer primary key and we did not insert this here But we want to know what that id is and every time I was showing you that in lectures I was saying it's really easy in python to do this and that's what we're saying This cursor did the insert But one of the things happens is after the insert we're going to grab the last row id Which is the primary key that was assigned by sql Okay, and so that means that one way or another coming through this code here in line 45 one way or another We're either going to know the id of the user that was there before or we just inserted one And so we're going to know the primary key of the current user and you'll see why we need that So id is the primary key of the current user that we entered right here Okay And now we're going to do is do the twitter url augment with the oauth and all the keys and the secrets and hidden dot py Instead we're going to go through let's count 1000 let's go count But what the 
heck let's go 200 up to 200 friends Save no, let's do 100. We'll keep it that way and then we're going to retrieve it and We're retrieving the account. We're not going to print the nasty url out. We could Then we're going to open the url with the connection and then we're going to read that and we're going to get the utf8 data from this and then we're going to decode that And we're going to have the unicode data so the data string is a internal python string With all that data representing all the wonderful characters And of course, we're going to ask url open to give us back the headers As a dictionary using this call and we can see what the how many we have left For the remaining right what's the remaining rate limit that we have Okay, and so then we're going to parse the data with json load s if uh Oh wait, I need to continue in here continue Okay, save um If we are going to parse this data we'll print it out, right? So that means that this this died which means it's not syntactically correct json basically And who knows if we're ever going to see that but at least when it blows up it'll print this data out We'll have to catch it and then it'll continue. Actually, I'll make this a break Because if that's blowing up that bad we should quit Now We don't I don't yet know what happens when this rate limit says you can't have it And so but I do know that I expect when it's successful that there will be a key of users in this outer dictionary that we're going to get and if this outer dictionary that we're going if we if users is Not in the parse dictionary Then I'm going to dump out this data so that at least I can debug what happens when I've got some broken json, so the difference between um This code this code is going to fail when the json is syntactically bad meaning a curly brace isn't right or whatever This code will trigger when I get good json, but I don't have a user's key in it. 
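The two failure cases described here, syntactically bad JSON versus good JSON without a users key, can be sketched roughly as follows. The function name parse_friends and the printed messages are assumptions for illustration; only the json module calls are standard.

```python
import json

def parse_friends(data):
    """Return the list stored under 'users', or None if the payload is unusable."""
    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        # Case 1: the text is not valid JSON at all (a brace is missing, etc.).
        print("Unable to parse json")
        print(data)
        return None
    if not isinstance(js, dict) or "users" not in js:
        # Case 2: valid JSON, but not the structure we expected.
        print("Incorrect JSON received")
        print(json.dumps(js, indent=4))
        return None
    return js["users"]

good = parse_friends('{"users": [{"screen_name": "someone"}]}')
broken = parse_friends('{"users": ')
unexpected = parse_friends('{"errors": []}')
print(good, broken, unexpected)
```

In the walkthrough the caller breaks out of its loop in the first case; returning None here leaves that decision to whoever calls the function.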
Okay, so then once we've retrieved it and we're pretty happy with it, we're going to update the account that we are retrieving: we're going to set that this is one of our retrieved accounts. Okay, and then what we're going to do is write a loop that goes through all the friends of this particular user that we're asking about, gets their screen name, and prints it out. Then we're going to check to see if this one is already in our People table, because this is a spider; we're grabbing accounts. And so we'll do a select for the friend id, do a fetchone, and grab the sub-zero thing. If this person's not in there, this fetchone sub zero is going to blow up, which means we're going to drop down to the except code. But if it does work, friend_id tells us that they're there, they're already in our database, right? They just weren't retrieved. Okay, and so now if the friend id wasn't there, we're going to do an insert, setting retrieved to zero, and then we're going to commit right away. Remember, row count is how many rows were affected by this last transaction: cur.rowcount. And we're going to die if that insert doesn't work. This is unlikely, unless somehow we've run out of disk space or something. And we're going to grab the friend id as the key of the last row that was inserted. We're only going to insert one row, so it's basically the primary key of the row that we just inserted.
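Here is a rough sketch of that look-up-or-insert pattern, assuming the People and Follows tables described earlier. The walkthrough uses try/except around fetchone()[0]; this version checks for None explicitly instead, which is a swapped-in style choice. The helper name and the sample account names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)")
cur.execute("CREATE TABLE Follows (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))")

def get_or_create_person(name):
    """Return the primary key for name, inserting a new unretrieved row if needed."""
    cur.execute("SELECT id FROM People WHERE name = ? LIMIT 1", (name,))
    row = cur.fetchone()          # None when the person is not in the table yet
    if row is not None:
        return row[0]
    cur.execute("INSERT OR IGNORE INTO People (name, retrieved) VALUES (?, 0)", (name,))
    if cur.rowcount != 1:         # exactly one row should have been affected
        raise RuntimeError(f"Error inserting account: {name}")
    conn.commit()                 # so later selects see the new row
    return cur.lastrowid          # primary key SQLite assigned to the new row

me = get_or_create_person("drchuck")
for friend in ["alice", "bob", "alice"]:   # "alice" appears twice on purpose
    friend_id = get_or_create_person(friend)
    cur.execute("INSERT OR IGNORE INTO Follows (from_id, to_id) VALUES (?, ?)",
                (me, friend_id))
conn.commit()

people_count = cur.execute("SELECT COUNT(*) FROM People").fetchone()[0]
follows_count = cur.execute("SELECT COUNT(*) FROM Follows").fetchone()[0]
print(people_count, follows_count)
```

One way or another the function comes out the bottom with a valid id, which is the property the walkthrough relies on before inserting into Follows.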
So if you look at this code right here It comes out the bottom one way or another with friend id successful Right friend id is either they're already in our database Or they're not and if we insert them then we have it and so now This count new and count old is just so I can print out a nice print out Now we are going to insert into the friend table, which is called the follows table in this case From id and to id those are the those are the two outward outward pointing uh foreign keys And we have the id of the account that we are retrieving the friends of and then this particular friend And so we're inserting the connection from this person to that person And then we commit it we want to commit these again so that later selects when the loop goes back up Later selects get all of that data that's going on Okay, so we do want to commit from time to time and then we close the cursor at the very end. Okay So let's run this and see what happens Okay, so python twfriends Dot py Oh Of course I am a refugee from python 2 so I always forget to type python 3 Okay, so we're going to start If we take a look right now, I'm going to start another tab over here and ls minus l star sql light Now that sql light file is there right and it's actually made the tables if you go up here It ran all this stuff create the tables yada yada, and we're sitting right here at this line As a matter of fact, I think without causing too much trouble. I can open that database And get into this database right here and there is no data in the follows table And there is no data in the people table. It's completely empty. Okay, so we're waiting for the first one And I'll go with mine dr. Chuck So it's retrieving the 100 friends and they all were brand new they're all inserted right Um, and so now if I hit refresh We will see that dr. Chuck has retrieved um Who follows so these are all the people I follow for one follows two So if we look at here, we see that dr. 
Chuck follows Stephanie teesley because we grabbed the follower as a dr Chuck, you know, we're going to have a record in all of the follows For all the ones that I did right so these are all the people I followed And we put them in So we can go back and We can let's see grab somebody. Let's go grab Stephanie teesley And let's pull out her friends So we grabbed 100 of her folks. I got 14 left. That's my x-ray limit. So I did Stephanie teesley. So let's go back here So you'll notice there's 101. There's probably going to be oh 182 That's interesting. So we've retrieved dr. Chuck and Stephanie teesley And let's go take a look in the friends table the follows table Okay So we have all of the people I follow now all the people Stephanie follows Okay, so there we go. So let's go ahead and do somebody else. Um, let's see. I think we both follow Tim a k. Where's tim a k? Yeah, let's follow tim a k. Let's see what who tim follows see if we can get like an overlap Oh, we revisited some let's see if we can see this in the follows See people So we've got dr. Chuck retrieved and tim a case somewhere down here It might take us a while before we Get any really good overlaps Let's see Let's do a database call Let's see. Let's do a database sql select Count Okay, so let's just run this some more. It's clearly working now. One thing I can do here is I can hit enter And it will just pick one randomly So it grabbed live edu tv and now I can and let's see how many I got left We got 12 left and now I can hit enter again and it picks another one That was the next one. I was kind of picking them in order. Is it picking them in order? Let's go to people Yeah, it's picking these so it's gonna we can see that it's going to just do the first unretrieved person Who's Nancy? Let's let it retrieve Nancy. 
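What happens when you hit enter by itself can be sketched like this, assuming the People table with its retrieved flag. The helper names and sample accounts are hypothetical. Note that the query has no ORDER BY, so "first" simply means whichever unretrieved row SQLite returns first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT UNIQUE, retrieved INTEGER)")
cur.executemany("INSERT INTO People (name, retrieved) VALUES (?, ?)",
                [("drchuck", 1), ("alice", 0), ("bob", 0)])

def next_unretrieved():
    # LIMIT 1 means we get one row back, or None if nothing is left to do.
    cur.execute("SELECT id, name FROM People WHERE retrieved = 0 LIMIT 1")
    return cur.fetchone()

def mark_retrieved(name):
    cur.execute("UPDATE People SET retrieved = 1 WHERE name = ?", (name,))
    conn.commit()

picked = []
while (row := next_unretrieved()) is not None:
    person_id, name = row        # the two-tuple: id and name
    picked.append(name)          # a real spider would call the API here
    mark_retrieved(name)

print(picked)
```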
So it grabbed Nancy. So we're finding some new ones, and this table is getting really big. If we look at the People table, we now have 455 people, and we have 467 Follows records. And so there we go. Oops. Hit enter, it does another one, and away we go. So you get the idea. I can type quit to finish. And just to give you a little interesting bit of code to show you how to do selects, I'm going to do this twjoin now. You'll notice that we're not talking... Oh, let me show you one thing: ls -l friends*.sqlite. So the database is there, so I can restart this process and run it again and the database is still there. And so we just grab swear track. And so we can keep doing this, and this data keeps extending. So this is a restartable process: I can run it and then tell it to grab the next unretrieved one, and away we go, right? So that's part of it. I've got eight left. How many have I got left, really? Let's keep going. How many do I have left? I've got five left. Okay, I guess we'll just run it out. So I've got four left. You know what I should do... I can't change the code while it's running, I guess, but I can stop the code and quit. So what I'm going to do is change this code a little bit really quick, and I'm going to print the rate limit headers at the beginning and at the end. So now I can run it again. I changed the code; hopefully I didn't make a Python error. Tell it to go get another one: in devarro. And so I've got three left. We'll see what happens when I run out of rate limit. So we have one left. Hit enter. Hit ctrl k. Open source org, so we have zero left. That worked. Now let's see what happens. I don't know what happens next. Oh, we blew up: too many requests. So we got an HTTP Error 429. That was going for mark cuban. That was in line 48.
So the right thing to do would be, in line 48, we should really put this in a try/except block, because it gives us an error. Print... oh, fiddlesticks, how do I print the exception message? I always forget. Print 'Failed to retrieve'. Okay, so we'll put that in. And then I have to put a break here, because that's not good. Break. Now if I run it... 'Failed to retrieve', but I've got to figure out... oh, I see, I never know how to print out the error message. See, that's the weird thing: I don't ever remember enough. I don't remember the syntax, what I say here to print the error message out. So I'm going to go to Google and I'm going to say 'print out the exception message in python'. Oh, Python 3. Hello. Okay, so let's go find it here in the documentation. Is this it? Is this what I say? I just want to print out the message. Ah, that's it: except. Let's try this. So this is part of Python programming, for me at least, because I'm just not a genius expert at this stuff. This is one thing I like about Python: you can guess stuff, and sometimes you guess right. So there we go. We got the nice little error message, and we see Error 429, too many requests. So that cleans that up nicely. So we have run out of requests, and on that, it is a good time to say thanks for listening, and I hope that you found this valuable.
And welcome to our final chapter, retrieving and visualizing data. In this chapter we are going to basically bring this all together: databases, web services, code, loops, logic. We're going to solve a problem that is a multi-step data analysis. We're going to find some data on the Internet, which might be HTML, might be an API, or whatever, and we're going to write a relatively slow process that's going to pull data slowly, because these are all rate limited.
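Since rate limits come up again here, this is roughly what the try/except fix from the end of the twfriends walkthrough looks like. It is a sketch under assumptions: the fetch function and the stand-in opener are made up so that nothing touches the network, and the 429 is raised by hand rather than coming from a real server.

```python
import urllib.error

def fetch(url, opener):
    """Call opener(url) and return its result, or None after printing the error."""
    try:
        return opener(url)
    except Exception as err:
        # Naming the exception with "as err" is what lets us print its message.
        print("Failed to retrieve", err)
        return None

def rate_limited_opener(url):
    # Hypothetical stand-in for urllib.request.urlopen after the limit is used up.
    raise urllib.error.HTTPError(url, 429, "Too Many Requests", None, None)

result = fetch("https://example.com/friends", rate_limited_opener)
print(result)
```

In a loop, a None result is the signal to break rather than keep hammering an API that has already said no.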
This is a slow and restartable process So you have can be a start this and what we're going to do is we're going to have a database That's going to hold the data that we're pulling and so this might take several days actually If you really have to do it and then you'll build up your data in your database and then What you tend to do is you tend to produce two databases. One is kind of a raw database that You know is you really it's all of its data Columns are aimed at helping you figure out what you've got to retrieve yet and what you haven't retrieved yet So that's kind of a crawling spidering process And then you find that the data is kind of nasty and ugly and you find that before you're going to do any analysis You probably want to clean and process it so you in a lot of these you're going to go from a raw database to a clean one And this is going to be really large and this is going to be really small And and you're going to do this sort of once but slowly and you'll do this as many times you need changing this program Cleaning the data up over and over and over again And then you'll end up with really clean data and that's relatively small And you might run programs that'll loop through this to do visualizations or analysis or some things Or whatever and so you'll actually sort of use this database as a source of information okay, so That's the basic pattern of what we're going to work with now This is what I call personal data mining and if you're going to do this Seriously python is used in lots of data mining activities But if you're going to do data mining seriously with really really large data sets, we're doing small to medium sized data sets As you might do sort of for an individual personal research versus like an organizational research where you're processing The logs of a web server or something like that And there's lots and lots of wonderful technology and what's really cool is This technology just keeps getting better and better because the 
whole data and mining data analysis Natural language processing field is just so hot right now. It's so awesome we're going to keep it simple and do stuff for ourselves for now and And and I gave you a bunch of sample code It's going to make it so that you can adapt this sample code to solve the problems that you need to solve So like I said, this is more of a programming exercise data mining might be a lot more complex If you're doing simple research, this might actually model what you do pretty well So the first thing that we're going to do is what's called use the google's uh json api for geocoding And uh, there are two versions of this one version requires a key and one version It doesn't require a key Google used to make all this data available for free but with just a rate limit But now they're making increasingly requiring a key So I give you code in this zip file that kind of does both If you really wanted to do something in production of taking User entered places and names and getting precise latitude longitude coordinates So you can produce a nice little google map like this Um and But if since google has made a rate limited api, I've actually Prespired a copy of a google data and I have my own sort of fake google api And so you you can do your assignments and test all your code using my fake api Um, which has no rate limits and and has no problems, but the it's only a limited set of the data And so this is the basic process and it's it's one of those things that it's it follows that basic personal data modeling Personal personal data mining pattern. And so here's this api, which is either google or me I've got my own dr. Chuck version of this dr. 
Chuck net version of this and there is a An input queue of the location So this is the user data where they just put in the name of where they think they live university of a tube again or something and So this is the queue of the things that are to be retrieved And in in my case when I built this map for the first time there was like 15 000 And I it took me days to get this and so it would stop and so what I would do is I would you know Read the first one into this geo load dot py check to see if I already had it in my database If I didn't already had database I would go into the api and pull the data down and I would put it in the database And then I'll go to the next one the next one the next one And so you know I might get a thousand in my database and then it blows up or I'm told I can't go any further So I wait 24 hours I started up and it reads the first thousand and says oh, they're on the database already and then it starts at 1001 And then it adds that and adds that and then until it stops and so it took me Several days of processing to get this data right now I didn't have a separate cleaning process because this data is pretty simple I was pulling out the the json latitude and longitude etc And so I didn't have to do two separate processes to clean this data up It was clean enough right as I pulled it Because I was talking to an api if you're talking to the html sometimes it gets nasty and ugly And so then I wrote this program that just reads through it It just does a select and you know reads through the stuff and it prints out some summary information and tells you what to do it also prints out you'll see this pattern because um, You know, I'm I'm divisualizing using browsers html and this happens to be using the google maps api um And putting all the data in a little javascript file So these end up being assignment statements in javascript you can take a look at that file and uh all the data Shows up as assignment statements in the javascript and 
then when this html loads It reads this file and puts up all those pins as long as you have access to the the in browser javascript api So the next thing we're going to talk about is page rank, which is spidering now html We talked a lot about this spider html get some links and so up next we're going to actually build a real database Full-featured search engine using page rank This is another worked code example You can download the sample code zip file if you want to follow along And the code that we're working on today is what I call the geodata code And that is a code that is going to pull some Some locations from this file We're we're simulating or using the google Places api To look places up and so we can visualize them on a map and so this is the basic picture If we take a look at this where dot data file It's just a flat file that has a list of organizations and this actually was Pulled from one of my moog Surveys we just let people type in where they were went to school and this is just a sample of them So this data is read in by this program geo load dot py And if you recall this google geodata has rate limits It also has api keys, which we'll talk about in a bit too And so the idea is this is a restartable spider like process And so we want to be able to run this and have it blow up and run it and start it and not lose what we've got Right and so this is unlike is some though So we're not now using a database as well as an api But in order to work around the rate limits of this api We're going to use the database with a restartable process And then we'll make some sense of this and then we'll visualize this but In the short term, let's start with geoload.py code geoload.py take a look here. 
So A lot of this hopefully by now is somewhat familiar to you url lib json sql lite And so I mentioned that the google apis these used to be free and did not require an api key But increasingly they're making you do api keys for especially New ones and so what happens you you can go to your google places. I mean go to google apis and get Uh get an api key and you can put it in here It'll be this long big long thing that looks like that And then if you have an api key you can use the places api And I've got a copy of a subset not all of it a subset of it here at this url As a matter of fact, you can just go to this url in a browser And it will tell you a list of the data that it knows about okay And um and and I made it so that that does the same basic protocol with the address look You know address equals as the google places api So this will just change how we retrieve the data either retrieve it from my serve nice thing about my server It's got no rate limit It's really fast and you're not fighting with google all the time And it means that perhaps if you're in a country that google is not well supported you can use my api I mean that's really strange that somehow my api is More reliable and available than the google one, but it's true So we're going to make a database We're going to do a create table if not exists and we'll have some address and we're really just caching the Geographical data we're going to cache the json one of the things we do when we build these processes is we tend to simplify these things And not do all the calculation and parsing the json Just load it and get it in and load it and get it in and fill the data up in this database And so that's what we're going to do um Because python doesn't ship with any legitimate certificates. 
We have to sort of ignore certificate errors. We're going to open the file, and we're going to loop through it and pull out the address from the file, and we are going to select the geodata where that address is the address. Let's move this in a bit. And so we're going to do a select and pull out that address, and the idea is, if it's already in the database, we don't want to do it again. So we do a fetchone and pull out that first thing, which will be the JSON right there. If we get that, we'll continue up. Otherwise we'll keep going. Pass just means don't blow up: so we except and we just do a pass. That's like a no-op. And we're going to make a dictionary, because that's what we do for the key-value pairs. Everything you've seen so far, I've used constants here, but because we may or may not have an API key, it's query equals and then that's the address, and then key equals and then the API key. If you recall, urlencode adds the pluses and question marks and all that nice stuff. We're going to retrieve it, read it and decode it, print out how much data we've got, and add to the count. And then we're going to try to parse that JSON data, and print it if something goes wrong. As we've seen, the top level of this JSON data from the geocoding API is an object, which we'll see a little bit of in a bit, and it has a status field in it, and the status is OK if things went well. So if the status is not there, that means our JSON is not well formed or not how we expect it. If the status is not OK and not equal to ZERO_RESULTS, then we print out failure to retrieve and then quit. And then we're simply going to insert this new data that we just retrieved, and then we're going to commit it. And every tenth one, this is count mod 10, we're going to pause for five seconds. And we can hit ctrl-C here, and then we're going to do the geodump.
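The shape of that loop can be sketched as below. This is an illustration under assumptions, not the geoload.py source: the service URL is a placeholder, fake_fetch stands in for the real network call, the Locations table and its column names follow the walkthrough's description, and the pause defaults to zero so the sketch runs instantly (the lecture pauses five seconds).

```python
import json
import sqlite3
import time
import urllib.parse

conn = sqlite3.connect(":memory:")   # the lecture uses a file, geodata.sqlite
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS Locations (address TEXT, geodata TEXT)")

SERVICE_URL = "https://example.com/json?"   # placeholder, not a real endpoint

def fake_fetch(url):
    # Hypothetical stand-in for urlopen(url).read().decode(): canned JSON text.
    return json.dumps({"status": "OK", "results": []})

def load(addresses, fetch=fake_fetch, pause=0):
    """Cache the raw JSON for each address; return how many were newly fetched."""
    fetched = 0
    for address in addresses:
        address = address.strip()
        cur.execute("SELECT geodata FROM Locations WHERE address = ?", (address,))
        if cur.fetchone() is not None:
            print("Found in database", address)
            continue                       # already cached: skip the network
        url = SERVICE_URL + urllib.parse.urlencode({"query": address})
        data = fetch(url)
        try:
            js = json.loads(data)
        except json.JSONDecodeError:
            print(data)                    # show the bad payload and move on
            continue
        if (not isinstance(js, dict) or "status" not in js
                or js["status"] not in ("OK", "ZERO_RESULTS")):
            print("==== Failure To Retrieve ====")
            break
        # Store the JSON text as-is; parsing is left to a later program.
        cur.execute("INSERT INTO Locations (address, geodata) VALUES (?, ?)",
                    (address, data))
        conn.commit()
        fetched += 1
        if fetched % 10 == 0:
            time.sleep(pause)              # be gentle with the server
    return fetched

first_run = load(["Ann Arbor, MI", "Paris"])
second_run = load(["Ann Arbor, MI", "Paris", "Tokyo"])   # simulated restart
print(first_run, second_run)
```

On the second call only the new address goes to the network, which is the restartable behavior the walkthrough is after.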
Okay, so let's just run this geodata python So let's do an ls So we don't have oh, we do have let's get rid of from a previous test geodata.sqlite. So we'll start with a fresh A fresh set of data and run python geoload.py Of course, i'm always forever making the mistake if we're getting python 3 So you can see that it's running and it's adding the query and in this case I don't have the api key and it's putting the pluses in and that's this part here with all the pluses That's the url and code and you notice it's pausing a bit now depends on how fast your net connection This may or may not go so fast, but this is not that much data. So it should it's like only 2000 3000 characters And so it's working and talking to my My server and the interesting thing here is I can blow this up. I'm going to hit ctrl c And windows you'd hit ctrl and linux you'd hit ctrl c and then windows I think you'd hit ctrl z depending on what shell you're working in but i'm going to hit ctrl c And you see I sort of blew it up right and that's it causes a trace back a keyboard keyboard interrupt trace back I do an ls minus l You can see that now this geodata is there now in the in the name of restarting. I will restart this And you will see that it checks and skips and so all it runs this code here where it's Right here it grabs it and finds it in the database So you'll see it say found in the database really quick chop chop chop and go really fast And then it'll go back to catching up where it left off And so all those up there they did not actually re retrieve it because it knew about those things And so now it's catching up and doing some more and doing some more and doing some more And then I'll hit ctrl c it has a little counter in here that basically if it hits 200 it stops and you have to restart it You could obviously change this code. 
You could make it so it didn't sleep Doesn't hurt to sleep for like a second after every 100 or so if you want you could change that code um, and now uh, let's just hit ctrl c And blow it up ls minus l Um, and there is another bit of code and this code it's always good to write these really simple things And so we're gonna now we're going to import sqlite and json. We're going to connect ourselves up We're going to uh open Except this is a utf 8 because it's a ut we're going to open this with utf 8 and um We're going to read through and in this case. We are going to um Decode We did select star from locations and if you recall locations has a A location and a geodata And so the sub zero will be the location and the sub one will be the The geodata and we're going to parse it Convert it to a string and then parse it If something goes wrong with the json, we'll just keep skipping it We're a check to see if we have the status in our json um Let me run the sqlite browser here now open database. Let's take a look at what's in this database Oh Where are we code 3? geodata geodata sqlite So this is our the data we've got So if you make this a little bigger if I can can I make that bigger? Yeah, it's not going to show us much So you can see that these are the addresses in the geodata. That's just the json So that's the json that we got and it retrieves it and so this is a really simple database That's just a sort of spidering process run run run But now we're going to run the geodump code which is going to read this and dump this stuff out and print where.js So it's going to actually parse this stuff and that's code. We've seen before um So we're actually reading it and this line goes into the results It's the results is an array. 
So if we go into results results in an array We're going to go grab the zeroth item in that array And then we're going to go find geometry And then location and then lat and long for the latitude and longitude And then we're also going to take the actual address out of the formatted address Right here. So in this In this bit of code, we're actually parsing the json and we're going to um clean things up Get rid of some single quotes. This kind of data cleaning is just stuff after you play with it for a while You realize oh my data is ugly or does this i'm going to print it out And then i'm going to write this out and i'm going to write it into a javascript file and so the javascript file is this where.js and um This i'll show you what it looks like. It's going to be overwritten This is the one that came out of the zip file. It'll have the latitude the longitude and we're going to use Javascript to read this in this where.html file It's going to actually read this right there and pull that data in and that's how we're going to visualize I'm not going to go into great detail on how the visualization happens But that's what's happening and so we're going to write that so we're going to actually write this to a file So let's go ahead and run this code and say python 3 geodump Okay, so it wrote 120 records to where.js So if we look at where.js, this is now the new data that I just downloaded moments ago And it says open where.html in a browser Now this will you'll need the google maps api And you might not be able to see this depending on where you're at but here you go with uh Google maps locations, and I think if you hover over this You can see and you see the utf why we're there in that particular thing Why we had to use the utf8 when we wrote the file so that we didn't end up with trouble writing the file out And so there you go. 
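A rough sketch of the parsing step in geodump, assuming JSON shaped the way the walkthrough describes: results is an array, and the first item holds geometry, location, and formatted_address. In the geocoding JSON the longitude key is spelled lng. The sample rows, the to_js_lines helper, and the myData variable name are all made up for illustration.

```python
import json

# Hypothetical cached rows: (address, geodata JSON text). The values are invented.
rows = [
    ("Example University",
     json.dumps({"status": "OK", "results": [{
         "formatted_address": "1 Example St, O'Fallon, USA",
         "geometry": {"location": {"lat": 42.5, "lng": -83.5}}}]})),
    ("nowhere", json.dumps({"status": "ZERO_RESULTS", "results": []})),
]

def to_js_lines(rows):
    """Turn cached geodata rows into JavaScript array literals, one per location."""
    lines = []
    for address, geodata in rows:
        try:
            js = json.loads(geodata)
        except json.JSONDecodeError:
            continue                        # skip anything that will not parse
        if js.get("status") != "OK":
            continue                        # skip rows with no usable result
        first = js["results"][0]            # the zeroth item in the array
        lat = first["geometry"]["location"]["lat"]
        lng = first["geometry"]["location"]["lng"]
        where = first["formatted_address"].replace("'", "")  # drop single quotes
        lines.append(f"[{lat},{lng}, '{where}']")
    return lines

js_text = "myData = [\n" + ",\n".join(to_js_lines(rows)) + "\n];\n"
print(js_text)
```

Stripping the single quotes keeps the address from ending the JavaScript string early. To produce a file like where.js, this text would be written out with open("where.js", "w", encoding="utf-8").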
And so that is a simple visualization. It wrote this where.js. If you are comfortable with HTML and JavaScript, you can look at this where.html file. It's really just reading through a bunch of data and putting the points on the map. That's all there is, but I'm not going to go through that, at least not in this video. So I hope that this was useful to you, and thanks for watching. So now we're going to write a search engine and do some of the things a search engine does. We're going to do page rank, and we're going to visualize it in a web browser and show the weights. We're really only going to do page rank on one site, because you want to have more than one page that points to a page so that you can figure out which pages are more or less important. And then we visualize it. We'll run the page rank algorithm, and we'll do all of this separately. So at this point we're going to do pretty much the web crawling, the index building, and the searching. We're not really going to search it; we're going to visualize the index. But you could write a simple program to do searches for keywords and figure out which page was the most likely page for a keyword. That would be a fun additional thing to do. So the web crawler is this program that hits a page, pulls down the HTML, parses the page, looks for links, and makes a queue of incoming links that are as yet unretrieved. I'm going to do this in a simple SQLite database. The database basically starts with one link as the starting point. Then it retrieves that page, and you see the database end up with lots of unretrieved pages, and it goes back in and picks a random page and retrieves that one. And then it just expands and expands. This code that I've built, that you're going to play with, only stays on one website; otherwise it would go crazy. And of course Google doesn't use an SQLite database running on your hard drive, but you'll get the idea.
You'll see this thing exponentially gain links, and you'll run it for a while and pull down a thousand web pages or whatever. But of course make sure that you don't violate any terms and conditions. And again, I've got some data sources that you can use, and they're not rate limited. But you can also use things like Wikipedia, which I think sort of discourages you, or dr-chuck.com, which has no rate limit, or who knows what, right? So just be careful. Don't do this on Facebook and don't do it on Google. Don't get yourself in trouble. And if you're using an internet connection where you're paying for bandwidth, be careful. So this is the idea of the web crawler, and this isn't my picture. This is the classic picture of a web crawler: read a page, parse it, take all the URLs and stick them in a queue, and crawl again and again. So for us the scheduler is going to do it as long as you say; you'd say, oh, do a hundred pages, or it runs until it blows up. And again, for these processes that have the network in the loop, it's really important that they behave well when they blow up. And that's why databases are so useful, because you can be writing along to the database and some random thing happens and blows your program up, and you start over. So you're reading these things, you're storing each page, building up your storage, et cetera. You just keep on doing that. And with this program you'll be able to retrieve some stuff, then run the page rank, then retrieve some more and run some more page rank, and you can kind of see how Google evolves its index over time. Of course, we're so much simpler. And like I said, be careful when you crawl. You're going to run a crawler that just goes as fast as it can, but Google doesn't do that. It's careful not to overwhelm any websites.
It's trying to be smart about the use of your bandwidth on your website There is a file Our code won't bother looking at this But there's a file called robots.txt that real web crawlers look at and it gives a list of the things you're in Are allowed to look at and not allowed to look at and so if you go to google and you see a search It says we're not allowed to show you The summary text of this page because of the robots.txt It's there and you can go and you can actually see A robots.txt Like on just go to any website. It's at the top root Blah blah blah blah slash robots.txt. Don't it's not a path. It's not slash this slash that slash something else robots It's at the very very top of a website The index building uses the page rank algorithm and the whole goal of page rank algorithm is to Figure out which pages have the Most best links. So having the most links is really easy. You can just say how many links go to this But the problem is is you got to figure out the value of those links And then you have to how do you figure the value of those links by looking at how many good links come to it So it turns out that it's a an infinite problem. 
It's an infinitely difficult problem to solve exactly with page rank, but you can approximate it, and what happens is after a while it converges to a reasonable value. And so we're going to run the search index, and each time it runs you're going to see that it says how much these numbers changed. What happens is in the beginning they change very wildly, but quickly they flatten out. The best way to think about page rank is to think about how water runs, where you have a small little stream going by a house, and sometimes it rains and sometimes it's dry, and there's a little lake. The stream is always running, and it doesn't go up and it doesn't go down. It might go up a little bit if it rains a lot, but in general there's sort of a steady state, meaning that whatever water's coming in is about the same as the water going out. So we think about this in terms of web pages: the value of the links coming in is roughly the same as the value of the links going out. So when the in and the out value from each of the nodes starts to balance, then you've got a pretty stable state. And so what Google does is they have a relatively stable assessment of the goodness and value of pages, and they use that to compute page rank. Then they throw a few more pages in and it kind of has to adjust for a while, but it reconverges. And so this is a calculation that generally converges, and it doesn't vary wildly, and that's why Google is pretty good at arriving at the true value of something.
So let's take a look at what we're going to do in this application. Again, we have a file that is going to spider the web, and we only have one database again in this one. We'll have two databases in the next one. So the spider is the restartable part, and what we actually do is we put one URL in as the starting URL. Then the spider walks in and asks, are there any unretrieved pages? It does that randomly; it picks among the unretrieved pages and says, okay, great, I'll go retrieve that page, and then I'll parse that page, and then I'll put in the text of that page as well as a bunch of new unretrieved pages. Then it'll go back up and say, oh, give me one of the unretrieved pages at random, and it'll grab the next page and pull that page down and then add to it. So this is like: there's a page and then a to-do list. Then this one becomes a page and adds a few more things to the to-do list. And so the to-do list, the unretrieved URLs, grows very rapidly, and the retrieved ones grow as you retrieve them one at a time. But you've always got this long list. If you have a really short site that only has two links, say you start at dr-chuck.com/page1.htm, it'll go to page two and then go back to page one and it'll be out of things to do. It'll have retrieved all of the pages. So if you have a website that has no external links, or has very few pages and they point to each other, this will run out of things to do. But if you go to a page like my blog, or the sample stuff that I have up for you for testing the spider on dr-chuck.net, it'll run for a very long time, and you'll have far more pages to retrieve than pages that you have retrieved. But that's okay. At some point you can stop this. Maybe it stops because you ran out of bandwidth, or maybe your computer went down, or who knows what, right?
But it's okay This is a restartable process because it always has some pages that are retrieved and some unretrieved pages You start it back up it picks randomly from the unretrieved pages The database is the sort of persistent state of your spider rather than a bunch of dictionaries or lists inside the python Which go away when the program dies And uh, and so at some point you have let's just say a few hundred pages in here and a few thousand unretrieved pages You can run the page rank algorithm And what the page rank algorithm does is it loops through all the pages and figure out which pages are linked to which pages And then reads the numbers and then updates the numbers and then does that some number of times And so this is where the numbers all the pages sort of start out with goodness of one Uh, I think this printout is showing that goodness of one and then it changes and then the goodness goes to The some of the goodness goes up to two some of the goes it goes to seven and whatever But then it does this over and over and then it uses these numbers and then they change again And so there's a number of time steps that this page rank runs And you will see as the page rank runs when we show you the code you'll see the average sort of Change in these numbers across all these things and you'll see that it the average goes down very rapidly As you get through and so usually with a few hundred or even thousand pages like a hundred Plus times during this algorithm and these numbers have converged And that's when you sort of can begin to trust the numbers now There's this one program called sp reset which sets all the pages back to one so you can start this over So if you were to spider for a while run sp rank for a while play around And then you wanted to spider some more and start it over you could say oh, let's start the page rank completely over Or you could simply take the new pages and and and and watch it adapt it either way This is just a way to reset all the 
pages to have sort of their initial value of a goodness of 1.0 So at some point you run this this this runs really this part here runs really slow This part runs super fast like in the blink of an eye. This one is pretty fast Um, and then at some point you've got these Pages that have you know numbers on them. They have values on the pages And there's a couple of programs that allow us to visualize that one is the dump Which just reads it and checks to see it shows the the new page rank the old page rank Um, and various other things and shows just a way to dump it And then there's this thing that reads the whole thing You say I'd like to do 25 the top the best it sorts it by page rank and then produces a javascript file It has just the the numbers in it. And then there is some html and a visualization library called d3.js Which you can read about that when the html starts it reads this and has this nice force-directed layout of the page rank and you can hover over things and you can see What page rank you've got And so and so that is the page rank algorithm that we're going to do and up next we'll do the Largest and most complex of these things and that is the email. We're going to spider some email Which is about a gigabyte of data. Okay We're doing a bit of code walkthrough and if you want to you can get to the sample code and download it all So that you can walk through the code yourself What we're walking through today is the page rank code. And so the page rank code Um, let me get the picture of the page rank code up here Here's the picture of the page rank code and so the page rank code is a Has four chunks of code that are going to five chunks of code that are going to run The first one we're going to look at is the spidering code and then we'll do a separate look at these other guys Later, so the first one we'll look at is spidering and again. It's sort of the same pattern of we've got some stuff on the web In this case web pages. 
We're going to have a database that sort of just captures the stuff. It's not really trying to be particularly intelligent, but it is going to parse these with Beautiful Soup and add things to the database. And so then we'll talk about how we run the page rank algorithm and then how we visualize the page rank algorithm in a bit. Now the first thing to notice is that I put the Beautiful Soup code in right here. You can get this from the bs4.zip file. There might need to be a README; there's a README somewhere. But you've got to use Beautiful Soup: you've got to unzip this bs4.zip here or you have to install Beautiful Soup yourself. I provide this bs4.zip as a quick and dirty way if you can't install something for all of the Python users on your system. So that's what it's supposed to look like; you're supposed to have it unzipped right here in these files. And I don't know what dammit.py means; that came from Beautiful Soup. If you look, it's in their source code. So I'm not swearing; it's the Beautiful Soup people who are swearing. I'm sorry. I apologize. Okay, so the code we're going to play with the most, this first one, is called spider.py. We're going to do databases, we're going to read URLs, and we're going to parse them with Beautiful Soup. And so what we're going to do is make a file; again, this will make spider.sqlite. And here we are in pagerank, and if I do ls -l, spider.sqlite is not there. So it's going to create the database.
We do CREATE TABLE IF NOT EXISTS. We're going to have an integer primary key because we're going to do foreign keys here. We're going to have a URL, which is unique, the HTML, and whether we got an error. And then for the second half, when we start doing page rank, we're going to have old rank and new rank, because the way page rank works is it takes the old rank, computes the new rank, then replaces the old rank with the new rank, and then does it over and over again. And then we're going to have a many-to-many table which really points back to pages, so I call the columns from id and to id. We did this with some of the Twitter stuff. And then this webs table is just in case I have more than one web, but that really doesn't make much difference. Okay, so what we're going to do is select id, url from pages where html is null, which is our indicator that a page has not yet been retrieved, and error is null, ordered by random. Not all of this SQL is completely standard, but this ORDER BY RANDOM() is really quite nice in SQLite. LIMIT 1 just says randomly pick one record in this database where this clause is true. And then we're going to fetch a row, and if that row is None, we're going to ask for a new web, a starting URL. This is going to fire things up, and we're going to insert this new URL.
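Here is a minimal sketch of the tables and the "pick a random unretrieved page" query just described. The table and column names (`Pages`, `Links`, `Webs`, `old_rank`, `new_rank`, and so on) are assumptions based on the narration, not a copy of spider.py, and it uses an in-memory database and a placeholder start URL so it can be run anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the walkthrough writes to a file instead
cur = conn.cursor()

cur.executescript("""
CREATE TABLE IF NOT EXISTS Pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,
    html TEXT,
    error INTEGER,
    old_rank REAL,
    new_rank REAL
);
CREATE TABLE IF NOT EXISTS Links (from_id INTEGER, to_id INTEGER);
CREATE TABLE IF NOT EXISTS Webs (url TEXT UNIQUE);
""")

def pick_unretrieved(cur):
    """Return (id, url) of one random page with no HTML and no error, or None."""
    cur.execute("""SELECT id, url FROM Pages
                   WHERE html IS NULL AND error IS NULL
                   ORDER BY RANDOM() LIMIT 1""")
    return cur.fetchone()

start = "http://www.example.com"  # placeholder starting URL
if pick_unretrieved(cur) is None:
    # Fresh database: prime it with the starting web and one unretrieved page.
    cur.execute("INSERT OR IGNORE INTO Webs (url) VALUES (?)", (start,))
    cur.execute(
        "INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES (?, NULL, 1.0)",
        (start,),
    )
    conn.commit()

row = pick_unretrieved(cur)
print(row)
```

Because "html is null" is the only marker of unfinished work, stopping and restarting the program needs no extra bookkeeping: the same query finds whatever is left.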
Otherwise, we're going to restart we we have a row to start with And otherwise we're going to sort of prime this by inserting the url We start with And insert into it if you enter it just goes to dr chuck.com which is a fine place to start And then what we do is we This this this what this does is it's page rank is it use this this webs table to Limit the links it only does links to the sites that you tell it to do links And probably the best for your page rank is to stick with one site Otherwise, you will just never find the same site again if you let this wander the web aimlessly And so I generally run with one one web which web this should be probably called web sites And I pull in all the data and I read this in and I just make myself a list of the url the legit urls And you'll see how we use that and the webs is how many what are the legit places we're going to go because we're going to Go through a loop Ask for how many pages? And we're going to look for a null page again. We're using that random order by random limit one and then we're going to have a We're going to Grab one we're going to get the from id which is the page we're linking from And then the url Otherwise there's no one retrieved and so the from id is when we start adding links To our our page links. 
We've got to know the page we started with, and that's the primary key. We'll see how that primary key is set in a second. Otherwise we have None. We're going to print the from id and the URL that we're working with. Just to make sure, we're going to wipe out all of the links from this page, because it's unretrieved. Links is the connection table that connects from pages back to pages. So we're going to go grab this URL and read it. We're not decoding it, because we're using Beautiful Soup, which compensates for the UTF-8. We can ask for the HTTP status code, and we check it; 200 is a good status. And if we get a bad status, we're going to say there was an error on the page, and we're going to set that error. We're going to update pages; that way we don't retrieve it ever again. We then check to see if the content type is text/html. Remember, in HTTP you get the content type. We only want to look for the links on HTML pages, and so we wipe that guy out if we get a JPEG or something like that. We're not going to retrieve JPEGs. And then we commit and continue. So these are kind of like, oh, those were pages we didn't want to mess with. And then we print out how many characters we got and parse it. We do this whole thing in a try/except block because a lot of things can go wrong here. It's a bit of a long try/except block. KeyboardInterrupt is what happens if I hit Control-C at my keyboard, or Control-Z on Windows. Some other exception probably means Beautiful Soup blew up or something else blew up, and so we indicate that with error equals negative one for that URL.
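The checks described here (status 200, content type text/html, and a broad try/except around the fetch) might be sketched as follows. The helper names `check_response` and `fetch` are hypothetical, and this is an illustration of the idea rather than the code in spider.py. Note that `urlopen` itself raises an exception for most non-2xx responses, which the broad except also catches.

```python
from urllib.request import urlopen

def check_response(status, content_type):
    """Return None if the page should be parsed, else a short reason to skip it."""
    if status != 200:
        return "bad status"
    if content_type != "text/html":
        return "not html"   # e.g. a JPEG: nothing to look for links in
    return None

def fetch(url):
    """Return (raw_bytes, problem). Does network I/O, so it is not called here."""
    try:
        with urlopen(url) as doc:
            problem = check_response(doc.getcode(), doc.info().get_content_type())
            if problem:
                return None, problem
            return doc.read(), None   # left undecoded for the HTML parser
    except KeyboardInterrupt:
        raise                         # let Control-C stop the program
    except Exception as err:          # anything else: record it and move on
        return None, f"failed: {err}"

print(check_response(200, "text/html"), check_response(200, "image/jpeg"))
```

A caller would store the problem against the page's row so that the same URL is not picked again on the next pass.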
So we don't retrieve it again At this point at line 103 We have got the html for that url And so we're going to insert it in and we're going to set the page rank to one So the way page rank works is it gives all the pages some Normal value and then it then it alters that we'll see that in a bit. So it sets it in with one We're going to insert Insert or ignore that's just in case this pages is already at the pages is not there, but Um, and then we're going to do an update and that's kind of doing the same thing twice Just sort of doubly making sure if it's already there This or ignore will cause this to do nothing and the update will cause us to retain it And then we commit it so that if we do selects later, uh, we get that information Now this code is Similar remember we use beautiful soup to pull out all the anchor tags We have a for loop. We pull out the href And you'll see this code's a little Little more complex than some of the earlier stuff because it has to deal with the real nastiness or imperfection of the web and so We're going to use url parse, which is actually part of the url lib code Um, and that's going to break the url into pieces come back We use url parse. 
We have the scheme, which is http or https Um, if it's a relevant this solves relative references This is solves relative references by taking the current url and hooking it up url join knows about slashes and all those other things We check to see if there's an anchor the pound sign at the end of a url And we throw everything past including the anchor away If we have a jpeg or a png or a gif we're going to skip it We don't want to bother with that these we're looking through links now We're looking at all the links And if we have a slash at the end, we're going to chop off the slash by saying minus one And so this is just kind of nasty choppage And throwing away the urls that we're going through a page and we have a bunch that we don't like or we have to clean them up or whatever And now and we've made them absolute by doing this. It's an absolute url This is just you write this slowly But surely when your code blows up and you start it over and start it over and start over Check then what we do is we check to see through all the webs Remember those were the urls that we're willing to stay with and usually it's just one If this would link off the sites of the sites we're interested in we're going to skip it We are not interested in links that leave the site So this is like link that left the site skip it But now we finally here at page line 132 We are ready to put this into pages url and the html and it's all it's all good, right? and Where so that one's that one's going to be null right there because we haven't retrieved the html This is null because this is a page. 
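The link cleanup walked through above (resolve relative references, drop everything from the pound sign on, skip images, chop a trailing slash, and skip links that leave the allowed sites) can be collected into one small function. This is a sketch under assumptions: `clean_link` is a hypothetical name, and the URLs are placeholders.

```python
from urllib.parse import urljoin, urlparse

def clean_link(href, page_url, webs):
    """Return an absolute, tidied URL to queue, or None if the link should be skipped."""
    if href is None:
        return None
    if len(urlparse(href).scheme) < 1:
        href = urljoin(page_url, href)       # relative reference: hook it onto the current URL
    href = href.split("#", 1)[0]             # throw away the anchor and everything after it
    if href.lower().endswith((".png", ".jpg", ".gif")):
        return None                          # images have no links to follow
    if href.endswith("/"):
        href = href[:-1]                     # chop off a trailing slash
    if len(href) < 1:
        return None
    if not any(href.startswith(web) for web in webs):
        return None                          # link leaves the sites we're staying on
    return href

webs = ["http://www.example.com"]
page = "http://www.example.com/index.htm"
print(clean_link("page2.htm#top", page, webs))
```

Keeping this as a pure function, with no database or network access, makes it easy to add a new rule each time the crawler trips over another odd URL.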
We're going to retrieve We're giving a page rank of one and we're giving it no html and that way it'll be retrieved and then we commit that Okay And then we want to get the id so we could have done this with One way or another but we're going to do a select to say hey What was the id that either was already there or Was just created and we grabbed that with a fetch one And say retrieve to id and now we're going to put a link in insert or ignore into links from id to id which is The the id the primary key of the page that we're going through and looking for links Two id is the link that we just created and a way we run So it's going to go and go and go and go If you let's go look at the create create statement up here From id and to id right there. Okay, so so let's run it Python three oops Python three spider python So it's fresh and so it wants a url with which to start and I'll just start with my favorite website www.doctorchuck.com Now this this basically this first one you put in it's going to stay on this website for a while Okay, so I'll hit enter and let's just grab like Let's grab one page just for yucks. Okay, so it grabbed that and um it printed out that it got 85 45 characters and it printed out that it got Six links so if I go to this and Open database And I go to Code three and I go to page rank and I look at this Oh Let me get out so it closes so here So notice this sqlite journal that means it's not done closing So I'm going to get out of this by pressing enter And so you'll notice now that that journal file went away. Otherwise, we would not be getting the final data. There we go Okay, so webs, let's take a look at the data webs Has just one URL. That's the URLs that we're allowing ourselves to look at You can put more than one in here if you want, but most people will just leave this as one pages So we got this first one and we retrieved this and this is the html of it and we found Six other URLs in there that are dr. Chuck com URLs, right? 
There was lots of other URLs in there But there were only five other ones That uh that we found okay, and so and and what we'll find is if we go to links We'll see that page one links to two links to three links to four links to five links to six because the links is just a Many to many table so page one points to page two Page one points to page two page one to three page one to five. Okay, so that's what happens when we have The first page so let's retrieve one more page now it's we could have Started a new crawl, but we're just gonna it's gonna stay on dr. Chuck comm and I'll just ask for one more page And so now it went and grabbed it randomly picked among these null guys and I'm gonna hit enter to close it and then I'll refresh this And oh so it looks like we retrieved obi sample and we didn't get any new links And so the links page. No, we didn't get any new links. So that page whatever that was obi sample had no external links. So let's do another one one more page So that one had 15 links. So let's take a look now so now we have 15 pages it picked this one to do right and now it added 15 more pages and then if you look at links you will see that Page four which is one it just retrieved Links back to page one so now we're seeing this is where the page rank is going to be cool four links to one four links to Whatever a way we go, right? One goes to four four goes to one. I should have probably put a uniqueness constraint on that. It's not supposed to That duplicated that Okay So let's run this a bunch of times now. So let's run. Uh, let's just run it a hundred times For a hundred pages It'll take a minute So you'll see it's like freaking out on certain pages and not parsing them, you know, it's fun and it's way into my blog um It's finding like 27 links this this table is growing wildly at this point It's going to take us a while before we get to 100. 
It's kind of slow Now the interesting thing is I can hit control c at any point in time Right and so that blew up But it's okay because the data is still there and so if we go back to pages for example and we refresh our data We see we got a ton of stuff and this will restart and all the things So if we search this that I sorted that by html You see that there's lots of files that we've got and it's never going to retrieve those again because those have html So then I can run this thing again And start it up And when I say control c your your computer might go down your network might go down There's all kinds of all kinds of things that might happen and you just pick up where it leaves off It just picks up where it leaves off and that's what's nice about this Okay so That's pretty much how this works Uh, we're we're we've got this part running. We're seeing it flow into spider to sql light We're seeing that we can start this and replace this And so what I'll do is I will come back in the next video and show you How all these things work together and then how we actually do the page rank So thanks again for listening and see you in the next video We're picking up in the middle here where we are running, uh simple spider that's retrieving data and putting it into Running this spider. Uh, a py file and it's cruising around and Doing things and and the beauty of any of these spider processes is I can stop anytime and just hit control c and uh, and so we take a look at the, uh spider sql light file And retrieve it and it looks like we got 302 pages. I don't know how many we got retrieved 70 okay, there we go We've got about, uh 100 Oh wait, I'm looking for the wrong thing. No, no, no, no, no No Yeah, we got about 107 pages So what we're going to do now with 107 pages is we are going to, um Run the page rank algorithm. 
Okay, so let's take a look at that code. So the idea of page rank: we're going to run this page rank algorithm. spreset just resets the page rank, and sprank runs as many iterations of page rank as you ask for. So the basic idea is that if you were to look at the links here, we think of page one pointing to page two as giving some of page one's love to page two. Page four has some value that it gives to page one. Page two gives love to page 46, over and over and over again. But the problem is, how good is page one, and how much positive karma does it give to page two? And so what happens is we start by giving every page a rank of one; we say everybody starts out equal. But then in one iteration of the page rank algorithm we divide up the goodness of a page across its outbound links and then accumulate that, and that becomes the next rank. So let's take a look at the code for the page rank algorithm. This is pretty simple. It only imports sqlite3 because it's really doing everything in the database, right?
It's going to it's going to be updating these columns right here in the database and so So we're going to do some things here to speed this up Um, this rank runs if you're thinking of google this rank runs slowly and it's going to run Continuously to keep updating these things So the first thing I do is I read in all of the from id's from the links Select distinct throws out any duplicates um, and and so I have all the The the from id's which are the all the pages that have links to other pages because uh all the pages are in pages But in links to be and have a from id you have to also have a to id Okay, and so we're also going to look at Uh the pages that receive page rank and we're kind of pre caching this stuff Okay, and so we're going to do a select distinct of from id and to id and loop through that group of things and um We're not going to we're making a links list here And so we're saying if the from id is the same as the to id we're not interested If the from id is not already in my from id's that I've got I'm going to skip it if the to id is not in the from id meaning that this is a to id that's not Also, we're not I don't we don't want links that point off to nowhere Or point to pages that we haven't retrieved yet and that's what this is saying So this is really going to give us it's a filter on the from id's and the to id's From the links table so that it only are the links that point to another page. We've already retrieved And then we're going to keep track of the entire superset of the two id's the destination id's And i'm just putting these all in lists so that I don't have to hit the database so hard Okay, so this is getting what's called the strongly connected component meaning that any of these id's there is a path From every id to every other id eventually. 
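The pre-caching filter just described (skip self-links, skip links whose source is not a known from id, skip links whose destination is not also a from id) can be sketched on plain lists. `filter_links` is a hypothetical name, and the pairs below are made-up data standing in for the rows of the links table.

```python
def filter_links(raw_links, from_ids):
    """Keep only links between pages that themselves have outbound links."""
    known = set(from_ids)
    links = []
    to_ids = []
    for from_id, to_id in raw_links:
        if from_id == to_id:
            continue              # a page linking to itself is not interesting
        if from_id not in known:
            continue
        if to_id not in known:
            continue              # points off to nowhere, or to a page not yet retrieved
        links.append((from_id, to_id))
        if to_id not in to_ids:
            to_ids.append(to_id)  # track every destination id once
    return links, to_ids

raw = [(1, 2), (2, 1), (1, 1), (1, 3), (2, 4), (4, 1)]
links, to_ids = filter_links(raw, from_ids=[1, 2, 4])
print(links, to_ids)
```

In this sample, page 3 never appears as a from id, so the link from 1 to 3 is dropped. The filter only checks that each kept destination has outbound links of its own; it does not by itself prove that every page can reach every other page.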
So that's called the strongly connected component in graph theory Then what we're going to do is we're going to grab the we're going to select new rank from pages Where for all the from id's right and so we're going to have a dictionary that's based on the Id the primary key That's what node is equals the rank and so if we look at our database That means that for the the part of the strongly connected component in links We're going to grab this number and stick it into a dictionary based on the primary key of this Based on the primary key this number right here So we're going to have a dictionary that's this map to that again. We want to do this as fast as possible Now we're only doing one iteration at the beginning. So it asks how many times you want to run it okay And so we just take it make an integer of that we check to see if there's any any values in there if there are no values we are bad And now we're going to go i equals one to range many this is going to be one to one so it might run however many times And then what it's going to do is it's going to compute the new page ranks and And so what it's really going to do is it's going to take the page rank the previous ranks and loop through them and uh This is the previous ranks is the mapping of primary key to old page rank Okay, and for each node We're going to have total equals total plus old rank and then we're going to set the next ranks to be zero Okay, and then what we're going to do is figure out the number of outbound links for each page rank item So node and old rank in the list of the previous ranks These are the ids we're going to give it to and so For this particular node We're going to have the outbound links and we're going to go through the links And not linked to itself although we we made sure that doesn't happen We make sure that this but then we're going to make a list called give ids which are the ids that node Is going to share its goodness And now what we're going to do is we're going to 
say how much goodness are we going to flow outbound Based on our previous rank of this particular node and the number of outbound links we have So that's how many Time that's how much we're going to give in our outbound links And then what we're doing is all the ids we're giving it to We started with the next ranks being zero for these folks These these are the the receiving end and we're going to add the amount of page rank to each one So whatever this is so we'll go through All of the links Give out fractional bits of our current goodness And it's accumulated in each one and so eventually All the incoming links will have been have granted each new link value, okay Now i'm just going to run through and calculate the new total um and And this is this evaporation. The idea is is that um You can't You can't it has to do with the page rank algorithm that you there are dysfunctional shapes in which page rank can be trapped and this evaporation is Taking a fraction away from everyone and giving it back to everybody else And so we add this evaporative factor and Then we're going to Do some computations just to show some stuff and that is we're calculating the uh difference The average difference between the page ranks and you'll see this when I start running it and that is telling us this is going to tell us the stability Of the page rank So from one iteration to the next The more it changes the least stable it is and you'll see in a sec that these things stabilize And then we're and we'd say what's the average difference in the page ranks per node, which is what this is And that's we're going to print and now we're going to take the new ranks and make them the old ranks and then Run the loop again. So i'm not actually updating the database each time through the page rank iteration But then at the very end I am going to do the update for all of these things and update all of the rankings Start to with a new rank. 
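One pass of the loop described above, done entirely in memory, might look like the sketch below. It is not the actual sprank.py code: `one_iteration` is a hypothetical name, the three-page graph is made up, and the evaporation step is one simple reading of the description, where any rank that was not handed out is spread evenly over all pages so the total stays the same.

```python
def one_iteration(prev_ranks, links):
    """Return (next_ranks, average change per node) for one page rank pass."""
    total = sum(prev_ranks.values())
    next_ranks = {node: 0.0 for node in prev_ranks}   # receivers start at zero

    for node, old_rank in prev_ranks.items():
        # The ids this node shares its goodness with.
        give_ids = [t for f, t in links if f == node and t != node and t in prev_ranks]
        if not give_ids:
            continue
        amount = old_rank / len(give_ids)             # split evenly over outbound links
        for to_id in give_ids:
            next_ranks[to_id] += amount

    # Evaporation (assumed form): share out whatever rank was not passed along.
    evap = (total - sum(next_ranks.values())) / len(next_ranks)
    for node in next_ranks:
        next_ranks[node] += evap

    avediff = sum(abs(prev_ranks[n] - next_ranks[n]) for n in prev_ranks) / len(prev_ranks)
    return next_ranks, avediff

ranks = {1: 1.0, 2: 1.0, 3: 1.0}            # everybody starts out equal
links = [(1, 2), (2, 1), (2, 3), (3, 1)]    # made-up graph
diffs = []
for _ in range(30):
    ranks, avediff = one_iteration(ranks, links)   # new ranks become the old ranks
    diffs.append(avediff)
print(diffs[0], diffs[-1], ranks)
```

Printing the average difference each time through is what makes the convergence visible: it starts large and shrinks as the ranks settle, while the sum of all the ranks stays the same.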
So I'm doing an in-memory calculation, so that this loop here runs screamingly fast. Even if I want to do this loop a hundred times or a thousand times, it's really all just in-memory data structures. Okay, so it's probably easier just for me to show you this. The code runs quite simply: python3 sprank.py. I'm only going to run it for one iteration, and that means this loop here is just going to run one time. It's going to start with the page ranks, the new rank of one, and it's just going to run one iteration and put the rank there, and then update this as well. So let's go ahead and run that for one iteration. Okay, so it ran one iteration, and the average change between the previous rank and the new rank is one. So it's actually quite crazy. I'm going to refresh here, and you'll see that the old rank was one and the new rank went way down, way down, way down, way down, a little bit down, down some, up a whole bunch, down, down, up. So you see that they went down and up. Now, the sum of all of these numbers is going to be the same, right? Because all it did was flow it out and recalculate it. That's what happens with page rank. And so if I run one more page rank iteration, these numbers will be used to compute the new new rank, and these will be copied to the old rank, and you'll see that they change again. So I'll just run it one more time. I'm going to run one iteration, and then we'll hit refresh. You see all these numbers got copied over, but now there's a new rank that's computed based on these guys. This one went up. This was 0.13; that's gone up a little bit. This one's gone up some more. This one's gone up. This one went down. Right. 
So this one went down from six to eight. And you can see that the average difference between this number and this number, across all of them, went from one point something to 0.41, and you'll see that with these very few pages this page rank converges really quickly. Okay, so let's run it again, and I'll just run 10, and you'll watch how this converges. Okay, so there you go. It converges. And you're seeing now, after like 12 iterations, that the difference between the old rank and the new rank... well, that's because it's that old rank. I'll run one more iteration so that you can see. So this is less than, you know, 0.005. Now you can see that these numbers are sort of stabilizing. That 0.005 number is the average difference between these two things. Now, if we're going to pretend to be Google for a moment, we can say python3 spider.py, and let's just do 10 more pages. What's going to happen here is these new pages are going to have page ranks of one. Okay, so let's get out. If I do a refresh now and look at new rank, there are these guys that have high rank, and what you'll see, I hope, yeah, okay, so you see the new pages, right? These are the new ones that we just retrieved. I don't know if they're linked or not, and they all got one. So some old pages are way up at 14, and some pages, if we go downwards, are way down, right? These are like useless pages. They point to somewhere, but nobody points to them. That's what happens with these page ranks. Okay, so what happens is the new records get this 1.0, and so if I run the ranking code again, let's just run five iterations, you'll see that the average delta goes up just briefly as it sort of assimilates these new pages, and then it goes right back down again. 
And so that's what's happening with Google. It's sort of running the spider to get more pages, then running the page rank, which gets disturbed a little bit but then reconverges very rapidly. Of course, they've got billions of pages and we've got hundreds of pages, but you get the idea. Okay, and so I can run this page rank like a hundred times, and after a while it's just hardly changing. So that's 2.7 to the negative tenth power. Now let me run it one more time to update the stuff, and if I refresh this, look at how stable these numbers are: 14 9 4 3 5 9 1 5 6 7, and the difference is there in the seventh one. So that's why this whole page rank is really cool. It seems like it's really chaotic when it first starts out, and away you go. Okay, so that was sprank.py. And spreset.py, we can look at that code; I won't bother running it. It just sets the old rank to 1. That's it. That's as much code as you've got. It just restarts it and lets it rerun. So I'm going to stop now, and I'm going to start a new video where I talk about this phase here, where we're actually going to visualize the page rank data.

And what we are in the middle of is the page rank code, and we just got done running the page rank. We've run the spider, we've run page rank a bunch of times, and spreset.py allows us to restart the page rank algorithm if we want, but we're not going to play with that. We're just going to play with spdump.py and spjson.py and do the visualization, which is the fun part. 
So I'll go into spdump.py. This is simple code, because it's really just running a SQL query and then printing stuff out. We connect to our database, create a cursor, and then just do a SELECT COUNT, and we're going to show the number of links. We order by the number of inbound links, descending, so we see the most linked things, and we'll see the top 50. So this is just a sample. You'll tend to write little helpers like this that make your life easier, just to show you the kinds of things that you want. So I run spdump.py, and you just kind of check that, oh, this looks right to me. And so here is the number of inbound links. That's my blog that has the most inbound links, followed by my uncategorized, whatever that is. These are the number of inbound links within my own blog somehow, because this is not looking at the whole internet at all. So there we go. That's spdump.py, pretty straightforward. And now we're going to go through the visualization process. This is going to look at all that data and produce a file, a JavaScript file, that will then be fed into my visualization using D3. spjson.py is going to do a big long join: it joins the links with the pages where html is not null and error is null, ordered by the number of inbound links. 
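The kind of query described here, counting inbound links per page and listing the most linked pages first, can be sketched as follows. The table and column names (Pages, Links, from_id, to_id, html, new_rank) and the example URLs are assumptions made for this sketch, and it uses an in-memory database rather than the real spider file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed schema, just enough columns for the illustration
cur.executescript("""
CREATE TABLE Pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
                    error INTEGER, old_rank REAL, new_rank REAL);
CREATE TABLE Links (from_id INTEGER, to_id INTEGER);
""")
cur.executemany(
    "INSERT INTO Pages (id, url, html, new_rank) VALUES (?, ?, ?, ?)",
    [(1, "http://example.com/", "<html>", 1.0),
     (2, "http://example.com/a", "<html>", 1.0),
     (3, "http://example.com/b", "<html>", 1.0)])
cur.executemany(
    "INSERT INTO Links (from_id, to_id) VALUES (?, ?)",
    [(2, 1), (3, 1), (1, 2)])

# Count inbound links per page, most linked first, top 50 only
cur.execute("""
SELECT COUNT(Links.from_id) AS inbound, Pages.new_rank, Pages.id, Pages.url
FROM Pages JOIN Links ON Pages.id = Links.to_id
WHERE Pages.html IS NOT NULL
GROUP BY Pages.id
ORDER BY inbound DESC
LIMIT 50
""")
rows = cur.fetchall()
for row in rows:
    print(row)
conn.close()
```

A page that nobody links to (page 3 here) does not appear at all, because the inner join only keeps pages that are the target of at least one link.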
So we're looking at the things that have the highest number of inbound links. We're going to read through all those rows and pull out the page rank for each one. We're looking for the highest and lowest rank, because these numbers can vary quite widely; they go all the way from, you know, 0.000 to 20 or 30. It asks how many you want to do, so it only does the top 20 or something, and you'll see why we need that in the visualization. This is just checking, and then we're going to write out a file. We'll see what the format of this is; it's just a JavaScript file. We're basically normalizing the rank. We're subtracting the minimum rank, because we're going to turn this into line weight, the thickness of the line. And so we're dividing, you know, we're normalizing the rank to be the thickness of the line and the size of the ball. You'll see all this. So this is really just writing some JavaScript with little strings and stuff like that. Then we finish the JavaScript, and then we write all the links out. So these are the balls that you'll see, and this is drawing all the lines, and this is again normalizing things for thickness and printing these things out. Now, I don't want to go through this in tremendous detail. 
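The normalization step, subtracting the minimum rank and dividing by the range so the result can be used as a line thickness or circle size, might look something like this. It is only a sketch: the function name normalize_ranks and the scale of 19 are hypothetical choices for illustration, not values taken from the lecture.

```python
def normalize_ranks(ranks, scale=19.0):
    """Map raw page ranks onto 0..scale for use as a drawing weight."""
    lo = min(ranks.values())
    hi = max(ranks.values())
    if hi == lo:
        # All ranks identical: avoid dividing by zero
        return {node: 0.0 for node in ranks}
    return {node: scale * (rank - lo) / (hi - lo)
            for node, rank in ranks.items()}


# Hypothetical ranks spanning a wide range, as described in the lecture
ranks = {1: 14.9, 2: 0.002, 3: 3.5}
weights = normalize_ranks(ranks)
for node, weight in weights.items():
    print(node, round(weight, 2))
```

Without this step, a page with rank 0.002 and a page with rank 14.9 could not both be drawn sensibly on the same picture.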
But I'll do python spjson.py, and let's do the top 20 nodes. If I take a look at this file, spider.js, you can see that it's some objects that basically put in the page rank and which id it is, and that's a way for me to be able to link back and forth. Weight is how big the little circle is. Then I have the links, and I only asked for the top 20, right? And then this is the thickness of the line, where the line starts, and where the line ends. Okay, so this is read by this HTML file, and it's going to read this force.js file and my own spider.js. The force.js is the visualization code, and this is D3, the visualization library. So I'm using d3.js, which is a really great visualization library, and this is just drawing the circles, making the circles of colors, making the circles bigger and smaller, and then connecting all the lines in between. This data feeds that thing, and so when we're all done you simply open force.html; you don't have to do anything else. And all this beautiful JavaScript stuff is like, oh wow, that's really cool, because you can move these things around. You can see the circles are bigger, and if you hover over it for a while, it shows you the big ones. You can see these things, and it's kind of cool. So I gave you all this, force.js and force.html, and that visualizes the page rank, and you could use this to visualize quite a bit of stuff. It'll take you a while to pull down enough data from a real website, but after you pull down four or five hundred pages, if you have some time, then the visualization is quite interesting. But you can see why we had to pull down several hundred pages just to get this much page rank information. Okay, so that gives you a sense of how to run the page rank code in Python for Everybody. So, thanks. 
Thanks for listening.

The last visualization application that we're going to take a look at is mailing lists, and that's kind of ironic: we started with mailing lists, and we're going to end with mailing lists. The mailing lists, of course, are from my open source Sakai project, which I love and am very proud of. What we're going to do is crawl the archive of a mailing list, and then we're going to do two visualizations. One is an activity visualization and another is a word cloud. Probably the more important thing is when I do the demonstration of how the software works. So this is a large data set, so you've got to be careful. This could spider gmane.org, which is a very free and friendly archive. This data originally came from gmane.org, but I've got a copy of it. gmane.org is not rate limited, but if everyone who is watching this starts spidering gmane.org at the same time, you will crash it. It just doesn't have the horsepower to give you this data that fast. And so I've got something that can give you the data super fast, has no rate limit, is on a really good server, and is cached all around the world using a technology called Cloudflare. So please, please, please don't point this at gmane.org; point this at the URL here at mbox.dr-chuck.net, etc., etc., and then you can run this as fast as you like. Now, another thing to worry about is if you have a metered connection. 
So don't do this on a cell phone connection, because you'll pay thousands of dollars, perhaps. Make sure you're on a no-cost connection before you start running this, because this is going to pull a lot of data down. If you just start this from scratch and let it run on a super fast connection, downloading the whole thing is probably about four hours. On my home connection, when I had about a 10 megabit connection, it took several days. So just understand that in this one it's both fun to deal with a ton of data and scary to deal with a ton of data. This one is big. You'll see the process in action because it'll run for a while; things will take a long time. So here's basically the flow of the data in this particular one. You're going to have the restartable spider that talks to the API at mbox.dr-chuck.net, which has a scalable copy of all this information. And again, it's going to build kind of a raw database, not a very clean database. It's sort of a mess. It has just enough columns to keep track of whether or not we've got this page. So this has the ones we've retrieved so far, and what gmane.py does is scan down to see where to retrieve next, get that, and then keep scanning and adding things here. So it just adds to it, and then it blows up, and then it comes in again and says, okay, I'll start here, and then it starts retrieving stuff and fills this in, fills this in, fills this in. Sometimes you put a delay in this so you don't overwhelm networks or servers. But basically this is pretty much a raw retrieval of the email messages, and this file can get rather large. This is the one that's greater than a gigabyte. Now, this data is actually really nasty. It's email data. 
The date formats changed. This is data that lasted from 2004 to like 2012 or 2013, and so this data has got a lot of things wrong with it. It even has things where people's email addresses changed, and so it has this mapping file that comes along with it. It says, here's this one person and here are the six email addresses that they used throughout the life of the project. So this is relatively complex. This part here is super slow, very slow. This part here is slow, but it'll take, depending on how fast your computer is, somewhere between two minutes and ten minutes. This first part will take days, perhaps, depending on the speed of your network connection. What gmodel.py does is read through this, and it actually wipes out and recreates index.sqlite every time it runs, so that you can change any number of things; you can re-spider things, you can do whatever. And this is one of those cleanup processes where you often have to tweak the cleanup: you look at your data and go, oh, the cleanup missed something, so I've got to run it again. So this produces index.sqlite every time it runs. gmodel.py is two to ten minutes, and it maps names, and when it's all said and done, this is very small and highly normalized. It's a nice data model. content.sqlite has an ugly data model; index.sqlite has a pretty data model. 
It's got foreign keys. It's got all this stuff, all those things we talked about in the database chapter, where it's efficient. So in your mind, keep track of how slow it is to scan all the data in a database with a bad model, and then watch when you run gbasic.py, which is a scanner, or gline.py, which produces line data, or gword.py, and watch how fast they run. They run in a couple of seconds at the most, and this runs in two to ten minutes, and the difference is that the data is efficiently modeled in index.sqlite. You can take a look at that using the SQLite browser and look at the data model, and you'll see it looks just like the stuff we talked about in the database chapter. It's got foreign keys and all those things. So that runs and you've got this, and then we do our visualizations and our analysis from this clean version of all the data. gbasic.py just loops through and prints some stuff out. It's a great way to test things. It's a pretty easy program to understand, and you could take a look at it. gline.py does some bucketing and makes some histograms to produce a line graph. gword.py does a different histogram, a histogram of word frequency, and the word frequency ends up in gword.js. And then we have two HTML files that use the d3.js visualization to produce a line chart and a word chart. In another video I will show you how this code works, which is probably more useful than this picture, but there's a whole bunch of good stuff in this particular application, and if you really understand everything in here, you can build a pretty sophisticated data retrieval and analysis pipeline. And so that's it. 
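To make the word-frequency histogram idea concrete, here is a small sketch. It is not gword.py itself: the function name word_counts, the minimum word length, and the sample subject lines are all invented for the example.

```python
import string
from collections import Counter


def word_counts(subjects, min_len=4, top=5):
    """Histogram of words across subject lines, most frequent first."""
    counts = Counter()
    # Drop punctuation and digits before splitting into words
    table = str.maketrans("", "", string.punctuation + string.digits)
    for subject in subjects:
        for word in subject.translate(table).lower().split():
            if len(word) >= min_len:   # skip short words like "re"
                counts[word] += 1
    return counts.most_common(top)


# Hypothetical subject lines
subjects = ["Re: gradebook export", "gradebook import bug", "Re: export bug"]
top_words = word_counts(subjects)
print(top_words)
```

A word cloud is then just a picture of this list, with the count controlling the size of each word.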
I thank you for watching all these lectures and look forward to seeing you on the net.

We're doing some code walkthroughs. If you want to get the source code, you can take a look at the sample code, download it, and work through it. What we're working on now is doing some retrieval and visualization of email data. It's kind of ironic: we are going to now look at the email data that we started with. It's the same Sakai developer list email data. There's this service called Gmane, and Gmane archives developer lists and various email lists, and I've made a copy of their data, because all the students in my class hitting their server with their API would crush it. So in order to be a nice guy, I put up a much more powerful server with just the data from this one list, and it's about a gigabyte of data. So be real careful if you're paying for network. The basic process we're going to go through is we're going to have a spidering process that's simple and restartable, focused on the network problems, pulling data into content.sqlite, and there's going to be a database there. This database is going to get large, about a gigabyte. Then we're going to have a cleanup process that kind of grinds through this data. It takes a while. It's going to read this mapping, and I'll show you that when it comes, because things like people's names have changed over all these years, and it does a cleanup and makes a really nice, highly relational version of this data, and then we visualize from here. So this could take you several days to finish. This will take a few minutes to run, and then this will just take seconds to run. This is a multi-step process because if you were running something for two days to produce a visualization and it blew up three quarters of the way through, it would do you no good. 
And so that's why we break this into simple parts. But right now we're just going to focus on this part right here and take a look at the mail bit, retrieve the mail, and then we'll have another video to talk about the rest of this stuff. Okay, so let's take a look at the code. Here is gmane.py. That is the basic code, and hopefully the stuff is starting to look familiar. The thing that's weird here is we've got to do some date-time parsing, and there is code out there for that, but you may have to install it, and I had to write my code in a way that didn't assume you could install the date-time parser. So even if that's not there, it uses my own date-time parser, and that's what this code is. Don't worry too much about that. And of course we have to deal with the lack of certificates inside of Python. And so we start things out, and this is really a simple table. We've got a Messages table that's got a primary key, the email itself, when it was sent, the subject, and the headers and the body. What we're going to do, because we have to pick up where we left off, is select the largest primary key from the Messages table and retrieve that, and then we're going to go to the one after that. So we know what the id is, and we're going to pick up where we left off. So we have a starting point, and it starts at either zero or one, and we're going to ask how many messages to retrieve. We've got some counters. And so we say, okay, select id from Messages where id equals whatever that start is; it's the highest number we've seen so far. If row is not None, that means we've already retrieved this particular email message. Otherwise we keep on going, we're in good shape, and this is one that we want to retrieve. And we subtract one from how many are left, and so this is the base URL. 
This is the URL of our API. I have a nice copy of all this data on a server that's accessible worldwide and won't crash. The format of this is you can say, I would like the email messages from one to two, or from, you know, message 101 to 102, and we can just kind of walk through these things. So that's the message id. To make the URL, we take the base URL, add the starting number, and then add start plus one; we've got the slash at the end of the starting number. That's how we form those. And we're going to retrieve that and decode it; we've seen this in some other ones. We check to see if we got legit data. If we didn't, like a 404 Not Found or something else, we're going to quit. If someone hits Ctrl-C, or Ctrl-Z, we'll get the program interrupt and we'll stop. If there's some other problem, we complain and keep going, and if we have five failures in a row we quit. But otherwise it will just keep on going, because these things do have glitchy bits. At this point, if we've made it this far, we've retrieved the URL and we've got the number of characters we've retrieved. If we get bad data, if it doesn't start with From, because this is a mail message and they all start with From space, then we tolerate up to five failures there for bad data, because it could be bad. Then we find a blank line, which is the newline at the end of one line and then a blank line, and we break this into the headers, the mail headers, which is this stuff right here, up to but not including the blank line, and the body is everything after that. So we just break that into pieces. Otherwise we'll complain and tolerate up to five 
failures. And then we're going to use a regular expression, from the regular expressions chapter, to pull out an email address from the From: line somewhere in these headers. From: is right there. It's going to find a less-than sign and pull this stuff out up to the greater-than sign. So you've got the less-than, you've got the parentheses, you've got one or more non-blank characters, followed by an at sign, followed by one or more non-blank characters, and we'll get back a list of those. We should only get one. If we find one, we grab the email, strip it, lowercase it, and if we've got some nasty little less-than sign in there, we'll tolerate that as well. So this is kind of cleanup, and you get used to this, where you're like, oh, how come all those email addresses have this other stuff in them? And then we also look for it when there are no less-than signs, and we do it this way, because some mail messages have it this way and others have it the other way. Again, you write this code after you watch it for a while and you're like, oh, it's crapped out and it's giving me bad stuff. And I make them all lowercase so they match better, and get rid of bad characters. So now I've got an email address. Then I look for the date, because I'm going to graph these by date. So I look for this line and use a regular expression to pull that out. I'm looking for Date: followed by a blank, followed by any number of characters, followed by a comma. I'm not interested in this Wednesday bit, so I'm skipping that bit right there and grabbing everything after that comma and space, from here to the end of the line, which is the newline. So it's going to pull this bit right here. That's the text. And this is where we say, oh, that's kind of a funky looking date, and we want to standardize that date. 
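The parsing steps walked through above (check for the leading From, split headers from body at the first blank line, then pull the address and the date text out of the headers with regular expressions) can be sketched as one small function. This is an illustration under assumptions, not the code from gmane.py: the function name parse_message, the exact patterns, and the sample message are all made up here.

```python
import re


def parse_message(text):
    """Return a dict of fields, or None if the text does not look like mail."""
    if not text.startswith("From "):
        return None
    pos = text.find("\n\n")          # end of a line followed by a blank line
    if pos <= 0:
        return None
    hdr, body = text[:pos], text[pos + 2:]

    # Address inside <...> first, then a bare address as a fallback
    match = re.search(r"^From: .*<(\S+@\S+)>", hdr, re.MULTILINE)
    if match is None:
        match = re.search(r"^From: (\S+@\S+)", hdr, re.MULTILINE)
    if match is None:
        return None
    email = match.group(1).strip().lower().replace("<", "")

    # Skip the day name; keep everything after the first comma and space
    match = re.search(r"^Date: .*?, (.*)$", hdr, re.MULTILINE)
    sent_text = match.group(1)[:26] if match else None

    match = re.search(r"^Subject: (.*)$", hdr, re.MULTILINE)
    subject = match.group(1).strip() if match else None

    return {"email": email, "sent_text": sent_text,
            "subject": subject, "headers": hdr, "body": body}


# Hypothetical message in mbox style
sample = "\n".join([
    "From jane.doe@example.com Sat Jan  5 09:14:16 2008",
    "From: Jane Doe <Jane.Doe@Example.com>",
    "Date: Sat, 5 Jan 2008 09:14:16 -0500",
    "Subject: test message",
]) + "\n\nBody text here.\n"

parsed = parse_message(sample)
print(parsed["email"], parsed["sent_text"], parsed["subject"])
```

In a real crawl the caller would count a None result as one failure and give up after several in a row, which is the five-failures rule the lecture keeps coming back to. Turning sent_text into an actual date is left out of this sketch.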
So we're going to, let's see, yeah, we're going to chop it off at the 26th character. I don't know why we care about the 26th character, but we chop it off there, and then we parse it, and that gives us back a nice clean date, sent_at. Otherwise it's going to complain, and if we can't parse it, then we tolerate five bad messages in a row before we quit. Then we're looking for the subject line using another regular expression. The subject line regular expression is pretty easy: there's a blank there, and it's the subject, up to but not including the end of the line. We pull that out and we get the subject. At this point we've parsed it and we've got good stuff, so we reset the fail counter, because I kept saying if you fail five straight times you quit. We print it out, and then we just insert that stuff. We've got the id of the message, the email address it came from, the time it was sent, the subject, and then basically the headers and the body, and we're just inserting it. And now, every 50th one, we commit, so that speeds things up, and every 100th one, we wait a second. So count is going up, up, up, and every 50th you'll see it pause, and then every 100th it'll pause for a second. Mostly that's to let me hit Ctrl-C, or to not overload any server. Okay, so that's the simple one. The problem is this data just gets ugly, and so you'll find yourself wanting to reset this and start it over. This one's going to work, of course, but these are hard to build, and that's why it's a good idea. python3 gmane.py. How many messages? Well, let's just do one. Okay, so it went and grabbed... oh, do I have this already running? 
51 through 52. Let me start over. Let's do ls *sqlite. Okay, rm content.sqlite. I must have run it to test it. So let's run it again: python3 gmane.py, and ask for one message. Okay, so there we went and got message one, from one to two. We got 2262 characters, and we printed out the email address, the time (which we got after all that hacking), and the subject line. So if we take a look at the database and go into gmane... well, every time you see the content.sqlite journal file, that means it needed to run a commit and it hasn't run a commit. But I'll hit enter and that will do the commit, and you see that vanish. So now I can open it, and I take a look, and how come there are no messages? Did that one not get stored in there for some reason? Need to refresh. Huh. Let's run it again. Maybe it didn't commit. Maybe I've got a bug in it. Let's make a change to the code. See this conn.commit()? I'm going to commit there. And the other thing I'm going to do is, every time I stop to read input, I'm going to commit right before I read. I hope that doesn't blow up; we'll see. The idea is, if I want to stop, I want to commit. So let's do this. Let's do one message. And now that I've put the commits in, it should be committed, and I think it will look better. I hit refresh, and there it is, because I committed it. And I don't have the journal file, so that's good. So it's a good idea to put those commits there. I'll just leave those commits in; when you download it, it'll have those commits in there. So again, I put a commit here and a commit at the very, very end, to make sure. I missed that. But now we get one, right? So let's just run it again, and you'll see how, by selecting the max of the id, it's going to select the max of this and then add one to it, so it picks up with the next one. 
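The two ideas from this demo, resuming from the largest id already stored and making sure a commit happens both periodically and at the very end, can be sketched together. Everything here is a stand-in: fetch_message is a hypothetical replacement for the network call (it pretends the archive has seven messages), the table has only two columns, and an in-memory database is used instead of content.sqlite.

```python
import sqlite3


def fetch_message(msg_id):
    """Hypothetical stand-in for the network call; None means no more data."""
    if msg_id > 7:
        return None
    return f"From someone@example.com\n\nbody of message {msg_id}"


def crawl(conn, many, commit_every=50):
    """Retrieve up to `many` new messages, resuming after the largest stored id."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS Messages (id INTEGER UNIQUE, body TEXT)")
    cur.execute("SELECT max(id) FROM Messages")
    row = cur.fetchone()
    start = row[0] if row[0] is not None else 0   # empty table: start from zero

    count = 0
    while count < many:
        start += 1
        text = fetch_message(start)
        if text is None:
            break
        cur.execute("INSERT OR IGNORE INTO Messages (id, body) VALUES (?, ?)",
                    (start, text))
        count += 1
        if count % commit_every == 0:
            conn.commit()             # periodic commit keeps long runs safe
    conn.commit()                     # final commit so short runs are saved too
    return count


conn = sqlite3.connect(":memory:")
print(crawl(conn, 1))   # stores message 1
print(crawl(conn, 2))   # resumes and stores messages 2 and 3
```

Without the final commit, a run that asks for fewer messages than the commit interval never saves anything, which is the kind of problem the missing message in the demo points to.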
So if I run it again and say give me one message, it goes two to three. And give me two messages, right, so I hit enter, and I can do refresh, and now you see we've got four messages. Okay, so let's just fire this baby up and tell it to get a hundred. Run, run, run, run. All right, it just goes and goes, and it pauses once in a while to do a commit, and if I made a commit every time... oop, it just paused there, and now it finished. So this will run and we'll get a bunch of data. The problem is, if I just run this, it'll take about five hours to get it all, and I've got a really fast connection. So I have got a file that you can download. Let's go find it. Let's see how long it'll take me to download this. I'm going to use the command line: curl, or wget is another command that we Linux and Mac people can use. You might have to use your browser to do it. Let's see how long this is going to take. It's saying a minute 30. Okay, well, I'll just wait and come back. Okay, so now that's done. I was averaging 10 megabits a second, and I downloaded about 600 megabytes. That will probably be slower for you. But now if I take a look, you're going to find that content.sqlite is 624 megabytes. What happened is I've pre-spidered this, and so now if you run gmane.py and ask for five more messages, it will pick up where I left off. So it's up to message 59,000, and, oh, you saw an error. So you saw a bug in that one. I don't know what's wrong with that one. At this point, we're going to have most of the data; it might find its way to the very end. Once you get this, it should be not too much more. I don't know. 
Maybe it's like 63,000 or something. So what we'll do is let that run, and we'll come back when that one's finished and run the next phase after it's got all of its data. Okay, so thanks for listening.

The work that we're doing right now is we are in the process of building a spider and a visualization tool for email data that came originally from this website, Gmane, but I've got my own copy of it. What we've done before is we ran gmane.py, and I have a URL that has all this data, and I downloaded that, and then I ran gmane.py again to catch up. It took quite a bit of catching up. Remember how I said it quits when it fails five times? Well, it ran out of data at 60,421, and then it started failing, and then it quit. So we pretty much have all of our data now. We have finished this process, and we have content.sqlite. If I take a look in the database browser, we can see we've got 59,823 email messages, and if I look at any of these things, you see the headers, you see the subject line, you see the email address, you see the body of it. So remember, I split each message in half, into the body and the headers. I made this as raw as I possibly could, because, as you saw, I had to spend so much time in gmane.py just to get my data successfully retrieved, and so I don't like cleaning the data up too much there. What we're going to look at next is the data cleaning process, and gmodel.py is what we're going to take a look at now. So let's get rid of those guys and look at gmodel.py. I don't think I need urllib in this code. Do I have any urllib? So I don't need that. Sorry. Fix that. Okay, so it's going to read from the database. 
It's going to use regular expressions, and zlib is a way to do some compression. In this one I'm going to compress some of the data; some of the text fields are going to be compressed. I wanted to keep these fields uncompressed inside of Messages. And so we have some cleanup functions that clean things up. It turns out that the way email addresses appear in this particular mail corpus changed over time. There are certain kinds of things: sometimes gmane.org is the email address, when people want to hide their address. I did all kinds of stuff: I split it, checked to see if it ended with this, and cleaned up just that thing. So I have all kinds of cleanup stuff going on in here, and I have this mapping and DNS mapping that I'll talk about in a bit, where organizations sometimes sent email with different addresses over time and people sent email from different addresses over time. And we're going to do the parsing of the date, and that is the code for that. I'm going to pull out the header information. This is sort of borrowed from the other code. We'll clean up the email addresses and the domain names, and we'll pull the date out, pull the subject out, get the message id, various things. So here's the main body of the code. We're going to go from content.sqlite to index.sqlite, and what I'm going to do every time is wipe out index.sqlite and drop the Messages, Senders, Subjects, and Replies tables. 
So this is a normalized database, in that it has foreign keys. There's a messages table here with an integer primary key, the GUID for the message (GUID stands for global unique id), the sent time, and the sender id. And it's going to have blobs, which are binary large objects, for the headers and the body, because I'm going to compress them in this database to make them smaller. Then the senders: each sender has a key. Then subjects: each subject line is going to have a key. And then replies are a connection from one message to another, so this is like a many-to-many.
Now, I also have this file called mapping.sqlite, so we can take a look at that one. It has two tables that I maintain by hand. Sometimes a domain changed, so this one was a domain that mapped to that one, indiana.edu. And then this table is the email addresses: these were a bunch of people whose email addresses changed throughout the project, and I sort of mapped them by hand. So I pull this in really quick. I read all this stuff from the DNS mapping, and other than stripping and making it lowercase, etc., I'm just going to make a dictionary, the DNS mapping, from the old name to the new name, and the email address mapping from the old name to the new name. And I'm using fix sender. Fix sender is there because the email addresses, even within Gmane, were kind of funky. So don't worry so much about this.
Okay, so I opened up a connection just to read all that stuff in. Now I'm going to actually open the main content, and how I'm asking it to open is a little trickier: I open it read-only. That was so that I could potentially be running the spider and running this at the same time. I get a cursor, and so I'm going to read through the content file.
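The normalized layout just described can be sketched roughly as follows. This is a minimal sketch, not the actual gmodel.py schema: the column names, the column types, and the use of an in-memory database are assumptions made so the example runs on its own.

```python
import sqlite3

# Assumption: an in-memory database stands in for index.sqlite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Wipe out and rebuild the index tables on every run.
# Table and column names here are illustrative guesses.
cur.executescript("""
DROP TABLE IF EXISTS Messages;
DROP TABLE IF EXISTS Senders;
DROP TABLE IF EXISTS Subjects;
DROP TABLE IF EXISTS Replies;

CREATE TABLE Senders  (id INTEGER PRIMARY KEY, sender  TEXT UNIQUE);
CREATE TABLE Subjects (id INTEGER PRIMARY KEY, subject TEXT UNIQUE);

CREATE TABLE Messages (
    id         INTEGER PRIMARY KEY,
    guid       TEXT UNIQUE,   -- global unique id of the message
    sent_at    INTEGER,
    sender_id  INTEGER,       -- foreign key into Senders
    subject_id INTEGER,       -- foreign key into Subjects
    headers    BLOB,          -- compressed text
    body       BLOB           -- compressed text
);

-- Each row connects one message to another message.
CREATE TABLE Replies (from_id INTEGER, to_id INTEGER);
""")
conn.commit()

tables = sorted(
    row[0]
    for row in cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

For the read-only open, one way to do it in Python is `sqlite3.connect("file:content.sqlite?mode=ro", uri=True)`; whether gmodel.py uses exactly that form is not something the walkthrough states.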
This is the big one. I'm going to go through every message and write all of these things into the index. I'm going to take all the email addresses and put those in a list. So I've got the mappings loaded, and now I'm going to go through every single message. I've got all the senders, all the subjects, and all the global unique IDs.
So I read in each message; I'm going through content one at a time. I parse the headers. I check to see if the sender's email address, after it's been cleaned up, is in my mapping: mapping.get(sender, sender), where the default is that I get back sender. That's what that's saying: look up sender; if it's in there, give me the entry for that key; otherwise, give me sender back. We're going to print every 250 things we do, and we'll complain if this is true.
Then we're going to go get the mapping for the sender, which is a way to look up the primary key. I could have done this with a database query, but I wanted it to be fast. That's part of the reason I read all these things in, so that I could have those mappings be really fast.
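The lookup-with-default idea above can be sketched like this. It is a small illustration only: the dictionary contents, the function name `canonical_sender`, and the order in which the two mappings are applied are all assumptions, not the real cleanup code.

```python
# Hypothetical stand-ins for the two hand-maintained tables in mapping.sqlite.
dns_mapping = {"mail.example.edu": "example.edu"}       # old domain -> new domain
mapping = {"jdoe@old.example.org": "jdoe@example.org"}  # old address -> new address


def canonical_sender(sender):
    """Return a cleaned-up address, falling back to the input when unmapped."""
    sender = sender.strip().lower()
    pieces = sender.split("@")
    if len(pieces) == 2:
        # dict.get(key, default): use the mapped domain if there is one,
        # otherwise keep the domain we already have.
        domain = dns_mapping.get(pieces[1], pieces[1])
        sender = pieces[0] + "@" + domain
    # Same pattern for whole addresses.
    return mapping.get(sender, sender)


print(canonical_sender(" JDoe@old.example.org "))
print(canonical_sender("amy@mail.example.edu"))
print(canonical_sender("pat@example.com"))
```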
You'll see this takes a little while, even though I've got all this stuff cached. So then, if I don't have a sender id, meaning that I haven't seen this sender yet, I'm going to do an INSERT OR IGNORE into senders, and then I'm going to do a SELECT. You've seen this before, where I grab the row back, and I'm really just trying to look at the recently assigned id. Then I'm going to not only set the sender id for this iteration of the loop, but also store it in the dictionary, and that builds the dictionary up.
You'll see the same thing is true for the subject id. I'm going to insert it into the subjects table and get a primary key if I don't know what it is, and then I put it not only into the database but also into my dictionary. I guess I didn't do it for the GUID.
Okay, so now what I have is the sender id and the subject id, which are foreign keys into the senders table and the subjects table. And I'm going to insert the message with the sender id, subject id, sent at, headers, and body. The values here are the GUID, sender id, subject id, and sent at. Now, this here is zlib compress. I'm taking the header and the body of the message, and this little bit ends up with a compressed version of them; you'll see it in a second. This keeps the size of these text fields down, at the cost of the computation to compress them, and to decompress them when we want to read them.
Then I pull out the id, which is the GUID, and I pull out the primary key for this message based on the GUID, and I update this dictionary. Okay, so let me run that code. It is doing a lot of cleanup, and I'll tell you, it took me a long time to make this work.
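The insert-then-select pattern with a dictionary cache, plus the zlib compression of the text fields, might look something like the sketch below. The table layout, the helper names `sender_key` and `store_message`, and the sample data are assumptions for illustration; the real code handles more fields.

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")  # stands in for index.sqlite
cur = conn.cursor()
cur.execute("CREATE TABLE Senders (id INTEGER PRIMARY KEY, sender TEXT UNIQUE)")
cur.execute(
    "CREATE TABLE Messages (id INTEGER PRIMARY KEY, guid TEXT UNIQUE, "
    "sender_id INTEGER, headers BLOB, body BLOB)"
)

senders = {}  # in-memory cache: sender address -> primary key


def sender_key(sender):
    """Look up the primary key in the cache, inserting the row if it is new."""
    sender_id = senders.get(sender)
    if sender_id is None:
        cur.execute("INSERT OR IGNORE INTO Senders (sender) VALUES (?)", (sender,))
        cur.execute("SELECT id FROM Senders WHERE sender = ?", (sender,))
        sender_id = cur.fetchone()[0]
        senders[sender] = sender_id  # remember it for later messages
    return sender_id


def store_message(guid, sender, headers, body):
    """Insert one message, compressing the headers and the body."""
    cur.execute(
        "INSERT OR IGNORE INTO Messages (guid, sender_id, headers, body) "
        "VALUES (?, ?, ?, ?)",
        (
            guid,
            sender_key(sender),
            zlib.compress(headers.encode()),
            zlib.compress(body.encode()),
        ),
    )


store_message("guid-1", "a@example.org", "From: a@example.org", "Hello " * 50)
store_message("guid-2", "a@example.org", "From: a@example.org", "Second message")
conn.commit()

# Reading a body back means decompressing it first.
cur.execute("SELECT body FROM Messages WHERE guid = ?", ("guid-1",))
restored = zlib.decompress(cur.fetchone()[0]).decode()
sender_rows = cur.execute("SELECT COUNT(*) FROM Senders").fetchone()[0]
print(senders, sender_rows, len(restored))
```

The second message reuses the cached key, so only one row ends up in the senders table.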
So this code that I'm running now... oh, don't forget to type python3, Chuck. This is going to print every 250, and it did all this pre-caching first. So that's how long it takes to do 250, and there are 60,000 in here, so this is really busy. The reason it's bouncing back and forth is that every time, it makes this journal file and then does a commit. So you can kind of see that it's busy making journal files and committing, and there's a lot of activity going on here. It just so happens that Atom shows me these files.
Okay, so it finished. It took about three minutes. If we take a look at the size of the files, we will see that the index is much smaller. It's fully normalized and still 263 megabytes, and it's all compressed. So let's take a look at that in the browser. It's over 200 megabytes, but it loads up a lot faster. There we go.
We have a senders table, which is just a many-to-one table. We have a subjects table, which is a many-to-one table. And we have messages, which has foreign keys. It takes a little bit to load that up. Okay, so we see the foreign keys for sender and subject, and all those foreign keys save us space. You can see that I can't read the headers and the body because now they're compressed, and that saves me a whole bunch of space too. So that's what's in that file, and we've finished this process, okay?
And we've finished modeling the data and making it really clean, and we'll pick back up with the rest: what we do next is the visualizing, pulling data out of index.sqlite. The idea is that this can be restarted; it can be run over and over. Even though it takes about three minutes to run this, that's way better than five hours. So three minutes versus five hours, and then you'll see that reading this takes seconds, because we've got it all nice and normalized in a quite pretty way. I hope this has been useful. In the next one we'll actually do the visualization.
We are in the process of retrieving data from this Gmane server, one that I've made a copy of. So far we have spidered it all and ended up with 600 megabytes of spidered information. We have run a rather complex cleanup process that you probably don't need to fully understand. You can look at it for patterns, but in general the cleanup process will be very sensitive to the data. And then we have this index.sqlite, which is 260 megabytes right now.
Now we are going to do the fun, easy bits, where we run little queries that just pull data out. These are much simpler. When I was writing this, I wanted to do some simple, basic calculations on the data, because I really was looking for anomalies. What was working? What wasn't working? So I wrote a series of really simple things like this gbasic.py. The gbasic.py code just gives me some basic data: I wrote things down and I counted things. Do I need urllib.request in this one? I don't think so. Let's fix that bug. It's not used; no reason to put any of that in there. So it reads index.sqlite, which is our cleaned-up data. It reads through and makes a dictionary.
This is a pattern you're going to see a lot, where I make a dictionary of id to senders to save myself repeatedly looking things up. I'm going to grab the subjects; I've cached them all. I could have done this all with SQL, but I just wanted to do things faster. Now I'm going to go through each of these messages and make a dictionary of them. I'm going to put a lot of stuff in memory, and then I'm going to do some counts. I'm going to see who has sent the most, both the people and the organizations.
So now I go through all the messages. You'll notice that I'm not selecting the body or the headers here; I am just getting the sender id and subject id. I probably could have done this with a join, and it would have been cleaner. You can make that change. So I'm going through all the messages, but not the body, so this is going to be really quick. I'm pulling out the sender's id, and I'm breaking the sender into pieces. See, my data is clean now; I cleaned it all up in the previous processes. If I don't have two pieces I continue, and otherwise I get the domain name. So I have the person, and I'm doing a basic dictionary histogram for the people and for the domains. Then I'm going to sort them with sorted: we grab the key, sort by how many there are, in reverse, and then print out the top few of the organizations and the top few of the people.
Okay, so we'll just run that code, gbasic.py. Let's have it dump out the top 10. So we loaded 59,000 messages, 29,000 subjects, and 1,800 senders, and figured out the top 10 people and the top 10 organizations. You can write various things like that which just scream through your data, and it's good to get sanity checking on your data. Okay, so that's gbasic.py.
Now I want to do gword.py, because that's kind of fun. I don't need urllib. Why do I keep putting urllib in all these things? So we'll get rid of that. This one is really simple, because I'm just going to go for the words in the subject line. I go through index.sqlite and read in all of the subjects, and I make a dictionary of those. Then I go through all the messages, and in this code right here I'm pulling out the subject based on the message. I'm doing this so that when a subject is used more than once, I count its words more than once.
This str.maketrans I talked about in an earlier chapter. It basically throws away punctuation and numbers, so that when I make my words, I don't end up with words that are just dashes. Then I strip the text and convert everything to lowercase; this is basically just to keep too many words from showing up. Then I do a split, and then I've got counts, a dictionary. So this is a no-punctuation, no-numbers dictionary count. Then I sort the words in reverse order, and then I figure out what the highest and lowest counts are by running through them in a loop. I could have probably done this with a max and a min if I felt like it. You know, I should have done a max and a min on that one. Why did I do that? But oh well. So now I have the highest and the lowest.
Now I've got to spread out the sizes. I'm going to produce this file gword.js, which is needed by the visualization, because it's going to use d3.js, a word visualizer. In gword.js I have to tell it how big the text is, so I'm doing some text size normalization. That took me a little experimentation. So if I run this now and I say python gword.js... no, python3 gword.py, which is a lot better. Okay, so now I can go look at gword.js, wherever that is. Yep. So this is basically it: it normalized all the frequencies and made them font sizes.
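The word-counting and size-scaling steps above can be sketched as follows. This is an illustration under stated assumptions: the sample subjects are made up, the 20 to 80 font size range and the linear scaling are guesses, and the exact shape of the JavaScript written to gword.js is not confirmed by the walkthrough. It uses max and min for the highest and lowest counts, as suggested above.

```python
import json
import string

# Made-up subject lines; in the walkthrough these come from index.sqlite.
subjects = [
    "Re: sakai 2.5 release",
    "sakai release notes",
    "Release notes, draft 3",
]

# Translation table that deletes punctuation and digits.
table = str.maketrans("", "", string.punctuation + string.digits)

counts = {}
for subject in subjects:
    text = subject.translate(table).strip().lower()
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

# Most frequent words first.
words = sorted(counts, key=counts.get, reverse=True)

highest = max(counts.values())
lowest = min(counts.values())

# Assumed font size range; spread the counts linearly across it.
big_size, small_size = 80, 20
sizes = {}
for word in words:
    if highest == lowest:
        sizes[word] = big_size
    else:
        fraction = (counts[word] - lowest) / (highest - lowest)
        sizes[word] = int(small_size + fraction * (big_size - small_size))

# Assumed output shape: a JavaScript variable holding text/size pairs.
gword_js = "gword = " + json.dumps(
    [{"text": word, "size": sizes[word]} for word in words]
) + ";\n"
print(gword_js)
```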
These are font sizes now. Okay, so this is just the data that's needed by gword.htm, which uses this d3 visualization word cloud code. This pulls in all my data, and then this is just some JavaScript that draws the picture on the page. The easy part now is to just open gword.htm in a browser; it just so happens that on a Mac I can do this from the command line. And that gives me a word cloud based on that data. It kind of randomizes and shows different stuff each time, but it's using this data to decide how big those words are, and then using a bit of randomness and simulated annealing to lay them out. That's not stuff that we actually have to worry about. Okay, so that's how we get to the point where we're seeing a word cloud from this.
Now we're going to do another visualization, and this time we're going to do a line visualization. We're going to create a thing called gline.js, and with another HTML file we're going to use d3 and produce that output. So let's say goodbye here. Goodbye, goodbye, goodbye.
So, gline.py. Get rid of that file. Again, I'm going to preload all of the senders, and again, I could have done this with a join, and probably should have done it once with a join. I'm going to preload all the messages: the sender id, subject id, etc. I'll load those up. Now I'm going to read through them. I'm going to split the senders and accumulate the sending organizations in a simple dictionary, getting the organization by splitting the person's address at the at sign. Then I sort them, pull out the top 10 organizations, and print those out. And now I'm going to break this down into months, and I'll show you what this looks like in a second.
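The join mentioned above as an alternative to preloading the senders, followed by the organization histogram and top 10, could be sketched like this. The table layout and sample rows are assumptions, and this is the alternative suggested in passing, not what gline.py itself does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for index.sqlite
cur = conn.cursor()
cur.execute("CREATE TABLE Senders (id INTEGER PRIMARY KEY, sender TEXT UNIQUE)")
cur.execute(
    "CREATE TABLE Messages (id INTEGER PRIMARY KEY, sender_id INTEGER, sent_at TEXT)"
)
cur.executemany(
    "INSERT INTO Senders (id, sender) VALUES (?, ?)",
    [(1, "alice@umich.edu"), (2, "bob@gmail.com"),
     (3, "carol@umich.edu"), (4, "malformed-address")],
)
cur.executemany(
    "INSERT INTO Messages (sender_id, sent_at) VALUES (?, ?)",
    [(1, "2005-12-09"), (1, "2006-01-03"), (2, "2005-12-21"),
     (3, "2005-12-20"), (4, "2006-02-01")],
)

# The join replaces the id -> sender dictionary: each message row
# comes back with its sender address already attached.
cur.execute(
    "SELECT Senders.sender FROM Messages "
    "JOIN Senders ON Messages.sender_id = Senders.id"
)

org_counts = {}
for (sender,) in cur:
    pieces = sender.split("@")
    if len(pieces) != 2:
        continue  # skip anything that is not name@domain
    org = pieces[1]
    org_counts[org] = org_counts.get(org, 0) + 1

# Sort the organizations by count, largest first, and keep the top 10.
top_orgs = sorted(org_counts, key=org_counts.get, reverse=True)[:10]
print(top_orgs)
```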
Let's go to gline.js. So the month looks like this. Okay, the month looks like that: it's the first seven characters of the date. If we look at the date, the date looks like that, and the month is the first seven characters. And this is the data that I've got to give it. We'll clean that up in a second; that data will look better in a moment.
Go back to gline.py. The key is a tuple, which is the month and which organization sent it, and it's only the top 10 organizations. So we're basically going to build a dictionary where the key is a tuple, and then we're going to sort it, by key in this case, not by value. That puts the months in order. Then we're going to write all this data out into gline.js. So let's go ahead and run this. Again, this is just the data, written in a way that the JavaScript can understand: python3 gline.py.
Okay, so those are the top 10 organizations. Let's take a look at that JavaScript. This is what it looks like. It just so happens that you've got to tell it the column names first, and then these are the data points.
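The tuple-keyed counting by month and organization can be sketched as below. This is a minimal sketch under assumptions: the sample messages are invented, the dates are assumed to start with year and month in the form 2005-12, and the header label and the exact layout written to gline.js are guesses at a format a charting page could read.

```python
import json

# Invented (sent_at, sender) pairs standing in for rows from index.sqlite.
messages = [
    ("2005-12-09T13:32:29-06:00", "alice@umich.edu"),
    ("2005-12-20T08:00:00-06:00", "carol@umich.edu"),
    ("2005-12-21T09:15:00-06:00", "bob@gmail.com"),
    ("2006-01-03T10:00:00-06:00", "alice@umich.edu"),
    ("2006-01-04T10:00:00-06:00", "dave@elsewhere.example"),
]
top_orgs = ["umich.edu", "gmail.com"]  # assumed result of the top 10 step

counts = {}     # (month, organization) -> number of messages
months = set()
for sent_at, sender in messages:
    pieces = sender.split("@")
    if len(pieces) != 2:
        continue
    org = pieces[1]
    if org not in top_orgs:
        continue  # only the top organizations get a line
    month = sent_at[:7]  # first seven characters, e.g. "2005-12"
    months.add(month)
    key = (month, org)
    counts[key] = counts.get(key, 0) + 1

# Tuples sort by their first element, then the second, so sorting
# by key puts the months in order.
sorted_keys = sorted(counts)

# One header row, then one row per month with a column per organization.
rows = [["Month"] + top_orgs]
for month in sorted(months):
    rows.append([month] + [counts.get((month, org), 0) for org in top_orgs])

gline_js = "gline = " + json.dumps(rows) + ";\n"
print(gline_js)
```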
These are the lines. So this is the year, then the line for the University of Michigan, gmail.com, swinsborg.com. The first column is the date, and then come that line's points and the next line's points. All this code was there to get the data into such a shape that I could produce this JavaScript file, because if I look at gline.htm, I need the data in that particular format. I make a line chart and I draw it with this data. I had to go read all the documentation to figure this stuff out, and I had to transform the data and make it pretty. It took me quite a while to get this to work.
This is not a JavaScript class, nor a class on how to visualize in d3, but basically we pulled all that stuff in, and here's the gline data that came from the JavaScript file. Then it converts the array to a data table, and that data table is what the line chart draws. So with no further ado, let's open gline.htm to show that data. There you go. That's the Sakai developer participation from 2005 through 2015, based on which organizations did the most commits in Sakai.
I know that I haven't done all this code full justice. There's a lot of code here. The fun is just to run it and see it, and then, when the time comes, to come back and see the techniques that are used when you're trying to build your own visualization pipeline. So I hope that you found this useful. This is a lot of code, hard to explain in 15 or 20 minutes, but I hope you take some time and look it over, and I hope you found all these videos useful. This is kind of the last walkthrough video for chapter 16 of the book, and so I hope that I will see you on the net.