Welcome back to the Introduction to Python and Programming course. This is chapter 6, in which we talk about bytes and text. Text we have seen before, as early as chapter 1, and we know that it is modeled with the str type. So what are bytes? Bytes are the underlying representation of text: this is how text is modeled with 1s and 0s in memory. We will look into this towards the end of the chapter, because bytes is quite a generic data type to know, and it is the one format in which you can basically transfer any data, for example, over a network.

So let's start simple with the str type. How do we create strings? We create a string by writing the text we want to model between double quotes. Let me execute this cell. What this does in memory is create a box that is rather big and put the text lorem ipsum and so on in it. We know already that the object has a type, which is the str type, or string type. After this object is created in memory, a new name is created, which is called text, and this name references the object in memory. How do we know that it is an object? Well, everything in Python is an object, and as such it has three properties: an identity, a type, and, of course, a value.

Now, I already mentioned briefly that we write text with double quotes, but if I look at the value of the text variable here, it is given back to me in single quotes. Why is that? Let's look at another example. Here we have a text that models a quote by Albert Einstein. It says: Albert Einstein said, if you can't explain it, you don't understand it. Because it is a quote, it contains quotation marks as well. However, because we use the double quotes as the delimiter that marks where the text starts and where it ends, we cannot use a plain double quote in the text itself.
So what we do instead is we have to escape it. Escaping means that we put a backslash in front, and the backslash tells Python that the following character, in this case a double quote, is not to be interpreted as the delimiter but as the double quote character itself. If I execute this code cell, I get back a text string which says Einstein said, and then we have a double quote here. So it works.

But wait a minute, what do we see here? We are given the text string back with enclosing single quotes. And because we have the English words can't and don't in it, with the apostrophe written as a single quote, now the single quote must be escaped. Python does that for us by putting a backslash before each single quote.

So you may wonder: why do I write the string with double quotes and Python gives it back to me with single quotes? Well, this is just Python's default. In fact, the double quote and the single quote are perfect substitutes. If I execute the next code cell, where I just switched the double and the single quotes, I see that it also works. And now I get back the string in the exact same way as I defined it. So both notations are equivalent.

For me, the convention is that I prefer double quotes around a string. This has various reasons. One reason is that I use a code formatting tool called black, which is very popular at the moment. You can just search the Internet for Python and black and you should already find this tool. What the tool does is this: I write code, I give it my code, and I get the code back formatted in one unique way. If I, for example, use two spaces instead of four for indentation, then the code that I get back from the formatting tool will have four spaces. So there is exactly one way to format the code.
This is why the tool is called black: Henry Ford said you can have the car in any color as long as it's black. That is a famous quote attributed to the founder of the Ford company. What he implies is that, in order to be efficient in production in the facilities at Ford, you don't want to have too much variety in the cars. And this is why Ford back then chose that you can have the car in any color as long as it's black. In the same way, I can write code any way I want, but the formatting tool gives me back the code in one unique way. The black tool is one of the most popular tools out there at the moment, and it defaults to double quotes. That's just the convention.

I have also seen the double quote convention used in many software projects. The reasoning goes like this: software predominantly uses English text, and in the English language we have the single quote a lot. So we use the double quote as the delimiter so that we don't have to escape all the single quotes. However, that is a weak argument, because you could just as well argue that in any text you are most likely going to use a double quotation mark at some point, too. So you could also make the counterargument here. In other words, this is an example of where a programmer may have a very strong opinion but cannot really justify this opinion with an argument.

So syntactically, both of the cells that you see here are perfectly equivalent. They are perfect substitutes for each other. And Python defaults to the single quote notation when it shows a string. We just have to get used to it. The developers of core Python had to make a choice, and they chose the latter. In this book, however, I prefer the double quote notation. The double quote notation also has one big advantage.
If you know that your code base only uses double quotes, you cannot confuse one double quote with two single quotes, and two single quotes would, of course, imply a string with no characters inside. A string with no characters inside is a lot easier to spot in the double quote notation because it is clearer. However, again, this is an argument that you can hold against me, because it's actually not a good argument. It's just a convention.

No matter which of the two strings we use, if we print it with the built-in print function, we see that the sentence is printed as we would expect, and neither the single quotes nor the double quotes are escaped. So the text appears on screen as we want it to.

Just to reiterate an important point: the first two code cells are expressions that evaluate to a text string, just like the lorem ipsum cell did. And because we don't capture the result in a variable, we get it back as output. That's why we see Out here and Out here. The print function, in contrast, processes the object that its argument evaluates to and does not return it as a value. The print function has no return value, at least not an explicit one. It only has a side effect, and we can see that side effect here: the text is printed to the screen, but I don't see an Out here. That's just a minor observation.

So how else can we create text strings? Well, we can use the str built-in constructor. That's a built-in callable. Just like we saw the int and the float constructors in chapter 5, now we have the str constructor. For example, I can pass it the integer 42 and I get back text data, a text string with the value 42. The same goes for floats. The same goes, interestingly, for lists. And the same goes for booleans as well. And it would also work for None.
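The points above about quoting, printing, and the str constructor can be sketched in a few lines. The exact wording of the Einstein string and the float, list, and boolean values are assumptions for illustration; only the integer 42 is taken from the lecture.

```python
# Two spellings of the same text: only the delimiters (and the escapes) differ.
double = "Einstein said, \"If you can't explain it, you don't understand it.\""
single = 'Einstein said, "If you can\'t explain it, you don\'t understand it."'
same_text = double == single

# repr() gives the notation a notebook shows as cell output,
# with enclosing single quotes and escaped apostrophes.
shown = repr(double)

# print() writes the plain text to the screen as a side effect
# and has no explicit return value, so the call evaluates to None.
printed = print(double)

# The str constructor creates a text representation of other objects.
converted = [str(42), str(4.2), str([1, 2, 3]), str(True), str(None)]
print(converted)
```

Here same_text confirms that both literals model one and the same string, while printed shows that nothing is returned by print.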
So if I replace True here with None, it also works. This has to do with the fact, as we shall see in chapter 10, that any object should have what is called a text representation. We have seen that integers have a so-called binary representation. We saw that in chapter 5. These are the 0s and 1s in the box that make up the value of the integer. The same holds true for floats: a float also has a binary representation. Anything must have a binary representation at the end of the day, because a computer can only model 0s and 1s.

But we also saw that it's kind of tedious to always write out all the 0s and 1s if we are interested in how something is represented in memory. Because of that, other representations exist, most notably one that we also saw in the previous chapter, the hexadecimal notation. That is basically a number that, instead of using just the 10 digits we use as humans when we count, uses 16 digits: 0 through 9 and then A through F. And on top of the hexadecimal representation, there is yet another representation, which we will call the text representation. The text representation will become a bit clearer in chapter 10. For now, just note that basically any object in Python should be representable as text somehow.

How else can we get text strings? Well, we can use the built-in input function, which takes input from a user. The input function takes a string itself as its argument and uses this text to prompt the user. So if I execute the cell, I see the exact same message that I entered here as a text string. And now I can enter anything I want. For example, I type I am text and hit enter. If I now check the type of the variable user_input, or rather the type of the object that user_input references, it is of type str. And the value of user_input is, of course, I am text.
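As a rough sketch of this behavior: the helper ask and its reader parameter are hypothetical names introduced here so that the example runs without a keyboard; by default the helper simply calls the built-in input. The prompt texts and replies are made up as well.

```python
def ask(prompt, reader=input):
    """Prompt the user via `reader` (the built-in input by default)."""
    return reader(prompt)


# Stand-ins for a user typing a reply (hypothetical, for a non-interactive run).
typed_text = ask("Type something: ", reader=lambda prompt: "I am text")
typed_nothing = ask("Type something: ", reader=lambda prompt: "")
typed_digits = ask("Enter a number: ", reader=lambda prompt: "10")

# Whatever is entered comes back as a str object, even digits or nothing at all.
reply_types = [type(typed_text), type(typed_nothing), type(typed_digits)]

# If a number is needed, the text has to be converted explicitly.
number = int(typed_digits)

print(reply_types, number)
```

In an interactive session, leaving out the reader argument makes ask behave like a plain call to input.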
And note that the input function returns text no matter what. Even if I enter nothing at all, I still have the str type here, and I still have a text string as output, but now the value is the empty string, a string that contains no characters.

We saw input before in chapter 4, in the section on indefinite loops. There was an example on guessing a coin toss, and in that function we used input to ask the user whether they want to guess heads or tails. Heads and tails are, of course, text strings as well. But if we want to use input to obtain, for example, an integer from a user, we can do that as well. How would we do that? We would change the cell a bit. We would say enter a number, and then we would wrap the input function with the built-in int constructor. Let me execute the cell again. Now let's say I type the number 10. The type of user_input is now int, and user_input is the integer 10. So I can adapt the input function by wrapping it. But by default, the input function always returns text.

Another way of creating text strings is by reading in files. I'll give you an example. In the project folder, you will find a file called lorem_ipsum.txt. I have opened this file here in the Sublime Text editor. It's just dummy text at the end of the day. As we see, it consists of seven lines: the last line is empty, and before that we have six lines of dummy text.

So how can we read this data file into Python? Well, there is the open built-in. We just call open with a text string, which is the path to the file. And because I started Jupyter Notebook in the project folder, I don't have to specify a location. I can just specify the name of the file. Whatever is returned from open will be stored in a variable called file. So let's do that. The next question is: what is file? Let's ask Python what its type is.
And Python tells me it's a TextIOWrapper. That is weird for now, because I told you that it's a text file, and when we open a text file, we expect to read in text, right? So what is file, really? If I check the value of file, we see that, still, it's a TextIOWrapper. That's the type. And then it says the name of the file is lorem_ipsum.txt, the mode is r, and the encoding is something called UTF-8.

So we have to understand a little bit more here. First of all, what does the mode r mean? Mode r means we open the file in read-only mode. I can now look at the text inside the file, but I cannot change it. What the encoding is, I will explain towards the end of the chapter.

We can, of course, use methods and attributes on the file object to check certain things. For example, I can call the method readable, which returns True or False. It basically answers the question: am I allowed to read from this file? Because the file is opened for reading, readable returns True. Then there is also writable, and the writable method returns False. Together, the two tell me that I can only read the file and cannot write into it. There is an attribute called name, so I can also access the name as a text string on the file object by saying file.name. And lastly, I can also get the encoding of the file, which again is UTF-8 and will be explained later.

So the file object can do a whole lot. But let's see what the file object really is. What happened in memory is that an object was created, and the object can do something, but we don't really know yet what it can do. Let me for now just write lorem_ipsum.txt in the box. That's the path to the file. And this object is of type TextIOWrapper, so let me just abbreviate that with TextIO. Then this was stored in a variable called file. So let's create the name file and make it reference the object.
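The checks just described can be sketched as follows. To keep the example self-contained, it first writes a small stand-in for lorem_ipsum.txt into a temporary folder; that folder, the file's contents, and the explicit encoding argument are assumptions made for illustration and not part of the lecture's setup.

```python
import os
import pathlib
import tempfile

# Hypothetical stand-in for the lecture's lorem_ipsum.txt in a temporary folder.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "lorem_ipsum.txt")
pathlib.Path(path).write_text("Lorem ipsum dolor sit amet.\n", encoding="utf-8")

# The mode defaults to "r" (read-only); the encoding is spelled out here
# so that the result does not depend on the machine's default.
file = open(path, encoding="utf-8")

facts = {
    "type": type(file).__name__,   # the class name of the proxy object
    "mode": file.mode,
    "readable": file.readable(),
    "writable": file.writable(),
    "name": file.name,
    "encoding": file.encoding,
}
print(facts)

# Close the file so that the temporary copy can be removed again.
file.close()
os.remove(path)
os.rmdir(folder)
```

Note that none of these calls touches the text inside the file; they only describe the file object itself.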
So that is what happened in memory. Now the question is: where is all the text that is in the file? How can we access it? By default, you should think of file objects like this: the file object acts as a so-called proxy, a proxy to the outside world in a way.

Here is how I think of it. Let me draw a border here. Okay. On the left-hand side, this is Python, the Python process, or the Python world, so to say. On the right-hand side, I just write OS for operating system. In my case, it's a Linux system. In your case, it's probably a Mac or a Windows system. When you start, for example, Python, the operating system reserves a certain area in memory, gives it to this Python process, and says: within this memory space, you can do whatever you want, Python. This is the space that we have used so far, with a small part that holds the list of all the names and the bigger part of memory that holds all the objects. However, that doesn't mean that this is the entire memory of our machine. In fact, the OS uses a lot more. In particular, the OS also provides access to the files on disk.

So, in a way, I think of it like this. There is a file. Let's draw it. Let's zoom out a little bit first. Okay, let's draw the file like this, with some text data in it. This is somewhere on disk. We don't know exactly where it is, but it's definitely not inside Python. When we open the file, what we get is, let me use a different color here, green, kind of a tunnel, so to say, to the outside world. So the TextIO object can look outside the Python memory, and right now it looks at the beginning of the file on disk. Basically, it looks at the first character in the file.
So now, when I want to get to the contents of a text file in Python, I can just loop over the file object. I write for line in file here, so I'm looping over file, and then I just print every line. Let's execute this. Here are all the contents of the file that I just had open in the other text editor as well. Okay.

Note that in chapter 4 we used the technical term iterable. An iterable, I told you, is anything that can be looped over. In that sense, a file object is also an iterable: we can loop over it.

So why does Python do that? Well, first of all, you can easily have files on disk that are larger than your computer's working memory. If opening a file immediately copied all of its contents into memory, there would be a risk that your computer dies because it runs out of memory. So it is a smart decision by Python to just give you back a so-called proxy object with which you can look outside, and then you, as the programmer, choose how to use the proxy object.

What I did here is loop over the file. In a way, I started at the first line, then in the second iteration of the loop I read the second line, in the third iteration the third line, and so on. What happens if I execute the for loop again? I don't get any output, and that's interesting. The reason is that during the first time around, when I read the file, this thing through which I see the file moved down line by line, and it ended up here. So at the end of the day, I'm not at the top anymore. My file proxy object is now looking through this tunnel at the last line, actually beyond the last line: it's to the right of the last character in the last line.
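A minimal sketch of this behavior, again using a small temporary stand-in for the lecture's file (its three lines and its location are made up for illustration):

```python
import os
import pathlib
import tempfile

# Hypothetical three-line stand-in for lorem_ipsum.txt.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "lorem_ipsum.txt")
pathlib.Path(path).write_text("first line\nsecond line\nthird line\n", encoding="utf-8")

file = open(path, encoding="utf-8")

# First pass: the proxy moves through the file one line per iteration.
first_pass = []
for line in file:
    first_pass.append(line)

# Second pass: the proxy already sits at the end, so the loop body never runs.
second_pass = []
for line in file:
    second_pass.append(line)

print(first_pass)
print(second_pass)

# Close the file so that the temporary copy can be removed again.
file.close()
os.remove(path)
os.rmdir(folder)
```

Each element of first_pass still carries its trailing newline character, because the file object hands back the lines exactly as they are stored.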
So if I now run the for loop again and again, no matter how often I run it, I don't see any more contents. The file proxy object tries to continue reading the file, but there is nothing left to continue with, which is why I don't see any output.

Nevertheless, we have also learned that after a for loop is over, the target variable that is set in every iteration of the loop still exists. So what is line? Obviously, line is still set to the last line in the file. And if I ask Python for the type of line, it is of type str. So by looping over a file object, I naturally create string objects as well.

Now the question is what we can do with this file object once we are at the end of the file. First, and I will not show this here, we could of course tell the file proxy to move the tunnel back to the top and start reading again. We could do this, but usually we don't need or want to.

Also note that as long as this tunnel exists, we say that the file object is still open. The file object comes with an attribute called closed. If I evaluate file.closed here, I get False, and that means the file is still open. The tunnel is still alive. Why is that good or bad, and what is the implication? Well, in the way operating systems work, the number of open file handles, of tunnels from one process to the operating system, is limited, and it's limited to a rather low number. I think on many operating systems a process can have at most around 1,000 files open at once, and that is a low number. So after we are done reading a file, we should of course close it. And how do we do that?
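The state of the file object after the loop, and the effect of closing it, can be sketched as follows. The temporary two-line file is a made-up stand-in for the lecture's lorem_ipsum.txt.

```python
import os
import pathlib
import tempfile

# Hypothetical two-line stand-in for lorem_ipsum.txt.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "lorem_ipsum.txt")
pathlib.Path(path).write_text("first line\nlast line\n", encoding="utf-8")

file = open(path, encoding="utf-8")
for line in file:
    pass  # just move the proxy to the end of the file

# The loop's target variable survives the loop and holds the last line read.
last_line = line
closed_before = file.closed   # the tunnel is still alive

# Closing removes the tunnel and frees the operating system's file handle.
file.close()
closed_after = file.closed

# Any further read attempt on the closed file object raises a ValueError.
try:
    for line in file:
        pass
    message = None
except ValueError as error:
    message = str(error)

print(repr(last_line), closed_before, closed_after, message)

os.remove(path)
os.rmdir(folder)
```

The names last_line, closed_before, closed_after, and message exist only to make the individual observations visible in one run.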
We call the close method on the file object. If I now ask Python whether the file is still open or closed, Python says yes, it is closed. By calling close, the tunnel here gets removed, and now the file is closed. Let's go back to the for loop: if I now try to loop over the file object again, I get a ValueError, because it says I'm doing an I/O operation, I/O means input/output, on a closed file, and that's of course not possible. So the file proxy is now closed. We try to read the file, but we cannot read it because we closed it. Okay.

Because I just told you that you should always close a file, there is a better way to do that in Python code. We can use the open built-in with what is called a context manager; in other words, with the with statement. We have a header line here that starts with with, then we have open and the path of the file, and then we say as file, colon. Then we indent the for loop by four spaces. What this means is that the code here, the for loop, is executed within the context manager, and the context manager is created by the open function. In other words, open returns a file object, which, as we saw, is of type TextIOWrapper, and a TextIOWrapper object can act as a so-called context manager. Whenever an object can act as a context manager, we may use it within the with statement, and that just means executing code within the context of something.

So what does the context manager do? First, I execute this code cell, and the file gets opened a second time. Because I'm using the file variable again, what actually happens in memory, I don't draw this here, is that we create a new TextIO object somewhere else in memory, then we make the file variable name reference the new object, and then we would
remove the old object, just as we do with objects all the time. And then the with statement automatically closes the file. So after we leave the context of the context manager, the file is automatically closed. I can verify this by asking the file if it's closed, and the file object confirms: yes, I'm closed.

This is a nice feature of Python, and you will see many context managers in many different contexts, so to speak. The context manager that comes with a file object is just one example. Usually, a context manager has two jobs: the first job is to set something up before something is done, and the second is to tear something down after it has ended. The setup doesn't really happen here, because for the setup we just have to open the file, and this already happens when we call open. But for the teardown that happens at the end, the context manager basically calls the close method on the file object automatically. That's all it does.

It would even close the file if I ran into an error. For example, if I write 1 over 0 here, which raises a ZeroDivisionError, let's do this, I get a ZeroDivisionError, and even then the file object is closed. So even if an exception occurs within the context manager, within this code block, the context manager handles it: it closes the file first, and the exception is then raised as usual, so that we still see it at the end of the day. Even though we saw an exception, the file is closed, and that's a good thing. That's what we want. So, as a best practice, if you want to read in files, do it with the with statement.

Of course, instead of looping over a file object, we can also do this: here I create a new file object, I open the file, and I use, for example, the read method. The read method on the file object takes an integer argument, so I say 11, and what that means is it
basically says to Python: give me the first 11 characters in the file. Let's execute this. If I now call this method again, I get the next 11 characters. So this is like the tunnel that we saw here, but the tunnel is now not moving line by line; it is moving by a certain number of characters. Whether there is a newline or not, we don't care. The lines basically get ignored. We just move along the characters in increments of 11 in this case.

Then there is a readline method. The readline method continues to read from where we left off until it hits a newline. So what we see here is just the remainder of the first line. Let me compare this with the original file: the first 11 characters are here, the next 11 characters are here, and then the readline method showed me the end of the first line. That's what I'm seeing here. If I run the readline method again, I just get the next line, the second line in this case.

And there is one more method, which is called readlines, in the plural. It gives me back all the remaining lines in a list object. So now I have a list object that contains all the remaining lines. And of course, because we opened the file and we now know that we should close it, I just close it here manually.

This was just to show you that there are alternatives to looping over a file. Looping is maybe the nicest way. If you know ahead of time that you want to process a file line by line, then just use the for loop. But if you need to process a file in a different way, then it is maybe better to do it with the read, the readline, or the readlines method, depending on what the goal of your program is.

These were the most common ways of creating strings in Python. Now let's analyze a little bit what a string is. First of all, why is a string called a string? Well, string is just a formal word that exists for historic
reasons, and the more modern word, which means exactly the same in computer science, would be sequence. So string and sequence are synonyms, but, as I said, for historic reasons, when we talk about text data we usually talk about strings.

So how should we think of text? I have written the headline a string of characters here, and this is really how we should view text. Text is nothing but a sequence, and we will learn in the next couple of chapters, in chapter 7 in particular, what a sequence is. Of course, we should always interpret a text as a whole, but it's more important to also see the individual characters in a text. A text is something that is made up of individual characters. That's the view we should have.

Here is our example text from the beginning. I told you it's a string of characters, and I also told you that a string is more formally known as a sequence. Why do I use the term sequence? Being a sequence means that a text has four properties. The first property is that the text has to be finite. Any sequence must be finite. If I call the len function and pass it text, I get back 27, because 27 is the number of characters in the text.

We saw len before, but when we used it before, we called it with a list object passed in. We used it, for example, to calculate the length, the number of elements, of the list numbers in the first chapter. So what we see here is that the len function accepts a list, but it also accepts a string. So what is the type that the len function accepts? The answer is that we see an application of duck typing here. Duck typing means we don't really care what type of object we pass in. We care more about how the object we pass in behaves. And in one regard, namely being finite, a list object and a string object behave in the same way. They have in common that they are both finite, and because they know the concept of length,
that's basically the abstract idea behind it, and because they both know the same concept, in this regard they both work with the len function.

Another thing we can do with text is loop over it. In chapter 4, I told you that anything we can loop over is a so-called iterable. Because of that, a text string itself is an iterable. What I do here is print the individual characters in text, and I put some more space between them. Note that as I loop over text, in every iteration of the loop I get exactly one character back. This is what I meant when I said that we should view text as a string of characters. It's the characters that really make up the text string, and because of that, looping occurs in iterations over individual characters. That's the second property: the first one is knowing the concept of length, the second is being an iterable.

The third property is that sequences, and therefore strings, are ordered. How do I see the order in the text string? Well, if I loop over text here, I see that the characters come in the same order in which I defined the text. This seems trivial, but it's an important fact. And whenever something is in order, I can also loop over it in backward order. Maybe you find the word reversed, or the more technical term reversibility, a bit abstract. For something to be reversible, it has to have a forward order to begin with. Otherwise, we couldn't reverse it. So when I say I can loop over text in reversed order, that basically means we have an order. That's the third property of a string: it is ordered.

And then there's the fourth property. We saw this before in chapter 4: when I talked about iterables, I made a distinction between iterables and containers. Containers are any data type that contains, or consists of, other objects. In this case, a text string, as I said, consists of individual characters. And how can we check if an individual character is contained in a text?
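A small sketch of the four properties follows. It assumes that the example text is "Lorem ipsum dolor sit amet.", which is consistent with the length of 27 mentioned in the lecture but is not spelled out there; the substring in the last check is a made-up counterexample.

```python
# Assumed full example text; it has 27 characters, as stated in the lecture.
text = "Lorem ipsum dolor sit amet."

# 1. Finite: the string knows the concept of length.
length = len(text)

# 2. Iterable: every iteration of the loop yields exactly one character.
forward = []
for character in text:
    forward.append(character)

# 3. Ordered: the characters can also be visited in backward order.
backward = []
for character in reversed(text):
    backward.append(character)

# 4. Container: the in operator checks for single characters and substrings.
has_upper_l = "L" in text
has_ipsum = "ipsum" in text
has_other = "some other text" in text

print(length, forward[:5], backward[:5])
print(has_upper_l, has_ipsum, has_other)
```

The same four checks also work on a list object, which is the duck typing point made earlier: what matters is the behavior, not the concrete type.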
Well, I can use the in operator. I can, for example, ask: is the uppercase letter L in text? And I get back True. That's a boolean expression at the end of the day, so it evaluates to either True or False. I can also ask, for example, whether the word ipsum is in text, and I also get back True. So I cannot only check if individual characters are contained in a text, but I can also check if so-called substrings are contained in a bigger string. The property that I want to emphasize here is the container property: text contains other things. And then, of course, I can check some random text, veni, vidi, vici, which Julius Caesar famously said. This is of course not contained in the text we are given, which is why I get back False.

So, just to summarize: what is a string? A string is another word for sequence, and a sequence is any data type that has the following four properties. The first property is that it knows the concept of length. We can check that with the len function. The second property is that we can iterate over it. It's an iterable. We saw that with the first for loop. The third property of a sequence is that it must be ordered. We basically saw that with the for loop as well, because we obtain the characters one by one in forward order. But we can always check if something is ordered by passing it to the reversed built-in, and then we can loop over it in backward, or reverse, order. Whenever we can go over something in reverse order, we know it must have a forward order, because otherwise we couldn't have a reverse order. And the fourth property, again, is the container property. That means a sequence is always something that contains other things, other objects most of the time.

So these are the four properties. They are absolutely worthwhile to know, because everything that fulfills those four properties is considered a sequence, and whenever something is considered a sequence, then many nice
things follow from that. For example, because strings are sequences, we can exploit some of the four properties that I just mentioned. I can, for example, index into a string. We saw indexing before with lists, and we saw that we have to index with zero-based indices. So if I want to obtain the first character, I index with the number 0, and I get back the uppercase L.

Indexing only works because we have an order. If the sequence were not ordered, how would I know what the first letter is? You may now wonder, because so far you haven't seen anything that is unordered. Let me tell you: just because we haven't seen it doesn't mean it doesn't exist. In chapter 8, when we talk about mappings and sets, we will see a data type that behaves in many ways like the list type but is unordered. Now you may ask: why would I ever use something that is unordered if I can use something else that is ordered? We will see then that by giving up order, some optimizations can be done in memory, which makes some operations faster. There is always a trade-off here. Sometimes we really need order, and sometimes we don't, and when we don't need it, we may want to use a data type that is not ordered. The string data type, however, is ordered, and because it's ordered, we can index. Without order, we couldn't index, because we wouldn't know what the first element is to begin with.

Before we get to a second property, let's first look at some other examples. Let's get the second character: we index with 1. And now let's get to the end of the string. If I want to obtain the last element in a string, I have to index with the length of the sequence, the length of text here, minus one. We saw that text has a length of 27, so to obtain the last character I index with the number 26, and I get back the dot at the end. What I must not do, what will not work, is index with 27. This will raise an IndexError because we are obviously one index
too high. And so there is a second property in use behind the scenes, which is the property of the string being finite, because only if something is finite can we index from the right-hand side, from the end. And how can we index from the end? I make this a little bit smaller for now. Here we see the string again in the table: we see the indexes of the individual characters in the uppermost row, and in the middle row the reverse indices. We see that the last character, the dot, has an index of 26, but it also has an index of negative one. This only works because the string data type is finite. What that means is that, to get the last character, I can index with just negative one instead of 26. I will zoom in a little bit again; now we don't see all the numbers here, but that's not so important, you get the idea. So we can also index from the right-hand side, and this again only works because of the finiteness. To get the first letter, we can also index with negative 27, and I am back at L. So why is that useful to have? Well, sometimes when you read in data, let's say numerical data in a list or textual data in a string, you only know the relative position of something. Maybe you know that whatever you are looking for is in the 10th position from the end, and you don't know how long the text will be. By allowing negative indices, we can always index from the end without even knowing how long the actual data string will be, and this enables a nicer way of working with sequence data.
Then let's generalize indexing a bit, and this is called slicing. So what is slicing? Slicing means that out of a longer string we get a shorter one. For example, if I want to extract the first five letters from the text string, I write text, and then I use the index operator, and instead of passing in a single integer I pass in two integers separated by a colon. So this means: get me the letters, the characters from
the first one to the fifth one. However, as we saw before with the range built-in, the left index, the zero, is included in the result and the right index is not. So we get back the first word. And why is the five not included? This has many reasons, but one nice property of the upper index not being included is that when I calculate the difference of the right index minus the left one, which is five, I know the length of the resulting text string. Of course, I could also slice from somewhere in the middle to the end. How would I do that? As I know that the right index is not included, I slice in this case from the 13th character, so really from index 12, to the length of the text. The length of the text is not a valid index itself, but because the right-hand side is not included, this still works. So now I get back the second half of the sentence, dolor sit amet, and we also see the dot, so the last character is indeed included.
Whenever we start a slice with zero, or the upper index is the length of the string, we can also just leave it out. Colon five just means take the first five characters of the string; in other words, the default for the left index, if omitted, is zero. In the same way, if the upper index is the length of the string, we can omit it, and 12 colon means start at the 13th character and give me the string until the end. So I get back the same result here. And again, once you start working with textual data or sequence data, you may not know how long a text string is. For example, think of loading in user input with the help of the input function: you don't know what the user will type in and how long it is. So if you only want to look at, let's say, the first five characters that the user entered, you just type this, or if you just want to look at the end of what the user entered, you type that, and then your code is kind of like
independent of how long the string actually is. So these are equivalent here, but in a real-world program, leaving out indices can actually be an added feature, because that means we don't have to know how long the string is. It enables a nicer way of working with data. And then of course we can combine positive and negative indices. Let's say we go from index 6 to index negative 10 to get ipsum dolor. Whenever you see the first index being a positive number and the second index being a negative number, there is a nicer way to interpret this: it just means throw away the first six characters, throw away the last 10 characters, and keep everything in the middle. So here we cannot calculate the length of the result by just subtracting the left-hand side from the right-hand side; this only works if both indices are positive. But if one of them is negative, what we get is a relative view on the text data.
Then one thing that you often see in code is that we just write a colon. What this does we can already predict: the left index will be 0 and the right index will be the length of text, because these are the default indices, so we get back the entire string. Now you may wonder: why would I ever do that? Well, this is usually done to get a copy. In other words, when you want to create a second object for whatever reason, you use the colon to take the so-called full slice, and then you get back a copy of the entire sequence. For text data this is not so important, actually, because strings cannot be changed anyway, but as we will see in chapter 7 on numerical sequences, this will become very important. With array data and data frames in chapter 9, this will be even more important, because one goal of my course is to prepare you to work with big amounts of data, and whenever we talk about big data, that means that we are necessarily constrained by our computer's memory. So you may end up being in a situation
where 80% of your computer's memory is occupied by only one matrix, for example, and it's not possible to get a copy. In this situation, you really have to know what your code does, and just writing a colon in the index operator will give you a full copy, which is sometimes not what you want. So I just want to illustrate it here so that you already know what the colon notation means.
Besides having a start and a stop index, we can also have a third index, which we call the step size. This means we can control if we want every element or, as in the example here, only every other character. So I take a slice from the beginning to the end, and I get back only every second character. Sometimes this is useful if you want to downsample your data, but then the assumption is that whatever your data is, it is distributed more or less by random chance. What I'm saying may not be so intuitive with text data, but again, just notice that texts are sequences, and whenever sequences contain, let's say, numerical data, all the operators that I'm presenting here work in the same way, and then these applications really make sense. And then of course we can reverse the order by having a negative step size, so this will just turn the string around. So that's indexing and slicing.
One note: numbers, as I said briefly in chapter 5, but also text here in chapter 6, have one property which is called immutability. What that means is that after you have created an object in memory, you cannot change it. We said that an object is a box with zeros and ones in it. Some boxes, as we have seen, we can actually change after we created them, and other boxes we cannot change after we created them. One box that we actually did change after we created it was a list. So with lists,
remember, it was already in chapter 1 where I replaced one element in a list. So list objects are obviously not immutable, but text strings are. So what does that mean? Practically, it means I cannot assign to an index. I cannot go ahead and replace the first letter in the lorem ipsum text here. If I do that, I get a TypeError, and it says that the str object does not support item assignment. I cannot assign to indices of strings, and I also cannot assign to slices. That's again just to say that once the lorem ipsum is created, we cannot replace any characters. The only thing we can really do is make a new string where we change some of the characters, but then we will have a second string object.
Okay, so now let's look at typical behaviors that we require a text string to have. The behaviors manifest themselves in the string methods. Methods are functions that are bound to objects, and for text data there are some very specific methods. For example, I have my lorem ipsum text here, and let's say I want to search for the letter a in there. So I just say text.find, and then I pass in the character a, and I get back 22. And what is 22? Well, 22 is just the index at which Python finds the character for the first time. And then, of course, if I look for a character that is not in the text string, I get back negative one. I don't get an error. You will see when you read the chapter that there is an alternative index method as well that works just like the find method but would actually raise an error here. So the methods behave a little bit differently in some situations, but the find method gives us back negative one to indicate that it didn't find a certain character. We can of course also look for entire substrings. So, for example, I can check if a word is
contained in a sentence, and then the 12 is just the index of the first occurrence. If I search for or, I get back one. Why one? Because at index one, the second position, there is an or. However, let's say I want to find another or. How can I do that? I can pass it a second argument called start. It doesn't take the argument, so what does it take? Okay, so here is text again, and let's look at some other methods.
Let's, for example, count individual characters. Let's count how many L's are in the text, and the answer is one. This may surprise you, because there is an uppercase L here and a second, lowercase letter l here, so there are actually two L characters in the text, and it didn't find both of them. Why not? Well, because Python is case sensitive. So if I want to truly count all the L's, what I have to do, and we saw this before, is lowercase the text first, and then I can count all the L's. Alternatively, I could of course also uppercase everything.
Okay, so let's look at another example to show something else that has to do with the fact that strings are immutable. I create a new variable example, which is the text random, and the text here is already lowercase. Now I go ahead and call .lower() on it, to lowercase a text that is already lowercase, and I save the result in a variable called lower. And now the question is: are example and lower the same object? The answer is False. However, if I ask the question whether the two texts have the same content, the answer is True. So here we see immutability at work, whenever a method presumably changes the value of the text. So what happens here in memory? Maybe we can fit it in here as well. I draw here a text string with the word random in it, and I assign that to example. If I now call the .lower() method on it, what really happens is that a second text string is created, which also has
the same value, and the name lower references it here. So that's to emphasize that when you call a method on a text string that returns a possibly modified version of it, what you get back here is a new object, and this is because the string type is immutable. In other words, because Python doesn't know in advance if it really has to change a character, it defaults to creating a new object, even though this would have been totally unnecessary here. We could have had a reference here from lower to random because that's the same text. But again, string methods return new objects instead of changing the existing one, because the objects are immutable: in the case where we would have to change a character, we are not allowed to do that. So this is why Python gives us back a new object.
Let's look at some other methods. A very common one is the split method. It looks at the text string and returns a list with the individual words. In this case, the split method uses white space, and of course a single space character is white space, to separate the individual words, and then we end up with a list that consists of the individual words. We see that the dot belongs to the word to which it is attached. So what can we do with it? Well, oftentimes we have a big text string that consists of many, many words, if you want to do, let's say, natural language processing, and you want to process every word in the sentence. So what can we do here? We can use .split() and loop over the result. Here I simply show you that by putting more space in between the individual words. This is just a toy example, of course, but you get the point: if I am given one big text string, let's say one big document of text, and I want to do something with the individual words in it, the dot
split method is a good way to get to the individual words. And there is the opposite method. Let's assume I am given a list with the words this, will, become, a, sentence in it. Then there is a join method that can join those individual words together. How does it work? Because the join method is a string method, we have to create a string first to call the method on, and this string is used as the delimiter that separates the words. So sentence is now one text string, which consists of the individual words. If I put a dash in here and do that again, then the words get glued together by the dash. So let's go back and do it with just a simple space again. Join and split are sometimes very useful. A common mistake by beginners is that they want to join together, let's say, a list of words, but then they accidentally forget, for example, the brackets of the list. If this were a list, it would do exactly what we did above, but if I accidentally join a text string, I get this. And why is that the case? Well, all the join method takes is an iterable of strings, and we have learned that a text string itself is an iterable. So if I pass in a list of strings, the elements that get joined together are the elements of the list, but if I accidentally pass in a text string, then the individual characters of the string get joined together, and this is usually not what we want. I just want to make you aware of a common beginner's mistake here.
So let's look at some more methods. For example, there is a replace method. The sentence as we just created it is this will become a sentence, and now that it is a sentence, I want to replace the will become with an is. How do I do that? I use the replace method. So now I get back a new text string, of course, where the words will become are replaced by
the one word is. And, that's also important: in the upper cell I called the replace method on the sentence object, and this gives me back an output where it says this is a sentence. But because strings are immutable, the original sentence object is not changed. That's also the reason why we have a return value here. We will see in a future chapter that some methods, in particular methods on mutable objects, behave differently. But because the string is immutable, replace returns a new string and leaves the already existing object, in this case sentence, untouched. It cannot touch it: we cannot change an object that is immutable.
Then we have also seen before, in chapter 4, when we processed some user input in the coin toss guessing example, that it is sometimes quite valuable to use the strip method on a text. Let's say a user enters something and doesn't see in the input prompt that they accidentally entered a space at the beginning or at the end. This often happens, and then we want to get rid of the spaces. How do we do that? With strip: .strip() just removes the surrounding white space. There are also two specialized methods, lstrip and rstrip, which only get rid of the white space on one of the two sides. But .strip() is commonly used when we deal with user input and have to make sure that it is clean, and then we just strip away unnecessary white space.
And then there are some utility methods that you may find valuable when you build a program that prints out a lot of intermediate results. For example, think that you do some calculations, and after every so many iterations of some big optimization you are running, you want to print intermediate results in a table. In a table, the different columns are usually aligned. So how can we do that with textual data? Well, I can call the ljust method, for example, to left
justify text. It takes an argument, here 40, and 40 means the entire string will be at least 40 characters long. Usually the number that is passed in is bigger than what we know the string will be; this one is something like 25 characters or so. So if I left justify the sentence with 40 characters, what Python does is add trailing white space so that the entire string has 40 characters; the rest is just filled with spaces. We can also right justify with rjust, so we can push the entire text string to the right. This is often done for numbers: whenever you want to print tabular data that consists of numbers, the numbers are usually right justified so that the decimals, counting from the right, are always at the same position in the table. I will actually show you an example of such an output later; this is just to show you that these methods exist.
And then of course there is another nice method that we can often use, especially with numerical data. Let's say I have the number 42.87 as text and I want to add as many leading zeros to the left side so that we have 10 characters displayed in total. You know that you cannot write leading zeros in a number in the source code, because a leading zero is used in the prefixes for the binary and the hexadecimal representations. So what I do here is call zfill with the number 10, and it adds leading zeros such that the entire string is exactly 10 characters wide. This is also often useful if you work with numerical data in a tabular form. So there is no reason for you, and I have seen this, to write code that manually adds a certain number of spaces to some string. You can always just use those three methods, and they are usually just fine to make strings ready for output in a table. Of course, we can also do that for negative numbers, and you see that it
then leaves away one zero and uses that place for the minus sign, so that the second version here also consists of only 10 characters. Okay, these were some common string methods. There are many more that you will find in the documentation.
Now let's look at another syntactical area. We saw the methods, and now we look at what strings do when they are connected with operators. For example, what happens if I take a slice, the first four characters of my lorem ipsum text, and I add it to the string hello? This may look weird in the beginning: can I add two strings? So far we have only seen two numbers being added. But of course I can also add two strings, because the plus operator is overloaded to mean string concatenation. String concatenation is when we have two strings and we glue them together, and then we get one new string, hello followed by the slice. In the same way, we can also multiply a string. Here I multiply five with the first 12 characters in the text, which means lorem ipsum. So whenever you are in a situation where you have to fill in, let's say, dummy text with lorem ipsum, you only have to write lorem ipsum once, and you can just multiply it and get as many lorem ipsums as you want. And it's also chained here: we first do the multiplication, and then we do the string concatenation with the plus as well. So the multiplication and the plus operators work with strings. The minus operator, of course, has no meaning for strings; it's not overloaded for them, so it won't work.
Another kind of operators that work with strings are the comparison operators that we first saw in chapter 3. For example, I can compare the string apple to the string banana, and what this does is basically answer the question: does apple come before banana? Now the problem is that the characters, as we will see later in this chapter, come in a weird ordering. But for now, if I execute this code cell, I get the
True, and basically this tells me that apple comes before banana in the alphabet. And why is that? Well, because the a comes before the b in the alphabet. How is string comparison done? Python does it in a pairwise fashion: it takes the first character of the left operand, the a, and compares it to the first character of the right operand. If they differ, the sorting rules apply: a comes before b, and because of that the entire string apple comes before the string banana. As long as the letters are the same, and we will see an example, the decision cannot be made, so the first pair of characters that differs makes the decision.
And then here's the thing that I just mentioned: what if I compare lowercase apple with uppercase Banana? According to our alphabet, apple should still come before Banana, but it doesn't. What you have to know is that the comparison operators do take a sort order of the characters into account, just as if you were to compare the number one to the number two: if you ask is one smaller than two, you also compare positions in an ordering. But for characters, this ordering works in a different way than the alphabet, and the way will become clear towards the end of this chapter. For now, you shouldn't compare strings where one string is uppercase and the other one is lowercase. An obvious solution is to always lowercase the strings before comparing; then apple still comes before banana.
To extend this, here are five German names, and I illustrate several things in this code cell. First, this is an example of operator chaining, so I can of course have more than one operator in a chain. This order is actually true, so if I execute the cell, I get back True. That means Mai, spelled with ai, is smaller than all of the four other names here, and Maier, the German last name with ai, is smaller than the remaining three, and so on. So how do the rules that I
just briefly mentioned above work in detail? Well, the first two strings here are compared: the first letters are the same, the second characters are the same, and the third characters are the same on both sides. Then, when one of the two operands is at its end, the shorter one is sorted before the longer one. That's a rule. If we look at the next two operands, the M and the a are the same on both operands, but the third characters, the i here and the y there, differ. Because the i comes before the y, Maier with the i comes before Mayer with the y. The same is true for the other two Meiers here. So this is how string comparison works: it works on a character-by-character basis, and the first pair that differs is the decisive one.
Okay, so that was operators, and now we have another topic. Usually, when you write code with textual data and you want to prepare it for output, you want to write a string in your source code that looks like a template, and then, whenever you load actual data into your program and process it, you want to fill in the template string with the actual data. So how can we do that in Python? The most modern way is called f-strings, or format strings. How does it work? Let's say I have two variables called name and time of day, with the values Alexander and morning, and I want to print out hello, then the name, and then good and the time of the day. How can I do that with a template? I prefix the string with an f, which is why it is called an f-string. But an f-string is, syntactically speaking, also just a string, so it's nothing really different here. Then, within curly braces, I just write the expression that I want to be evaluated, so here I just reference the variable. If I execute this, the string will say hello Alexander, good morning. And if I change, you
know, Alexander here to another name, then of course, if I execute the second code cell again, I will have a different name here. So this is one way in which you can write a template string in your source code that will then be filled in with data coming from some other source.
How can we do this for numbers? Let's say we have pi here to eight or nine decimals. Within the curly braces, I can write a colon and then .2f, and the .2f means: format this number with two decimals after the dot. This is why I see the output pi is 3.14. There is a so-called format specification mini-language that you can look up in the documentation, with many options to format your templates. You can then write strings in your source code that look somewhat like this and parameterize the output.
The second way of doing this is the format method. The format method is the most common one, I would say, and the f-string we saw before is the newest one and also the one that should be used going forward. But because the format method is also very common these days, we quickly look at it. How does this work? I write the string as a template just like before, but I leave the curly braces empty. Then I call the method format on it, and in the same order in which those curly braces are to be filled in, I pass in the variables, and I get back hello Alexander, good morning again. If I want to change the order, I can put indices in the curly braces. It's zero-based indexing, of course, so here the name goes second. This is what I can do if the order is a bit different. And then lastly, the format method also accepts keywords. Up here I pass in the arguments by position, as positional arguments, and down here I pass in the two arguments by keyword, by the keyword name and the keyword time, and the keywords name and time are written in the curly braces again, and this is
the third way of using the format method, and it also works. And then lastly, just to show you how you could also format a number: the colon .2f notation still works here with .format(), called with pi. When you read code on the internet, you will most likely see the format method a lot, and these are all the ways in which you can use it.
And then there's a third way, and it's actually an operator. It's the % operator, which we understood to be the modulo operator for modulo division. But in the context of strings, the % operator is overloaded to do string formatting. So what does this do? Here is again my sentence pi is, and now we say %.2f instead of colon .2f. But that % is not the operator; it's just part of the string. Next to the string, we have the operator %. Before, in the example of string concatenation, I had a plus there, so syntactically this is the same as if I wrote a plus here: it's just an operator. And then I pass in the variable or the expression that I want to be filled in, and this also works. That's the oldest way; I would call it the legacy way. Don't use it going forward, but maybe you will still see it around on Stack Overflow when you ask some questions. Of course, you can extend this to more than one variable, and all you need to do is wrap the individual expressions that you want to pass in within parentheses. The parentheses here are mandatory. And here we write %s, and %s means that Python formats the values of those two variables as strings. But again, you shouldn't write code like in those two cells on your own. I just mention this here because you will definitely find code like this on the internet when you look for answers to your questions on, for example, Stack Overflow.
Okay, and now let's come to another big topic, and the question is: we have
seen what strings are, and we have understood that strings are basically a collection of characters. Now I want to focus a little bit more on what a character is, on an individual character, and on how characters are actually modeled in memory. So let's look at some examples. I haven't talked about this, but if you go back in the lecture to the beginning of this chapter, you will see that some of the strings that I showed you ended with a backslash n. And here, in this string, we have several of those backslash n's. So what does the backslash n do? Well, if I evaluate this expression, it doesn't do anything. The only thing that happens is that the double quotes become single quotes, but the backslash n remains in there. I think you can already guess what the backslash n does: it stands for new line. And if I print out exactly this expression, the line breaks are actually shown. So the string stores the backslash n as one character: even though we type it as two characters, for Python this counts as only one. And when we print out the entire string, this character gets printed in the way that we are used to, namely by going to a new line. So only when we use print do we see the effect. So what is the backslash here? Well, the backslash is again the escape character. Whenever we see a backslash in a string, this means that the following character will be interpreted in a different way, namely in a way that usually does something that we couldn't show otherwise. Usually it's a so-called unprintable character. How would you type a new line up here? You couldn't do that, so to express the idea of a new line, we use backslash n. In total, the backslash and the following character together are referred to as an escape sequence.
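As a minimal sketch of the new line behavior described here, the following cell contrasts evaluating a string with printing it. The variable name verse and its text are assumptions made up for this illustration; they are not the string shown in the lecture.

```python
# Hypothetical example string; not the one shown in the lecture.
verse = "roses are red\nviolets are blue"

# The representation of the string keeps the \n visible.
print(repr(verse))

# Printing the string renders the new line character as a real line break.
print(verse)

# We type two characters (backslash and n), but Python stores only one.
newline_length = len("\n")
print(newline_length)

# Splitting on the new line character gives back the individual lines.
lines = verse.split("\n")
print(lines)
```

The representation keeps the backslash n visible, print turns it into a line break, and len confirms that it counts as a single character.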
Okay, let's look at some more. Backslash n is new line, and here I have backslash b. So what does backslash b do? Well, backslash b is a character that moves the cursor one position to the left. So I print out a b c, then I move the cursor one to the left, and I print the next character, x. What does that result in? It results in the output a b x, so the c is gone. The c is not really gone: the c was actually written, but because we use the so-called backspace character, backslash b, we move back one position and then write an x over the c that existed. This is why we only see the x. So if you ever wondered, when you used some tools in the terminal window, why the cursor sometimes does weird things or moves in a way that is kind of unintuitive, then usually the programmer does that by using special characters like the backspace here. You can use that to move a cursor in an output in a terminal window, but you can also do this here.
And then of course there is another one, backslash r, and this means we go back to the beginning of the line. It's called the carriage return. So I print a b c, go back to the beginning, and print x, and I get x b c here. In the same way, if I write x y here and run the cell again, I get x y c. So sometimes, when you read in data from a file and the file contains the backslash r character, you may read in content that does not get displayed. You have to know that these kinds of special characters exist, but they are definitely not part of day-to-day programming, I would say.
One more example, and this is maybe something that you can actually use if you want to output some data in a tabular fashion: you can use backslash t, which is the tab character, so we jump in tabs here. As you can see, between the words we only have one character. It's not many spaces; it's only one character, the so-called tab character, and this
may be useful again if you want to display some intermediate results of some long-running calculations that you do, and you want to do that in a nice tabular way; then maybe the tab character is helpful here. Just a quick aside: raw strings. So let's say I want to print out the typical path where, on a Windows system, an application is stored or installed. So let's say I want to print out C, colon, backslash, new application. What happens? Well, unfortunately I have a line break here because of the backslash n. And no, the backslash P doesn't mean anything, so backslash P is okay, but the backslash n results in a line break, and we don't really want that here, right? We want to print out the entire path to the file, but we don't want to have a line break. In the same way, let's say I want to print out some other path on a Windows system that lies under C, Users, Administrator, and so on. Let's try to print this, and now I all of a sudden get an error. And why do I get an error? Well, I get an error because backslash uppercase U means that the following characters, and we will see an example soon, will be what is called a special Unicode character, and Python is not able to interpret the s, e, r, s that follows the backslash U as hexadecimal digits. So because of that, it actually gives us a syntax error, as we see. So we don't want that. So what could you do if you really wanted to print out something that has a backslash n or backslash U in it? Well, what you could simply do is you could just escape all the backslashes, so all the backslashes now become two backslashes. And if I do that, then we get the output that we want. So that's one way. But there is another way, and this way is to prefix the opening double quote here with an r. And the r stands for raw, as in raw string. And what the r does is it basically is an instruction to Python that whatever is inside the double quotes will be given out just as you typed it. So the backslash n will not be a new line character here, but it will just be backslash n, and in the
same way the backslash U here won't cause any trouble; we will just give out the path as we wanted to. Okay, so now let's ask the question: what is a character? We have seen characters in many ways. Most of the characters just work as we type them on a keyboard, and then we have seen some special characters like backslash n or some other escaped characters. So escaped characters have a special meaning when printed out. But what are characters? Okay, characters in a system called ASCII are nothing but numbers. So for example, the character uppercase A is mapped to the number 65. So you may wonder why that is the case. You have to think like this: just as an integer number, as we saw, consists of the individual zeros and ones in binary, we now go one level higher, so we add one level of abstraction, so to say. And just as integer numbers are made up of individual bits, a text is now made up of individual numbers, one for each character. So we just move one level higher here. And so you may wonder: why is it the number 65? Is there any reason for that? And the reason is that, well, in the early days, like in the 1960s, some people needed to standardize which number is mapped to which character. Many standards were created around the world, and the one that survived was the so-called ASCII standard. And ASCII is an American standard, so it only works for American letters, or English letters, and in many other countries around the world other character sets, or other ways of encoding, were created. But then the question is: what are the implications of that? So the implication is that if someone maps the number 65 to the letter A and someone else uses another number for this letter, then if we transfer files, whatever one person writes cannot be read by the other person, because the letters just don't match up. So this has to be standardized, and the ASCII standard is obviously the one that survived. And we can also go the other way. So ord and chr,
character, these two functions are built-in functions that we can call. To the ord function we can give any string, but it must be an individual character, and then we are given back the integer number that this character is mapped to. And the other way around, chr of 65: so the number 65 is mapped to the letter A. And then of course the special characters that we saw, like new line, are also mapped to a number, the number 10 here, and also vice versa. So all the characters that we have seen are indeed mapped to a single number, and it is standardized according to the ASCII standard. So what is the ASCII standard? The ASCII standard works like this. The digits, and this is now an important distinction, the digits 0 to 9 that we know are mapped to the numbers 48 through 57. Now you may wonder: why is that? Well, there is a simple trick. The trick is like this: we don't want to use more bits than necessary. So maybe, and I go back here, we don't want to have a large number of bits to model individual characters. So let's say we want to use only 8 bits, or 1 byte. Why do we want to do that?
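Before answering that, here is a small sketch of the mapping just described, using only the built-in ord and chr functions. The variable names are made up for illustration and are not taken from the lecture's notebook.

```python
# ord() takes a string with exactly one character and returns its number.
code_of_a = ord("A")

# chr() goes the other way, from the number back to the character.
char_of_65 = chr(65)

# Special characters such as the newline are mapped to numbers as well.
code_of_newline = ord("\n")

# The digits 0 to 9 occupy one contiguous block of numbers.
digit_codes = [ord(digit) for digit in "0123456789"]

print(code_of_a, char_of_65, code_of_newline)
print(digit_codes)
```

Passing a longer string to ord raises a TypeError, which is why it must be an individual character.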
well, we want to do that because otherwise we would maybe waste memory. Now the question is: how many characters can we model with 8 bits? And the answer is, 8 bits or 1 byte can express the numbers between 0 and 255, so we can express at most 256 different characters. And in the early days people had to have lookup tables where they had to look up what an individual sequence of 0s and 1s means character-wise: which number does it express, and which character is it mapped to? So the people in the early days were rather smart, or in other words, they were rather lazy. They didn't want to memorize a whole lot, so they tried to use some tricks. And one of the tricks is that the first 32 numbers, so 0 through 31, are mapped to what are called control characters. And what are typical control characters? Well, for example, the new line or the backspace. You get the point: these are all characters that we don't see on the screen; however, they need to be mapped somehow. And so the people just mapped them to the first 32 integer numbers. And then, starting from the number 32 onwards, they wanted to be smart, and in particular they wanted to be lazy. So why did they map the digit 0 to the number 48? Well, if we look here at the binary representation of the number 48, then we see that the last 4 bits are all 0s. We can basically read this as the number 0. And then for the digit 1, in the last 4 bits we just have a 1. So these are exactly the same ones and zeros that we saw in chapter 5. And this goes on until the digit 9, and the digit 9 is of course the binary sequence 1, 0, 0, 1. So in other words, in order to express 10 different digits, you need at least 4 bits. And then of course they had to fit this into the system somehow, and because they had decided that the first 32 characters are the control characters, they had to find a number after that which would basically allow a 1-to-1 translation of
the bits here into real numbers. So that's why we start at the number 48 here. So in other words, the numbers 48 through 57 are mapped to the digits 0 through 9 because of this binary system. Okay, so here again it's just a simple for loop, and I print out the binary representation just to make this point. And this is just to show you the reasoning behind it. Let's take the next step. Now I want to show you the uppercase characters from A to Z. So the alphabet has 26 characters; in other words, the character A will be the first character, and we already see where this is going. So if you look at the, what is it, I think it's the 5, yeah, the 5 least significant bits in the binary representation, 0, 0, 0, 0, 1, this is a 1. In other words, this would be the number 1 as we learned it in chapter 5, and because this means the number 1, we map it to the letter A, because it's the first letter in the alphabet. And then the letter Z here is the 26th character in the alphabet, and the sequence 1, 1, 0, 1, 0 is the number 26 in binary. So this is why the numbers 65 through 90 are mapped to A through Z. And then one last step, and here you see it for the lowercase characters. The only thing that is different for the lowercase characters is that they are prefixed with two 1s here, and the uppercase letters were prefixed with 1, 0 here. But the 1, 1 here only means that we basically just add a constant, namely 32, to the number. So we shift all the lowercase letters up by that constant, in a way, and then a is still the first letter of the alphabet, which means the 5 bits here are still the number 1, and so on. And then note here how I used the rjust, the right-justify, method to make the output nice. So the 97 here, the 98 here, and the 99 here are nicely adjusted to the right with the rjust method and a value of 3 here. So this is what I meant earlier when I said that the rjust and the ljust and the zfill methods are commonly used to make tabular output in your
program look nicer. This is an example of that. So let me just write this down: the numbers 97 through 122, the z is omitted here because my output is a little bit shortened, will be mapped to the letters a through z, lowercase. And now the question is: where do we put all the remaining symbols, like a dot, comma, semicolon, and so on? They are put in the holes, so to say, of the system. So from numbers 32 through 47 we have symbols, I call it symbols 1 here, and then starting from 58, so 58 here through, what is it, 64, this is the second part of the symbols, and then we have more symbols here, and then more symbols here. So the dot character and so on are just put in between, in the holes, so to say. And this is why I have the output here, and here I use rjust actually twice. So I adjust here first the number the symbol is mapped to, and then I also right-adjust the binary representation, because as we see, the binary representation gets longer for the higher symbols as well. So this is a way to make the output look nice. So what's the learning of this? The learning is that the entire world now has its first 128 symbols, or 7 bits of information, mapped to the same characters, and they are standardized, and it goes back to some people being lazy and figuring out a smart way of doing this. But then the question is: what do we do with all the other characters? For example, I am a native German speaker, so in Germany we have umlauts. If you go to Scandinavia, they also have special characters. If you go to the Czech Republic, they also have special characters. Some countries, like China for example, have a language with symbols that are not letters as we know them; although there is of course simplified Chinese, the original Chinese cannot be expressed by just 26 letters, or twice as many, and so on. So what happened? Well, in the beginning all these countries that had different languages had their own standards, and that is not good. In a global world you
want to exchange information between all the nations and people that speak different languages, and we want to have a unified standard of how we can express text in a computer program. And this standard was invented around the 90s: in the early 90s the first version of what is called Unicode was released. So what is Unicode? Unicode means that, basically, any character that ever existed in the history of humankind is assigned a number, and this number is called its code point. And the number is a number between, you know, some hexadecimal digits, starting from zero and going to roughly 1 million, just over 1.1 million actually. So that's the Unicode standard. So that's it, basically: just as we had the numbers from 0 to 127 mapped to the ASCII characters, we now have more than 1 million numbers that can be mapped to more letters. And the Unicode standard by now covers the languages that are spoken around the world, plus more languages, like Esperanto for example, which was a language that was made up for everybody to learn, so that we all speak this language. It never got popularized, but it still exists. Some of you know some science fiction movies, like for example Star Trek, and there is an alien race called the Klingons, and they also have a language, and some nerds have even tried to get its script into the Unicode standard, although so far it has not been officially accepted. And of course we also have other languages that we know, for example mathematical symbols. So in math we have our own notation, and the symbols of this notation are also considered characters, and they are also mapped to a Unicode code point. And because of that we can express any character we want, if only we know what its code point is. And in the chapter you will find links to some of these tables. And for example the smiley characters are also mapped in Unicode, and here you see what a Unicode character looks like. We start with backslash U, that's an escape sequence. So backslash and uppercase U means that the following 8 digits are
interpreted as hexadecimal, and this is one of the reasons why I put a little bit more emphasis on hexadecimal in the last chapter. So even though it seems a bit awkward to study at first for, let's say, a data science practitioner, if we want to work with textual data and we want to express letters other than the American letters, then we should at least know what hexadecimals are. And in some of the Unicode tables that you will find on the internet you can look up a character, for example the smiley character, and then the character will be mapped to exactly one number, called the code point, and this number exists of course in decimal notation but also in hexadecimal notation. And you need the hexadecimal notation in 8 digits, and if you put this here, then we get a smiley. Luckily, all the Unicode characters also have a name by which they are uniquely known. For example, another smiley is named face with tears of joy, and we can use its name, put it in between curly braces, and precede that with a backslash and an uppercase N. And backslash uppercase N is also a Unicode escape, but it means that now we use the name. So if I evaluate this expression, I get back a different smiley here. So again, for any character that has ever existed in the history of humankind, you just have to find out either its name or its number. And then, to go back to the normal characters, like for example the character A: we know that the character A is mapped to the number 65 in the ASCII standard, and the hexadecimal representation of 65 is just 0x41. And because we need 8 digits here, we have to write those leading 0s and then end with 41, with backslash uppercase U in front of it. And because of that we get back a letter that we could also type rather easily on the keyboard. But let's say we didn't have a keyboard with an A on it, or the A key doesn't work: we can always write the letter A like this. And whenever a Unicode character can be expressed in just 4 digits, what we could also do is we could just write
backslash lowercase u and just put 4 digits, so 0, 0, 4, 1 here, to also get the A. And then for those letters that can be expressed with the numbers from 0 through 255, which means 1 byte of information, we can also just write backslash x, for hexadecimal, and two digits, or a byte basically. This will also be an A. So at the end of the day, everything can be mapped, and depending on how high the code point number is, we either use the uppercase U or the lowercase u or just the x. And also note that the first 128 symbols in the Unicode standard are identical to the ASCII code. So whenever you can express something in ASCII, then you could just as well type A; but ASCII, as the predominant standard, is just the smallest set within the Unicode standard. Okay, so what does that mean for the string type? Well, if I ask the question, how long is the text that contains only the letter A, the len function gives me back a 1. But now the question is, let's say I print a Unicode character, the snake, as in the Python snake: how long is that character? So if I call the len function with this character, I also get back 1. So in other words, Python is smart enough to figure out that all these characters here, the backslash N and the name of the snake, basically mean only one character. At least it means one character to us as human beings, that's the semantic meaning that is important, and the len function with the string type is aware of that. So the len function always gives us back the number of characters in the string and not the number of actual letters we had to write. In other words, Python is aware of Unicode. Okay, now that we've covered what Unicode is, let's just quickly cover how you can write text strings in your source file that span more than one line of code. So for example, you cannot do this: if I have a text string like this, it works, but if I want to break this text string into several lines in the source code, then I get a syntax error. I cannot do that. So there's a solution to
that: triple quotes. Opening triple quotes here and closing triple quotes here, they work just like single quotes, or single double quotes, I would say. However, within triple quotes we can break the line, and what happens is that the line breaks are stored in the string. And actually, just to illustrate this point, if I output the multi-line string just as it is, then it's of course only one expression, and the line breaks were translated into backslash n. This is what this does here. So in other words, whenever we want to make our source code a little bit nicer, in particular if you want to write text that goes beyond the 80 characters per line that we don't want to exceed, then you can always use triple quotes, or triple double quotes, and just break the line. It would also work with single quotes. So you could also use, let's do that, triple single quotes, one, two, three, and this would also work of course. But again, the convention in this book is that we use triple double quotes here. And this is a nice thing to know. And of course you may remember that the functions we wrote we always documented with docstrings, and docstrings are nothing but normal Python strings just written with triple quotes, because docstrings are usually written over several lines of code. And now I can do this: I can loop over the multi-line string, and I can see that this string indeed consists of four lines. So you may wonder: why are there four? Well, we have a line break here, and then we have a line break here, and we have a line break here. We have three line breaks, so we must have four lines. So this is why the first and the last line are empty. We need to be careful with that. And how can you change that? For example, by just adding the strip method here. If you add the strip method here, then what happens is that the empty lines go away. So this is why you will often see the strip method in source code, especially when we use multi-line strings. So this is nothing new; this does not allow us to do anything new, it just
allows us to write the source code in a maybe nicer way. And now, finally, this is the last section in this chapter today: we will talk about the bytes type. So what are bytes? Well, I already told you that in a computer everything is ones and zeros. We call those bits, and a group of eight ones and zeros is called one byte. And one byte is usually the unit of account in which computers communicate. So if I go on the internet in the web browser and I go to some website, then what happens is my laptop will basically make a request to a web server, and this request is sent in pure bytes over the network. And all the routers along the way that make my request go to the web server also think in bytes. And then the web server receives the bytes, and it has to decode the bytes. So a byte is really nothing fancy: it's just a collection of eight bits, and it's just a unit of account. So just as, you know, if you are a large corporation, you report your balance sheet in millions of dollars or millions of euros; you don't report it in just euros or dollars. So why do we do that? Because we are just used to having a larger unit of account for large corporations. And just as this is a unit of account, bytes are just a unit of account too. So let's see what bytes usually look like. So here I have a file, it's called fullhouse.bin, bin for binary, and I open it in my text editor on my laptop, in Sublime Text. And when I open it, I see a weird thing: I always see groups of four, and only some letters and some numbers. I don't really see any other information here. And now if we look closely, we will see that it's always the digits from 0 to 9 and the letters from A to F. So obviously what we see here in this text editor is hexadecimal notation. And hexadecimal, I told you, is just a nice way to express many bits, because one hexadecimal digit has the same expressive power as four bits. So we basically have a shorter way of expressing the ones and zeros.
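As a small illustration of that last point, the sketch below takes one byte, written as the hexadecimal f0, and shows that each of its two hexadecimal digits stands for one group of four bits. The variable names are assumptions for illustration and not part of the lecture's notebook.

```python
# One byte, written as a hexadecimal integer literal.
byte_value = 0xF0

# The same value as 8 binary digits and as 2 hexadecimal digits.
bits = format(byte_value, "08b")
hex_digits = format(byte_value, "02x")

# Split the 8 bits into two groups of four and convert each group
# back into a single hexadecimal digit.
groups = [bits[:4], bits[4:]]
digits_from_groups = [format(int(group, 2), "x") for group in groups]

print(byte_value, bits, hex_digits)
print(groups, digits_from_groups)
```

Two hexadecimal digits therefore describe exactly one byte, which is why a hex view of a file is so much shorter than writing out all the ones and zeros.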
and now I want to open this file in Python. How do I do that? I use the open built-in, and I specify the path, and now I specify the mode as rb: r means I want to only read this file, and b means binary. So I'm basically telling Python, hey, I know that this file is not a text file, it's a file that consists of bytes, and I want to open this file. And then what I do is I call the read method on the binary file here, and read without a number specified will read the entire file. So what I do here is I just read the entire file and put it into a variable called data. So now let's look at data. What is data? It has a type, and the type is bytes. And now what's the value of data? The value is this. And if we go back to the text editor, the first two bytes are f0 and 9f, and if I go back into Python, it is f0 and 9f here, and they are preceded by backslash x. And backslash x, we learned, is an escape sequence to indicate that the following two digits are hexadecimal. So this is now the data that we got, and this could be data that we are sent over some network, or this could be data that we got from opening an image file, for example. So JPEG or other image files usually contain only binary data, so we cannot meaningfully open a JPEG file, or an MP3 file for sound data, in a text editor. It doesn't work, and the reason is that image data and sound data are usually stored as raw bytes on disk, not as text. And if I read in such a file, I would see something similar to this, so I would just see the raw bytes that lie in this file. And now, whenever you are given just bytes, well, you can actually look at them; you have to analyze them. And bytes are basically also a sequence, so they have a length. I can loop over the data here, so I can loop over the individual bytes, and note here that by looping over the individual bytes they are given out as integers, and the integers are always between 0 and 255. So if I go back, the first byte
here is the hexadecimal f0. So let's go ahead and check what f0 is. We just write it as hexadecimal f0, and f0 is just the number 240 in hexadecimal. So by looping over the data, the bytes object, I obtain the individual bytes on a one-by-one basis, and an individual byte is nothing but a number between 0 and 255. And of course I can also index from the end, so I can get the last byte, and the last byte is just the number 158, which is the same as the hexadecimal 9e here. Okay, but this is binary data, and I can of course also slice it, I can take only every other byte, and so on, just as if we had an ordinary string object, except that I cannot read characters here, right? So what can I do with that? Well, I told you that every character is mapped to an integer, to some number, and the standard to know here is the so-called Unicode standard. And then the question is: how are the numbers that the characters are mapped to stored on disk? Well, they are basically stored in bytes, so that's why I have the bytes section here. And then whenever I have the raw bytes on disk, I have to decode them into numbers. So here again we have the numbers that are mapped to some characters, and on this side we would have the individual bytes that are decoded into those numbers. So in other words, the numbers that are here are in the bytes somehow. So how do we get to the numbers? Well, we can always try to decode the data. So whenever we have a bytes object, we just decode it with the decode method, and then we will see that what we get back, I stored it in a variable called cards, is a string. So in other words, Python was able to decode the bytes into a text string, and it seems to be like a deck of cards here, or let's say 5 cards that resemble what is called a full house. So that's why the file is called fullhouse.bin, because it contains the Unicode characters that make up those 5 cards here. So what did we learn here?
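Before summing up, here is a minimal sketch of these operations on a bytes object. Since the contents of fullhouse.bin are not reproduced here, the sketch builds its own bytes by encoding the snake character with UTF-8; that choice and the variable names are assumptions for illustration only.

```python
# Assumption: instead of reading fullhouse.bin, create some bytes
# by encoding a single Unicode character with UTF-8.
data = "\N{SNAKE}".encode("utf-8")

print(type(data))   # the bytes type
print(data)         # shown with backslash x escapes
print(len(data))    # number of bytes, not number of characters

# Looping over a bytes object yields integers between 0 and 255.
values = [byte for byte in data]
print(values)

# Indexing and slicing work as with strings.
first, last = data[0], data[-1]
every_other = data[::2]
print(first, last, every_other)

# Decoding with the same encoding gives back a string of one character.
text = data.decode("utf-8")
print(len(text))
```

The same four bytes count as one character once they are decoded, which is the difference between the bytes view and the string view of the data.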
well, I only want to make you aware that the bytes format is a very generic way of exchanging data, and if you want to go into data science, you will one way or the other encounter data that is stored in a bytes format. Most notably, of course, binary data would be images or video files or audio files, and then we have to decode the bytes into, usually, numbers. And when we know that the bytes that we are given are actually textual data, then we can try to decode them into, let's say, the Unicode numbers, and once we get them back as Unicode numbers, we can immediately tell which characters they are. Okay, and we can also go the other way. So let's say I have a place where I want to drink coffee, which is called Café Kastanientörtchen, this really exists by the way, and let's say I want to go the other way and make the text string a bytes object. I can of course call the encode method on the string, and I get back the following sequence of bytes. And what I see here is the following: the first three characters, which are C, a, and f, which are characters that are in the ASCII set, are just encoded as themselves, in a way. So it just says C, a, f here, and then the e with the accent on it is encoded as 2 bytes, shown as 2 hexadecimal escapes. And this is because, as I told you, all the ASCII characters can be expressed with numbers from 0 through 127, which means 7 bits of information, or 8 bits if we just give it one more bit, and that means every character that is an ASCII character can be expressed basically as itself, in one byte. But then all the other Unicode characters, which are not in ASCII, must be stored on disk in more than one byte, and in particular the e with the accent on it is stored as 2 bytes of data. And this is how a computer thinks of it: for a computer, the letters a through z in the American alphabet take 8 bits, and other letters are usually stored with more than 8 bits, in particular in this case with 16 bits or 2 bytes. So why would that be important? Well, oftentimes still
to this day you are given data that is not encoded with Unicode, or rather with an encoding called UTF-8. So for example, I have this other standard here: let's say I encode place with this other encoding here. I get back a different result: the letters a through z are encoded in the same way as before, and just the special characters here, the German umlauts and so on, are encoded in a different way. And you basically have to know the encoding that was used for the bytes data, otherwise you cannot really make sense of the bytes. And the default encoding that we use is the UTF-8 encoding, which goes along with the Unicode standard. And one remark here as well: whenever we encode something with a given standard, then we also have to decode it with the same encoding. So for example, if I leave out the argument of decode, this is as if I had said utf-8 here, and I get of course an error here. But if I decode the encoded bytes with the same encoding, it always works, so I get back the correct letters here. Okay, and let's do the error again. And then of course some of the letters cannot be encoded with a certain encoding. So for example, this here is the Czech language, and the Czech language also has some special characters, and I try to encode it with the ISO standard here, with the iso-8859-1 standard. And iso-8859-1 is the Western European standard, and because the Czech language is not considered a Western European language, we get an encode error here. So note that up here we have a decode error, and down here we have an encode error. So that means those special characters in the Czech language cannot be encoded with this encoding here. So I didn't go into too much detail here with encodings, because that is something that we don't want to deal with. However, sometimes we get files that are in an encoding other than the so-called UTF-8 encoding. And so what does that mean? So let's look at an example, a file called umlauts.txt. I opened this in my text editor and I
opened this file, and it seems as if something went wrong here, because I have many question marks in here. And of course there is something going wrong here. So if I open the file umlauts.txt in Python, I get a decode error, so we cannot really read the file. And the reason is that we chose the wrong encoding. And usually this means we have to ask the person that sent us the file: what did you encode the file with? And if we know the correct encoding, we can specify the encoding in the open function, and then all of a sudden it works. And then we see that these are all the German umlauts that got lost in the version before, and now we can read it, just by knowing the correct encoding. Okay, and the best practice is to use an encoding called UTF-8, and this is basically the encoding that usually gets used anyway if we don't specify it, but the best practice is to always set the encoding to utf-8 explicitly, because that is the encoding that is most commonly used, and this encoding has the property that it can actually encode all the Unicode characters. So we have seen before, with the Czech language and also the German umlauts, that some encodings cannot deal with all the letters, but the UTF-8 encoding can actually deal with all the letters, and that is why we should usually use it if we can. And if we are given data that doesn't work with that, maybe we should first ask the source that gives us the data to encode the data with UTF-8, because that makes life a lot easier. And here, to finish the chapter today, we have many things together at once. So let's say I go back to my original example of the lorem ipsum text file, and I want to read the entire contents into one variable. This is often what you want to do: you want to read a file into one variable. How do you do that?
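One way to do that, and to reproduce the decode error from before, is sketched below. The file name umlauts_demo.txt, the sample words, and the temporary folder are all assumptions for illustration; the lecture's own files are not used.

```python
import os
import tempfile

# Hypothetical sample text with German umlauts (an assumption).
sample = "Käse\nÖl\nÜbung\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "umlauts_demo.txt")

    # Simulate a file that someone saved in a non-UTF-8 encoding.
    with open(path, "w", encoding="iso-8859-1") as file:
        file.write(sample)

    # Reading it as UTF-8 fails with a decode error.
    try:
        with open(path, encoding="utf-8") as file:
            file.read()
        wrong_encoding_failed = False
    except UnicodeDecodeError:
        wrong_encoding_failed = True

    # With the correct encoding, the whole file ends up in one string.
    with open(path, encoding="iso-8859-1") as file:
        content = "".join(file.readlines())

print(wrong_encoding_failed)
print(content)
```

For a file that is known to be UTF-8, the last with block would simply use encoding equal to utf-8 instead.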
well, first, we always do that with the context manager, to make sure that the file always gets closed. And then the best practice is, as I said, to always specify encoding equal to utf-8. On Linux and Mac machines this shouldn't make a difference, but oftentimes when you run this code on a Windows machine and don't specify an encoding, things can go wrong. This is due to the fact that Windows picks a file's default encoding in a different way. So it is always better to specify the encoding explicitly. And then what I do here is I use the join method that we saw earlier, and I use the readlines method, plural, on the file object to read in the entire file at once. So we need to be careful here that the file is not too big, so that its contents actually fit into memory. And then, because readlines returns a list of strings, I join all the strings together into one string, and I store it in a variable called content. And then I can, for example, print out the content, and note that the new lines are respected here, so the new lines are still here. And if I just look at what content is, without printing it out, it would be just one string with all the new line characters in it. So this is just what the string would look like, and of course, because it's very long, we cannot really look at it, and it makes more sense to use the print function. Okay, so I want to quickly wrap up this lecture. So what did we see here, what are the takeaways? The string type is the default way of modeling text in Python; that's always what you should use. We saw that strings are sequences, and we will talk about sequences very extensively in chapter 7. And so what are sequences? These are any types in Python that have 4 properties. The 4 properties are that they are finite, so they have a limited number of things in them, and the things here are the characters; we can loop over them, that means they are iterable; we can loop over them in order or also
in reverse order; and, yeah, they are containers, so they contain other things, and the things that a string contains are of course the individual characters. So that's about strings. And then, because of all of that, we can do nice things with strings: indexing and slicing, and we have many string methods. We learned that every character in a string is a Unicode character. That's important because that means we can basically write anything we want, in any character from any language in the history of humankind. The downside of this is that, especially the more letters we use that are not in ASCII, when we want to save them somewhere, on disk for example, they have to be encoded, and the encoding means that the str object is encoded into a bytes object. And the bytes object is basically one way to model data that we get in a raw, not yet decoded form, so that is the standard that is used on the web. And then once we have decoded the bytes and we are given a string, we can just work with the string in terms of characters. And this is a level of abstraction that is nice for us as data scientists, because as data scientists we don't want to work with the bytes too much, but we should still know a little bit of how bytes work. And then what else have we seen? We have seen of course how to open files. I think that's a valuable thing to know: always make sure that you use the context manager here. And how to store files, or how to write to files, that's also very easy to do; however, you will find many tutorials on that on the internet as well. And I think this is it, and then see you soon.