Good morning, everybody. Today we are going to discuss string data. How do we handle string data? So far we have seen some string constants, which we use primarily in our output statements. To let the user know what input is expected and so on, we put some character strings in cout. But we have not yet figured out how we can actually store such character strings inside the C++ Dumbo's memory and handle and manipulate that data. So we shall look at string data: its input, its output and its manipulation. In the second part of this lecture, we shall spend some time on understanding the notion of time complexity of our algorithm or, in simple words, the cost of computation as measured in time. All of us presume that computers are very fast, and that it therefore does not matter how long they take to solve a problem, because we believe it will be a few microseconds, milliseconds or a few seconds. Unfortunately that is not the case. A whole lot depends upon how we write our program and how we design our algorithm. So we shall discuss that. A string is nothing but a sequence of characters. We are familiar with string constants: we say hello world in cout. We can have names. CS 101 is a string. We would like to take this string data as input into our variables so that we are able to manipulate it. Obviously the manipulation cannot mean the conventional arithmetic operations, because you generally do not think of adding strings, multiplying strings, dividing strings, etc. But there are meaningful operations which can be conducted on strings, and we shall see how these are constructed. There is one unfortunate problem. The traditional method of handling string data in the C++ language emanates from the older implementation of this feature in the C language. The C language, when it was designed, handled strings without providing any special data type.
A special data type means a specific understanding, just as we have integer type data, float type data, double type data. So we know, for example, that when we declare a variable to be an integer, it can house any value within a fixed range, up to 2 to the power 31 minus 1 or 2 to the power 16 minus 1 and so on, depending on the size of the integer. There is no such provision for a string data type in the basic C language. What therefore transpires is that there are no special operators either. For numerical data types we have operators like plus, minus, multiply, divide, modulus, etc. For strings there is no data type, and there are no operators. Strings used to be traditionally handled by putting each character of a string into an array location. So you effectively had an array in which you stored individual characters. But to mark where the string ends, you would put a null character at the end. It was an extremely artificial and contrived view of storing and manipulating strings. But at that time it was the only way. Since some of you have already studied programming, you might be tempted to think about strings in terms of the conventional C programming string. That is a no-no. At this stage we will not use the C type string handling in our programs at all. We are in fact going to use a feature which C++ introduced much later, through special extensions, as a part of its class library. We have not yet spoken about classes. We have not yet spoken about objects of a class. In due course we shall discuss that. In fact, this is the reason why C++ is called an object-oriented programming language. Suffice it to say at this stage that, as far as we are concerned, we shall consider objects as normal variables. There is therefore a notion of a string object, obtained through the C++ class libraries, from the special standard library called the string library. And that gives us the facility to treat a string object almost as a variable of the type string.
So for the time being we shall treat these objects as string variables, and we shall see how we can perform operations on these strings using familiar notation, like the familiar operators that we have for arithmetic operations, of course with an appropriate meaning. For example, there cannot be any multiplication of two strings, etc. But we shall see shortly what we can do. So here are some examples of declaring string variables and assigning values to such variables. Please note an important inclusion: just as you include iostream, which is standard, you have to include string. When you say include string, effectively our C++ Dumbo accesses the huge collection of class libraries which pertain to operations on strings. And therefore Dumbo becomes more knowledgeable. Basically Dumbo understands integer, float, double, etc. Once you say include string, he will also understand string variables. The way you declare string variables is exactly the way you declare integer or float or double variables. You say string and then you give any name: str1, str2, str3, x, y, z, abracadabra, whatever. These are variable names just like your other variable names. And inside these variables you can store not whole numbers, not fractional or decimal numbers, but strings. So for example when you say str1 = "computer", any subsequent usage of str1 will result in the value computer being used. Naturally, when you write string constants, you put a double quote here and a double quote here. But when the machine stores the string inside, these quotes are stripped off. They do not exist. So consequently the string which gets stored inside str1 is, for all practical purposes, c-o-m-p-u-t-e-r. Exactly these characters. I could say str2 = "programming", which means this character string constant is stored inside str2, which is another variable.
And here is a simple way of outputting: you just say cout << str1, followed by a blank, followed by str2. This will print out the value of str1, which is computer, followed by the one or two blanks that you have indicated here, followed by str2, which is programming. So you will get computer programming as output. I think this is a very clear and natural extension of our notion of variables. One of the important operations that you would like to perform on strings is to combine different strings. So for example I might have declared these three variables str1, str2, str3. In str1 I am storing the string hello world. Notice that the length of this string will be 1, 2, 3, 4, 5, which are the 5 characters in hello; 1, 2, 3, 4, 5, which are the 5 characters in world; and a blank character. Although it appears blank, it is a character. It has its own representation. So 5 plus 5 plus 1, that is 11, is the number of characters in this string, which will get stored in str1. When you declare an integer you can stuff in a 1 digit number such as 1, 2, 3, 4 or 5, a 2 digit number such as 31 or 48, a 3 digit number, a 4 digit number, up to a certain limit. In exactly the same fashion the string objects, or string variables, of C++ can store a variable number of characters. The point is that the variable treats the entire sequence of characters as a single value, which is the string value. So here I am assigning two different strings to str1 and str2. Now I can actually do an assignment operation, str3 = str1 + str2. Notice that I am using the plus symbol, which is actually an arithmetic operator, the addition operator. But when used in the context of strings it becomes a concatenation operator. You cannot add strings in the normal sense. What it means is that you take the first string, you take the second string, you juxtapose the second string against the first string, and the resultant long string that you get is the result value.
So consequently str1 + str2 will actually mean hello world how are you. So notice that when I say cout, str3 is, and then str3, I will get an output of this kind: str3 is hello worldhow are you. Notice a problem in the concatenation. While assigning the string hello world I had kept a blank in between so that it is readable. I did the same thing between the words of the second string, how are you. But when I concatenate, the two strings get placed exactly next to each other, which means there will be no blank between world and how. Now this is not exactly the kind of output that you want, and therefore you might either want to put a blank at the end of the first string, or at the beginning of the second string, or you might want to say str1 plus some string plus str2, where that string consists of blanks. So for example here I have put in a comma and a blank, and consequently when I concatenate I get hello world, how are you. All this is common sense. There is much more to the manipulation of strings; later on we shall understand how we can dissect a string. Suppose there is a string of 200 characters: how can we say, give me 10 characters starting from the 45th character? All these things become meaningful when you have to analyze text given as input. The arithmetic operator plus can also be used in an increment fashion, where we can say v += x, which means v = v + x. Exactly the same meaning is employed when you use it in the context of character strings. So if you have str1 and str2, where str1 is hello world and str2 is how are you, you can say str1 += str2. This will also concatenate str1 and str2, but instead of putting the result in a third string, as we were doing in the previous program, it will put the result in str1 itself. Consequently str1 will now change to the old str1 followed by str2, and you will get exactly the same result. How do I handle input output?
Well, there are multiple ways of handling input and output of character strings. The simplest is to continue to use cin and cout, which are part of iostream. In fact, the input and output operators are not cin and cout; these are actually file names, just like the text file of roll numbers and marks that we had. So cin is the name of a standard input file and cout is the name of a standard output file. The actual operation of putting something into the computer or getting something out of the computer is done by the >> or << symbols. The << symbol is called the insertion operator and the >> symbol is called the extraction operator. The files are like a continuous stream: there is an output stream into which you insert strings, and there is a continuous input stream from which you pick out, or extract, strings. That is exactly the meaning of these operators. So here for example when you say cin >> str1 >> str2 >> str3, three words, three strings, will be picked up from the input and they will be assigned as values to str1, str2 and str3. Consequently if I say cout, 1 colon, str1, 2 colon, str2, 3 colon, str3, then I will get the labels 1, 2, 3, each followed by the value of str1, str2, str3 which I have just read. The reason I am elaborating on this is to make you understand a peculiarity of the input. cin >> means standard input, which is what you type in at the keyboard. You will observe that when you say cin >> n you have to give a numerical value, and when you say cin >> x you might have to give a floating point value. Similarly when you say cin >> str you have to give a string value. So suppose I compile the string input program and execute ./a.out, and at the input I say how are you on a single line: how, blank, are, blank, you.
Now unfortunately the C++ Dumbo processor, when it encounters this line, dissects the line into what it presumes to be three different strings, because a blank or a line feed is supposed to start a new entity, whether it is a number or a string or whatever. Consequently you will get how as the value of the first string, are as the value of the second string and you as the value of the third string. There is no simple mechanism to say that how, blank, are, blank, you together is a single string which should go into a single variable. You can achieve that through assignment, or through concatenation later, but no simple input mechanism through cin >> exists. Of course there are mechanisms which we shall study later when we consider formatted input and output. So here is an example. If I re-execute the program and give only one word on each line, that is, I type how, press return, type are, press return, and so on, I will still get how are you. So whether I press return or put one or more blanks in between, that is considered to be a delimiter between two consecutive values; whether they are strings or numbers does not matter. But this applies strictly to cin with the >> operator. There are other mechanisms which we shall discuss later. Here is another example: I give how, blank, are on the first line, press return, then you, and then press return. I will still get how assigned to the first string, are assigned to the second string and you assigned to the third string. There is another important manipulation that you can do: you can compare character strings. You can check whether one character string is larger than another character string. Now when you are dealing with numbers it is very easy to understand numerical comparison: 10 is greater than 5, 127.28 is less than 238, etc. How do we define smaller or greater in the context of character strings?
For this, each character that is representable inside the machine has been allocated a unique internal number, and the characters are arranged in ascending or descending order of those numbers. This is called encoding: the characters are coded. There are two very standard codes which are used to represent individual characters inside the actual Dumbo, which we have not yet talked about; later on we shall examine the real memory and so on. Such an arrangement is called a collating sequence. In a collating sequence one character has a lesser numerical value than another character. Consequently, if at a given position in one string there is one character, and at the same position in another string there is another character, then the collating sequence order determines which one is smaller and which one is greater. The two standard encodings which are used very heavily are called ASCII and EBCDIC. ASCII is the American Standard Code for Information Interchange, and EBCDIC stands for Extended Binary Coded Decimal Interchange Code. EBCDIC is a nomenclature and a codification which was traditionally used by IBM machines; at one time IBM made 50% of all the computers made in the world. Now they make only a small percentage of all the computers in the world, and apart from their mainframe computers even they follow ASCII. So ASCII encoding is now more or less standard globally. A character in ASCII is stored in an 8 bit value, that is, a value between 0 and 255 (standard ASCII itself defines only the codes 0 to 127). We shall see the implications later. Suffice it to say that the collating sequence permits me to make comparisons between different character strings. So for example D will be greater than C, because all the letters of the alphabet are ordered in the collating sequence in the same order that we are familiar with. A is the smallest, B is next, C is next, etc. Q is less than Z.
Suppose you are comparing two strings. Let us say str1 is given the value mine, str2 is given the value yours, and you say: if str1 is greater than str2, str is equal to str1, otherwise str is equal to str2. So what is the value of str in this case? Which one is greater, mine or yours? Yours, because it starts with y, and y is greater than m. If the first characters exactly match, then the string comparison goes to the second character, and so on. Very logical. Just as you see alphabetical ordering in a dictionary, in exactly the same fashion the ordering determines the comparison. I have rewritten the program that you have seen, or will see, in your lab. We had discussed this last time. When we saw this program last time, we had data for roll numbers and marks. We had deliberately collected only integers, or rather numerical values, because we knew how to handle them. So we had declared arrays for storing up to 100 roll numbers and 100 marks, and we were reading data for n students, where the value of n is given or figured out. But now that we know how to handle character strings, it is possible to build a record which consists of roll number, name and marks. So that is what I have done. Apart from the roll number and marks arrays declared as int, I have now declared a string array: string studentNames[100]. It is exactly like int something[100] or [500] or whatever. There is no difference. All that it means is that I am reserving 100 locations to store 100 different strings, each one of which is accessible by using the name of this array and an appropriate index. Please note that while the value inside will be a character string, the index itself is a pure numerical value. So you manipulate the index with i, j, k, whatever, and you will get access to either this string or that string.
So if I continue with my program like last time, I will say give number of students, and I will give the number of students n as 10, 15, 17, whatever. After reading this, I carry on with a variety of other things. You already have this program. It is there on the web. You can look it up. All that I am indicating is that when I read data into the arrays for students 0 to n minus 1, one by one, I expect one record to be given on one line. Earlier I was giving only two values, roll number and marks. Now I will give roll number, name and marks. So notice this: myFile >> roll[i] >> studentNames[i] >> marks[i]. It expects three values as input for each execution of this statement. The first value should be the roll number, the second value should be the name, and the third value should be the marks. And I can output the same thing here. So here is what I will get if I put in so many roll numbers, names and marks. You will notice that the roll numbers and marks are very similar to what we had seen in the last example. What I have done is inserted names, which are name 1, name 2, name 3, name 4. I have given these artificial names so that I know that in the original list this was the first name, this was the second name, this was the third name. You know which name is associated with which roll number. This will matter subsequently, when we discuss sorting or rearranging arrays. For example I might want to arrange this entire list in descending order of marks. Suppose I want to allocate grades. While allocating grades, the roll numbers or names do not matter. It is only the marks that I wish to see in descending order. However, once I allocate grades, I have to give each grade to the corresponding student. So it will not do if I merely rearrange the marks and get the largest number, 90, up here. I must also get name 6 up here, and I must also get 1508 up here. All three things must move as a single component.
We are not discussing sorting now; this is just an example. So this is the rudimentary discussion that we have on strings. The sorting which I mentioned we shall discuss in the lecture on Monday. What I want to discuss now is another very important notion, that of computational time. We all presume that computers work very fast, very rapidly. Of course they are capable of executing millions of instructions in a second. But every computation does take a finite amount of time. In fact, within computation it is well known that multiplication takes longer than addition, for example, and so on. We shall see the exact intricacies of such details later in the course, when we study the hardware concepts. For now we need not understand all that. The only thing we need to understand is that any computational operation takes time. And if we make Dumbo do a whole lot of operations, Dumbo will take that much time. Consequently it is important that we are careful about how we write our program. We must not make Dumbo carry out unnecessary computations. The measure of how long it takes to execute a particular program, or in other words, if there are two programs doing similar things, which one is faster, is given by the order of magnitude of the time which is required to execute a program, and that is known as time complexity. At this juncture we are only introducing the term time complexity, without fully understanding its meaning. But what I have stated in this line is roughly the meaning. It is the order of magnitude of the execution time that a program will take. To illustrate that this is not a trivial issue, I have constructed an example. The outcome of this discussion should be that you are extremely careful when you write your own programs. So far you have been more or less executing programs that have been given to you, making small modifications.
But soon you shall be writing complete programs on your own, in which case it is absolutely essential that you keep the number of computations to the minimum possible. And to show how drastic the problems could be, I am going to show you an example. So here is an example where I want to estimate the value of pi. Pi is related to a circle. So if you look at this circle, let us say it is a unit circle: the radius is 1. If the radius is 1 then the area is pi r square, which is just pi. Consequently if I could estimate the area of this circle, I would know the value of pi. But how do I estimate the area of the circle? Well, one simple thing is to construct a square across these two radii. This is a unit radius, therefore this is a square of side 1. Now look at the shaded region, which is the area of the quarter circle. The area of the full square is known: it is 1 into 1, or whatever the length is, length into length, length square; that is the area of the square. And if I can estimate the shaded area, then I should be able to get the value of pi. Now here is a simple technique for doing it; this is called discretization. I arbitrarily choose a point within this square. A point may be here. Here is an example. This is one point. This point is inside the circle. This point is outside the circle. This point is inside the circle. This point is outside the circle. If I take a sufficient number of points across this entire square and count those which lie inside the circle, then I get an estimate of the area. Now how will I know whether a point is inside the circle or outside the circle? That is relatively simple. So assume that this is the circle. Instead of one unit I consider n units. Why n? Because I will say I have discretized this region into n different portions, from 1 to n. If n is a thousand, that means I am considering thousand by thousand points. If n is 5000, I am considering 5000 by 5000 points.
If there are n units, then the x coordinate of any point here will be i by n in the unit circle. So I will say 1, 2, 3, 4, 5, 6: let us say I count i up to n. 1, 2, 3, 4, 5, 6: I count j here up to n. So this is my i varying from 1 to n. This is my j varying from 1 to n. Any point which is represented by i, j is actually a point here or here. How will I know whether it is inside the circle or outside the circle? Well, there is a simple property: for any point which is inside the circle, its distance from the center will be at most 1, or at most n if n is the radius. So if n is the radius, then any point whose distance is beyond n is outside the circle, and any point whose distance is below n is inside the circle. How do I know the distance from the center to any point? Simple: the square of that distance is equal to the square of this distance plus the square of this distance, the square of the x coordinate plus the square of the y coordinate. On the circle itself, x square plus y square is equal to n square. If x square plus y square is less than n square, then I know I am inside the circle. If it is more than n square, I am outside the circle. Consequently this is the formula: I take the quarter circle, whose area is pi by 4; for each point i out of n and j out of n I count those which fall within the circle; and this count divided by n square is approximately pi by 4, which is the area of the quarter circle. This is a simple program. So here is the program estimating the value of pi. I declare integers i, j, n and count. Count will represent the number of points which are inside the circle. I input n; n could be 1000, 2000, 5000, as I said. In fact we will examine what happens when we give different values to it. Once I get this n, this is actually the algorithm. What is the program? I start with count equal to 0 and I set up a nested iteration: for i equal to 1 to n, which means for every point on the x axis, and for j equal to 1 to n, which means all points on the y axis.
So I am examining, at a given i value, all values of j; at the next i value, all values of j; etcetera. And the condition that we wanted to test is simply written as: i square plus j square should be less than or equal to n square. If i square plus j square is less than or equal to n square, that means the point is within the circle and I should count it. If it is not, I should not count it. Consequently whatever I count at the end, divided by n square, which is the total number of points, will be the ratio which gives me pi by 4. So this is the double iteration that I have set up. All that I am doing is: if i square plus j square is less than or equal to n square, I increment count by 1; otherwise I do nothing. At the end of all these double iterations, when I come out, I will have a certain value of count. Four times that value divided by n into n gives me pi. We agree it is a simple algorithm. We are not interested in dissecting this algorithm further at this stage, because it is conceptually simple. What we are interested in is finding out what happens when we execute this algorithm. Please note: integers i, j, n, count, and the algorithm says i square plus j square less than or equal to n square. Here is a quiz. We have declared our variables i, j, n as integers. What will be the effect on the estimation of pi? If we use integer declarations, will there be any problem? The choices are as follows. A: there will be negligible impact, because we have declared pi as float, and since we declare pi as float we expect all computations to automatically happen in floating point. B: no, the effect will be very large, because the division operation in the final formula, count divided by n * n, is of the type integer divided by integer, and therefore I may lose the complete meaning of the calculation. C: yes, it will be very large, because the values of some terms may be beyond the limits of integer representation. And D: none of these.
So, anybody for A? Yes, a few people think that the impact will be negligible because we have declared pi as float. Anybody for B? Very large, because we have said count divided by n square, and count and n square are both integers, therefore we believe that we will lose the result. Anybody for C? Yes, very large, because the values of some terms may be beyond the limits of integer representation. People are not very comfortable assuming that this will be the case, so they neglect this. Anybody for D? No. Surprisingly there is one. Let me tell you, the correct answer is C, not B. You have forgotten what the full formula for pi with the division was. What was it? Something multiplied by count divided by n square, and that something was a float. When you have float multiplied by integer divided by integer, the sequence in which the operations happen is left to right. Float multiplied by integer is an actual float operation, and you get a float value. That divided by integer is again a float operation. So consequently argument B will not be valid. Argument C is indeed valid. Consider the limit of an integer number, and consider what happens if n is very large. Suppose n is just about 1000 less than the maximum value that you can represent. What will n square be? It will be beyond representation. So consequently you will get very many wrong values, particularly for larger values of n, and you have to be very careful about crossing these limits. That is the purpose of this quiz. Here is the modification. I had float pi all right, but I now declare i, j and n all as float, and I declare count, which is an integer, to be long. So I preserve the precision and I can also hold very large values. With this I execute the algorithm. The algorithm is exactly the same. The execution gives me this. There is a mechanism here which has not been discussed so far. You normally execute a program by saying dot slash followed by whatever file name you have given to the object file.
Instead of the default name a.out, if you give the minus o option, the name you give becomes the object file. So I have said time, followed by dot slash and that file name. This time command actually does nothing but execute the command that you have given after it, but it starts a timer the moment execution starts and gives you the timer results when the execution stops. This is the easiest way of finding out how long a particular program took to execute. So here is the result. I give n equal to 1000. The value of pi is this. Real time is this. User time is this. System time is this. You can see that all of these are zeros. I shall explain real and user shortly. I again executed this program for n equal to 10,000. This time you get a value which is perhaps closer to pi, but you will see that the real time is 3.69 seconds, the user time is 1 second, and so on. Give 20,000. The user time increases to 4.3 seconds. Give 50,000. The user time increases to 26.71 seconds. Notice the difference. From 20,000 to 50,000 you have increased n only by 30,000. But the time has increased from 4.304 seconds to 26.71 seconds. I have deliberately marked the user time in blue, although there are three timings here: real, user and sys. The real time means the total clock time. If you time the run with your watch, the real time displayed will be the actual clock time the whole program took. The user time and the system time are the actual components where the computer spent time doing work for you. And these are two independent time counts. User time is actually the computing time taken by our C++ Dumbo. So your program's execution time is the user time. User does not mean how much time you as a user spent. User means the C++ Dumbo as a user of the computer system. But he refers to a main Dumbo called the operating system Dumbo, who takes some time in supervision; that is called system time, or sys time. The system time will generally be very little. The user time will represent the true time that your program has spent.
And the total real time also includes the time that you take for giving input, thinking, whatever. So just note down the times, going back to the earlier slide: for 20,000 about 4 seconds, and for 50,000 26.714 seconds. We now notice the computation that is happening here: i square plus j square is less than or equal to n square. This computation is being done iteratively, inside the loops. So it will be executed n square times. Now i and j are changing in every iteration. But n is not. n is actually constant, because it is set earlier. So there is no reason why I should make Dumbo compute n square again and again and again. I might as well compute n square once and use that value. Consequently I have a variation, version 2, where I say n2 is equal to n square and use n2 here. Notice that I have reduced the number of computations substantially. Consequently when I run this I expect to find a significant difference. However, I notice that when I run it for n equal to 20,000 I still get around 4.3 seconds of user time, and when I run it for 50,000 I still get about 26.898 seconds. Contrary to my expectation, there is no appreciable change. Why should this happen? Here is another quiz, the last one for the day. The execution time for the two versions is not appreciably different because: A, multiplication does not take a very large time; it is the division operation and the addition operation which are time consuming. B, since i and j are varying, i square and j square each take much longer than the n square computation. C, our Dumbo somehow figures out that n is not changing, so it calculates n square only once on its own, even in the earlier version, and uses only that value. And D, I do not know and also I cannot guess. So let us have a quick poll. Anybody for A, multiplication does not take a very large time? Nobody for A? There are two people. You are wrong, by the way. Multiplication does take a long time, much longer than addition. Anybody for B? Yes.
i and j are both varying. n is not varying. So somehow i square and j square each take much longer than n into n. Anybody for C? Yes, some people for C. Anybody for D? There are some. I am happy that there are some people in D. By the way, the correct answer is C and not B. Please note that no matter how frequently i and j are changing, at the time of the evaluation of that expression, at that moment, i has a fixed value, j has a fixed value and n has a fixed value. So there is absolutely no impact from what they have changed from or what they are going to change to later. At the time of evaluation each one of them has a fixed value, and therefore the computational time should be the same. Since, even after replacing the n square computation by a fixed value, we are still not seeing an appreciable reduction, you should conclude that C must be the right answer. And it is indeed the right answer. What we did by writing n2 equal to n square initially and then using it, the C++ Dumbo is smart enough to do on its own. When it analyzes your program while translating it into its machine language, it finds out that this fellow n is not changing in the loop. So why compute n square again and again? It will compute it once and always use that reference value. This is called optimization, which Dumbo does. That is the reason why you did not get this advantage. However, a much better strategy is to realize that instead of estimating the area under the curve over the full square, I can estimate it over half of it, which is a triangle, because I am examining symmetric numbers. After all, i, j and j, i will have exactly identical behavior: both are inside or both are outside. So consequently I make this modification: i varies from 1 to n, but j varies from i to n. Now I am estimating the area inside a triangle and finding the points which fall within the circle, and consequently I will be estimating not one fourth but one eighth of the circle, and I will get pi as 8 times the count divided by n2.
And indeed when I see the results I will see here that I take only 17.216 seconds. We will stop here.