We are continuing with our effort to design arithmetic data paths and their implementation in VLSI. Last time we saw a general introduction to data paths, and we also saw the requirement for high-speed data paths. We also saw that in the current context, not only is speed relevant, but we also have a constraint on power. So, over the next few hours of my talk on arithmetic, we shall first look into the basic structures of data path blocks and then go over to how to design low-power or low-voltage arithmetic for current requirements, particularly for 45-nanometer-and-below technologies. Though the examples may not actually use 45-nanometer models, if you scale a standard CMOS process down from, say, 130 nanometer to 45 or below, the techniques remain similar and we shall be able to reduce the power anyway. We discussed last time that there are a number of codes, different kinds of codes, which can be used for different chip implementations, but the simplest of all, which allows simple arithmetic to be performed, is of two kinds. One is called bit serial. In bit serial, the data appears bit by bit. Even if you have 8-bit data, it will require 8 clock cycles to enter the data, and if you have a larger number of bits, you need that much more time to enter them. The advantage of bit serial is obvious: the hardware is single, let us say one full adder is required to add two numbers, and for 8 bits this addition is simply performed 8 times. Since the hardware is minimal, it certainly offers a saving on silicon real estate and therefore on cost.
The second possibility is bit parallel, which means you introduce almost all the bits simultaneously into the adders. But obviously, since you have 8 bits or 16 bits, that many adders will be required, one for each bit or each combination of bits, and because of that the hardware requirement is very large. Since you are inputting the data in parallel, one normally believes that parallel bit processing will be faster than serial, though, as I shall say later, this may not entirely be true. In reality, if you really do the implementation on VLSI, it is found that the ratio of speeds in the two cases is smaller than expected: the parallel implementation, though it looks simpler, requires carry propagation, and the larger the number of bit blocks you go through, the more the carry propagation time increases. So in reality one finds that the delay of a parallel adder or parallel multiplier is definitely lower than that of a bit-serial processor, but not as much lower as one thinks. Obviously, the area of a bit-parallel chip will be larger, because you have to replicate the hardware as many times as the word length: if you have 8-bit data you require 8 adders or 8 multipliers. So you have a large amount of hardware, and therefore the area is very high compared to bit serial. It has also been found that, since in bit serial you perform the operation 8 times sequentially and in bit parallel you perform it 8 times simultaneously, the power dissipation is not much different between the two. However, looking at the two kinds of hardware available, bit-serial and bit-parallel adders or multipliers, it is prudent to think that a hybrid kind, a series-parallel chip implementation, may be easier to optimize than either bit serial or bit parallel.
Typical hardware looks something like this. You have bit-serial arithmetic, and bit-serial arithmetic is relatively simple. You have a full adder circuit shown here which receives inputs x_i and y_i and gives you an output s_i and a carry out c_i. The first bits are entered first and generate a carry, and this carry now has to be fed back for the next input bits x_1 and y_1. Since you require one clock cycle to enter the next bits, you also have to delay your carry through a flip-flop before returning it to the input of the adder, so that it is in phase with the next inputs x_1 and y_1, and this process continues as long as the input data stream appears at the input of the FA. Mathematically speaking, as we will see a little later again, the sum can be expressed as s_i = x_i XOR y_i XOR c_(i-1), and the carry output c_i is essentially the majority function of x_i, y_i and c_(i-1). Another interesting point: many operations in arithmetic require you to subtract one data value from another. We know the one's complement or two's complement technique allows that simple thing to happen, but even without complementing the whole data you can just feed x_i and the inverted y_i (y_i bar) into the full adder, and the output will then be the difference of x and y. So the subtractor and adder circuits are similar, except that the input which you need to subtract from the other has to be given as an inverted input.
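As a sketch of the behaviour just described, here is a minimal Python model (the function names are my own, purely illustrative): one full adder is reused bit by bit, with the `carry` variable playing the role of the carry flip-flop. For the subtractor variant I have assumed an initial carry of 1 together with the inverted y input, which completes the two's-complement subtraction; the lecture only mentions the inverted input.

```python
def bit_serial_add(x, y, nbits):
    """One full adder reused nbits times; `carry` models the flip-flop
    that delays the carry by one clock cycle."""
    carry = 0
    result = 0
    for i in range(nbits):                   # one clock cycle per bit, LSB first
        xi, yi = (x >> i) & 1, (y >> i) & 1
        s = xi ^ yi ^ carry                  # s_i = x_i XOR y_i XOR c_(i-1)
        carry = (xi & yi) | (xi & carry) | (yi & carry)  # majority(x_i, y_i, c_(i-1))
        result |= s << i
    return result, carry

def bit_serial_sub(x, y, nbits):
    """Same adder, but with y_i inverted at the input; the initial carry
    of 1 (my assumption) completes the two's-complement negation."""
    carry = 1
    result = 0
    for i in range(nbits):
        xi, yi = (x >> i) & 1, ((y >> i) & 1) ^ 1  # inverted y input
        s = xi ^ yi ^ carry
        carry = (xi & yi) | (xi & carry) | (yi & carry)
        result |= s << i
    return result
```

The returned carry of the adder is the bit that would be fed back through the flip-flop for a following word.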
Now, if you look at the structure of the bit-parallel arithmetic correspondingly, one can see a full adder shown here, similar to before, except that the part not fully shown is this: if you have a_i and b_i, with i perhaps going up to 8 or 16 or 32 or 64, then for each pair a_0 b_0, a_1 b_1, a_2 b_2 up to a_i b_i you will require one full adder. Each of them has to generate a sum and a carry, and the carry of the first full adder block is fed as the carry into the next full adder block, which receives the next bits a_1 b_1. So obviously there is carry propagation going on from adder 1 to adder 2 and so on and so forth. The typical expressions, which almost every one of us uses, are: s = a XOR b XOR c_0, with c_0 as the input carry. Please remember that for the first full adder the input carry is normally 0, because there is no carry initially coming from anywhere; however, if the stage is in between, this c_0 is actually received from the previous full adder stage. The carry part is essentially the OR of three terms: c_1 = a.b + a.c_0 + b.c_0. We will express them in another form a little later; for now let me just note that s_i and c_i can be expressed in terms of what we call generate and propagate signals, but let us wait before we come back to explain this. The other possible method of implementing an arithmetic adder or multiplier is through distributed arithmetic, which is essentially what is used in almost every processor. Your input data is actually loaded into a ROM; if you have n-bit data you have 2^n words of ROM capacity. The first bits are transferred to the adder, and initially, if there is no carry coming in, the first sum is output and a carry is fed back.
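The chain of full adders can be sketched in a few lines of Python; the LSB-first bit lists and the function names are just illustrative choices, not from the lecture.

```python
def full_adder(a, b, c):
    """s = a XOR b XOR c; carry is the OR of the three AND terms."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def ripple_carry_add(a_bits, b_bits):
    """a_bits/b_bits are LSB-first lists of 0/1; the carry out of
    stage i is fed as the carry into stage i+1 (carry propagation)."""
    carry = 0                    # the first stage normally has carry-in 0
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        s_bits.append(s)
    return s_bits, carry
```

The worst-case delay of this structure grows with the word length, because the carry may have to ripple through every stage.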
This is essentially an 8-bit or 16-bit register; each carry is fed here in 8 bits or 16 bits, and after a clock that register transfers through another ROM which is storing the data, and is fed back with the new data which is appearing. The data can enter bit parallel, and this is the adder-subtractor block, which may be a single block, a parallel block, or a combination of the two. Since you are using a register to shift, and you also have the memory to store, this is called a shift-accumulator implementation. This is typically what we do in real life. So, we now start with adders specifically, and we would like to see what the advantage is of a particular kind of hardware, what kind of logic style we should use for adders. Please remember that from the processor point of view we keep saying that the system should be more programmable, whereas in the case of VLSI we are more interested in seeing that the operations are faster, that is, the speed is higher; we are also interested in consuming much less power than one might think possible, and probably in minimizing the area as well. So there are three parameters in VLSI we optimize: the power, the speed and the area. Of course we may not optimize all three together, but these are the specs against which I will test or benchmark every circuit we think about, and check, for a specific requirement, which adder can be used. Here is a typical truth table for a full adder circuit: you have a full adder, you receive inputs A and B and an input carry, and you have an output carry and a sum. The table also shows something very interesting called the carry status, which we will look into, because this is how we are going to implement the hardware.
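The ROM-plus-shift-accumulate (distributed arithmetic) scheme described above can be made concrete with a hypothetical Python sketch. A common use of the technique is a dot product, so that is what I show here; the coefficient/LUT framing and all names are my assumptions, the lecture only describes the general mechanism. The 2^n-word "ROM" is a Python list, and one shift-accumulate step is performed per bit position.

```python
def da_dot_product(coeffs, xs, nbits):
    """Distributed-arithmetic dot product: sum(c_k * x_k).
    The ROM holds, for each combination of one bit from each input,
    the sum of the corresponding coefficients (2**K words)."""
    K = len(coeffs)
    lut = [sum(c for c, sel in zip(coeffs, format(idx, f'0{K}b')[::-1])
               if sel == '1')
           for idx in range(2 ** K)]
    acc = 0
    for j in range(nbits):            # one shift-accumulate step per bit weight
        addr = 0
        for k, x in enumerate(xs):    # the j-th bit of each input forms the address
            addr |= ((x >> j) & 1) << k
        acc += lut[addr] << j         # add the ROM word, shifted by the bit weight
    return acc
```

Only the adder and the shift are computed at run time; all multiplications are folded into the ROM contents in advance, which is exactly why the scheme maps well onto a shift-accumulator.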
If A and B, the data inputs, are 0 0 and the carry is initially 0, we know the sum is 0 and no carry is needed. The sum becomes 1 only when exactly one of the three bits is 1 (1 plus 0 plus 0). However, if two bits become 1 at a time, the sum goes to 0 (1 plus 1 gives 0), but a carry is then generated. By the same logic we get this interesting feature, which I will come back to when I return to the terms D, P, G: in other words, these are called kill (or delete), generate and propagate. One can see very carefully that if A and B are both 0, then whether C_in is 0 or C_in is 1, the output carry still remains 0. So essentially I am trying to say that, irrespective of whether the incoming carry is 0 or 1, as long as A is 0 and B is 0, the carry will always be deleted: whatever carry there was initially, the output becomes 0. If you look at the last two rows, where A and B are again equal but now both 1, the output carry is always 1. So we say delete (or kill) means the output carry goes to 0, and generate means you have to create an output of 1 for the carry. In between you can see a very interesting thing: for the four middle cases, A B equal to 0 1 or 1 0 with either carry, where A and B are neither both 1 nor both 0, you can see from the table that if C_in is 0 then C_out is 0, and if C_in is 1 then C_out is 1. So what is essentially happening is that C_out is the same as C_in, or, to say it in other terms, C_in propagates. Therefore, having seen this table, I think one can get a good idea of what kind of circuits we should implement which, for given data, will perform the following operations. Of course, this is essentially arithmetic and can therefore always be represented in arithmetic form.
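The carry-status column of the truth table can be written down directly as a small Python check (names are mine); the comments state what the table guarantees about C_out in each case.

```python
def carry_status(a, b):
    """Kill / propagate / generate classification from the truth table."""
    if a == 0 and b == 0:
        return "kill"        # C_out = 0 whatever C_in is (delete)
    if a == 1 and b == 1:
        return "generate"    # C_out = 1 whatever C_in is
    return "propagate"       # C_out = C_in

def carry_out(a, b, cin):
    """The full-adder carry: majority of the three input bits."""
    return (a & b) | (a & cin) | (b & cin)
```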
So, typically for the binary adder I have shown here, the expressions are: the sum is A XOR B XOR C_in, and if you expand the XORs you get a sum-of-products form with three product terms. The output carry is essentially the OR of the two-input product terms: two terms containing C_in and one A.B term. If you have more inputs, A B C D, you will have larger sum and carry expressions. Now, before we go ahead, I would like you to see a very simple implementation of whatever I have said so far. I have a logic representation of an adder: A and B are my inputs, S is the sum and C is the carry, and the basic idea, as I said, is S = A XOR B XOR C_in. We have two ways of doing it. We can do partial addition, what we call a half adder: a half adder only computes A XOR B as the sum, assuming that no input carry exists, so we only have a sum without a carry input. So the sum is nothing but A XOR B: a simple XOR circuit receiving the two inputs A and B gives the sum part, and the carry, of course, is then the AND gate of A and B, which is nothing but A.B. However, if you have an input carry, then you have to do two XORs: A XOR B is one function, and then you XOR it again with C_in, as shown here; A XOR B XOR C_in gives me the sum. The carry can be expressed in another simple form: (A XOR B).C_in is one term, and since A XOR B has already been created by me, I can put an AND gate with C_in as its other input; the second term is nothing but A.B, so another AND gate gives A.B, and the OR of these two gives me the carry. So you see, in a very simple way, two AND gates, one OR gate and two XOR gates can implement the full adder in real life. Now, why are we interested in looking into this? The reason is: how do I implement them in a real circuit?
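The two-XOR, two-AND, one-OR construction just described is easy to verify in Python (a sketch; the names are mine):

```python
def half_adder(a, b):
    """Partial addition: sum = A XOR B, carry = A AND B."""
    return a ^ b, a & b

def full_adder_gates(a, b, cin):
    """Full adder built exactly as in the lecture:
    two XORs, two ANDs and one OR."""
    p = a ^ b                   # first XOR (the half-adder sum)
    s = p ^ cin                 # second XOR gives the full sum
    cout = (p & cin) | (a & b)  # (A XOR B).C_in OR A.B
    return s, cout
```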
Assuming that we continue to use only CMOS blocks — you could use BiCMOS or even bipolar, but, as I said, the current technology, which may remain current for many more years, is complementary MOS technology — I will talk about the implementation of a simple adder in circuit terms on standard CMOS. As we have seen in our first block, the major component in a full adder is the pair of XOR gates; the rest is the easiest to implement, as we know, in CMOS. So I will only show you the major block in the full adder, the XOR, and I have given you some different design styles in which one can implement it. The first, of course, is static CMOS, which means for each N-channel device there is a P-channel device. The way I have implemented it is this: I have two series P-channel devices which receive inputs A-bar and B, and parallel to them another two series P-channel devices which receive inputs A and B-bar. Please remember these are just complements, and they are crossed, because in XOR you need A.B-bar plus A-bar.B. That is essentially what we are trying to build. If you look at the N-channel part, you have A-bar, B-bar as one series combination and A, B as another series combination of two N-channel transistors, and these are then paralleled. One can see that this 8-transistor circuit can implement a simple 2-input XOR. Now, as I said, this is static, and we may like to compare it with other styles and verify, for VLSI implementations, which of these styles one should use. Please remember that for a larger bit size, for higher speed or for lower power, you may have to choose among them as per your requirements, but let us see which one is optimally better. The other possible option is shown here; as I already said, you have 8 transistors for a static XOR.
An equivalent XOR block can be implemented using what we call complementary pass-transistor logic (CPL), or just pass-transistor logic. The basic idea is simple: we are implementing a multiplexer, and we know a multiplexer can implement any function. Please remember that the XNOR can be mathematically written as A.B + A-bar.B-bar, and the complement of that is the XOR; equivalently, A XOR B = A.B-bar + A-bar.B. So, to implement this simple XOR, all I do is use two pass gates which receive inputs A and A-bar, controlled by B and B-bar. Since this is a mux, only when B is 1 does A pass; if B is 0, A does not pass. But if B is 1, B-bar is 0, so A-bar does not pass; if B is 0, B-bar is 1, so A-bar passes. Conditionally speaking, one product term passes through one gate and the other through the other gate, their outputs are wire-ORed, and the inverter then gives the required function. Now, the problem with all pass-transistor logic, the CPL logic, is that since the pass transistors have no connection to the power supply, you need an inverter at the output anyway for the function implementation, and this inverter itself requires a larger size, because if you have to drive the line ahead, power must be supplied from somewhere, and this inverter provides that power through V_DD. There is also another problem: please remember there are capacitances associated with these nodes. So when B becomes 1, say, and A has to pass, and let us say A was 1 while the node was initially at 0, then for this 1 to transfer and charge the node capacitor takes some time, and the larger the capacitor, the larger the time.
Also, one can see that since it is a pass transistor, there is a threshold drop. Let us take an example: say this transistor has a V_DD of 2.1 volts and V_T is 0.8 volts, and the input A goes from 0 to 2.1 volts while the gate B goes from 0 to 2.1 volts fully. While the output node is low, the transistor conducts, because V_GS is much larger than V_T. But as the output rises towards 2.1 minus 0.8, which is roughly 1.3 volts, V_GS approaches V_T and the transistor turns off. Therefore the maximum output at this node will be 1.3 volts, even though the input was at 2.1 volts. This is what we call a one-threshold drop across a pass transistor. So with CPL the problem is that you do not reach the full signal level, and to restore it, and also to drive the next stage, you need an inverter; so you need a power source here, and you also need level restoration, which essentially comes from the inverter. To avoid some of these problems an interesting circuit has been suggested, called double pass-transistor logic, which alleviates them. The pass-transistor circuits shown here also have two more problems. One is called feed-through: due to the capacitance here, the output is always coupled to the input, which changes V_GS minus V_T on the transistor, and because of that there is an issue of charge sharing. Charge sharing and feed-through (feed-forward) are two major worries in a pass gate. The third, of course, is the kT/C noise, which cannot be minimized.
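The one-threshold drop can be put in numbers with a trivial sketch, using the example values from the lecture (V_DD = 2.1 V, V_T = 0.8 V). This is a first-order model only; it ignores the body effect, which in practice makes the drop even worse.

```python
VDD, VT = 2.1, 0.8   # example supply and threshold from the lecture

def nmos_pass_high(v_in, v_gate=VDD, vt=VT):
    """An NMOS pass transistor conducting a high level stops conducting
    when V_GS falls to V_T, so the output saturates at V_gate - V_T."""
    return min(v_in, v_gate - vt)
```

So an input of 2.1 V comes out as only about 1.3 V, while a low input passes undegraded; this is exactly why CPL needs the restoring inverter.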
However, if we want to reduce these feed-through and charge-sharing problems, other methods have been suggested. One, of course, is to directly put a P-channel transistor in parallel, driven by B-bar, and similarly another P-channel transistor driven by B. But that essentially does not give you full control. So the modified pass-transistor logic, called double pass-transistor logic (DPL), is shown here, and it is very simple. You have a P-channel transistor in parallel with an N-channel transistor, but they do not receive the same inputs, and that is the interesting part: the first device, a P-channel, has input B, while the N-channel in parallel with it is controlled by A, and the second pair is the complement of that — as we complemented the inputs on one side, the same is complemented on the other. Now, in this case, level restoration is always there: since one of the transistors will always be on, a full 0 can be passed through the N-channel and a full V_DD through the P-channel, and therefore the levels reached here will not suffer a V_T drop. Also, since this is a complementary pair in parallel, charge sharing can now be compensated through one device or the other; loss of charge on one side can be replenished by the other, and therefore there is no real charge-sharing problem. Feed-forward is still present, but you can see it is compensated, and therefore much smaller feed-through problems appear. So double pass-transistor logic, which is actually a very interesting logic, can also implement an XOR, which is shown here. Please remember that in this case you have two transistors for one product term and two for the other, four transistors in all.
We had 8 transistors for the static XOR; here, for CPL, 2 plus 2 plus 2, that is 6 transistors; and for DPL, 4 transistors. So you have 4 transistors, 6 transistors and 8 transistors, and you can see clearly that the area of DPL is the smallest, CPL next, and static the largest. There is another possibility: dynamic logic can be used, and in dynamic logic one of the most famous families we know is domino. What is shown here is called dual-rail domino; some books consider a single-rail version, but I am showing you the one which is most logical and easiest to pick up. Now, as we know, domino logic has two phases: one is called the precharge mode, the other the evaluation mode. Evaluation is performed by the N-channel logic network, whichever function you want to implement, with the ground path provided during evaluation by the lower N-channel clock transistor; during precharge, the upper P-channel transistor precharges the output. This is what we are actually looking for. The dual-rail domino shown here has an inverter whose output is fed back to a P-channel device called the weak P (the keeper). The reason we do this is that when phi is 0, during precharge, the P-channel turns on and the internal node becomes 1, so the inverter output becomes 0 and the keeper turns on, pulling this level to the full V_DD. When phi goes to 1 you evaluate: the internal node may become 0, the inverter output becomes 1 and the precharge path switches off; but since the keeper is a weak P, not of the same size, it supplies a small current and tries to keep the capacitor charged even then. This is a very standard technique. And if you count the transistors, 2 plus 2, plus 2, plus the clock and keeper devices, you reach around 9 transistors, so area-wise this is much larger than any of the styles we started with.
However, when domino came out, we knew of course that it is a dynamic circuit, and one of its major features is that at no time are the P and N networks fully on together, or even sharing current. Therefore the static current is very low in the case of dual-rail domino, or, as we say, the leakage current can be minimized; and at 45-nanometer-and-below technology nodes leakage is now a major worry, so probably we may have to look into dual-rail domino once again with much more sincerity than we have done so far. Now we can do some comparisons of the different styles. This is a standard 1.2-micron technology; Magic was used for the layouts and a standard SPICE 2G was used to simulate the different blocks which I have shown here, with a supply of 3.3 volts and a chosen frequency of 100 megahertz. This table gives the parameters of my interest in VLSI for the different styles in which I can implement the XOR. These are actually figures for the total adder circuit, based on XOR and AND gates; the circuits I showed were only for the XOR, but similar things can be done for the NANDs and NORs, or ANDs and ORs, so this is the total adder circuit parameter evaluation. The parameters are power, worst-case delay, energy (which is power into delay) and area. The area is normally expressed in terms of widths in microns, simply because the channel length L is common to all devices: the area is W_1.L plus W_2.L up to W_N.L for the N transistors in series or parallel, that is, (W_1 + W_2 + ... + W_N).L, so with L constant for a technology we express area per unit channel length, which is the total width requirement.
So, for example, if you look at static CMOS, the average power is 34.3 milliwatts, the worst-case delay is 2.33 nanoseconds, and the energy is 79.9 picojoules. Please remember where the word "worst" comes from: the current drawn depends on the input data, and some input corresponds to a critical path where the available current is smaller, so the node charges slowest; over all input data, whichever case is slowest is called the worst-case delay, and that is what is specified here. The figures given are:

style               power (mW)   worst-case delay (ns)   energy (pJ)   area (total W, microns)
static CMOS         34.3         2.33                    79.9          27,355
CPL                 34.5         2.24                    77.3          17,400
DPL                 27.5         1.98                    54.5          19,170
dual-rail domino    82.5         1.78                    146.9         32,445

Dual-rail domino has the largest number of transistors, so obviously you can see that the larger the number of transistors, the larger the area, and the smaller the number of transistors, the smaller the area. Now one can see from this table that if you are only looking for high speed, high performance, obviously the dual-rail domino has the highest speed. But its power requirement is excessively high, 82.5 milliwatts, and correspondingly the energy, the product of the two, is 146.9 picojoules. Whereas if you look at static CMOS in comparison, its power of 34.3 milliwatts is much smaller than dual-rail; its delay is larger, 2.33 against 1.78 nanoseconds, about half a nanosecond more; but the energy is only around 80 picojoules compared to 146.9, and the area of course is somewhat smaller, 27,000-odd compared to 32,000-odd, even though, as we saw, it uses 8 transistors per XOR.
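A quick cross-check of the table, assuming the energy entries are simply power times delay (mW times ns comes out directly in pJ), which is consistent with every row:

```python
# Power (mW) and worst-case delay (ns) quoted for each style.
styles = {
    "static CMOS":      (34.3, 2.33),
    "CPL":              (34.5, 2.24),
    "DPL":              (27.5, 1.98),
    "dual-rail domino": (82.5, 1.78),
}

# Power-delay product: mW x ns = pJ.
energy_pj = {name: p * d for name, (p, d) in styles.items()}

# The style with the lowest energy per operation.
best = min(energy_pj, key=energy_pj.get)
```

Running this reproduces the energy column within rounding and confirms that DPL has the lowest power-delay product of the four styles.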
So, in general, you can see that both static and dual-rail domino are comparatively not very suitable when you consider speed, area and power together. On the power scale alone you have CPL at 34.5 milliwatts and DPL at 27.5 milliwatts. Correspondingly on delay, CPL at 2.24 nanoseconds is lower than static, but DPL is even faster at 1.98 nanoseconds. On energy, therefore, CPL is at 77.3 picojoules, close to static and much smaller than dual-rail domino, but DPL has 54.5 picojoules; and the area between CPL and DPL is not very different, hardly a few hundred to two thousand microns apart. So what is essentially seen from this is: unless speed is the only criterion for you, the high-performance case, in which you may use adders based on dominoes or dual-rail dominoes at the obvious cost of power, then in case you are looking for an optimal design to a great extent, which is what most circuit designers will prefer, use either CPL or DPL. Comparing the two, one has slightly smaller area and the other slightly larger, but with DPL you get better speed and much lower power comparatively, and therefore double pass-transistor logic is often chosen as the candidate for optimal adder design requirements. So, I have already said that one criterion for the choice of hardware comes from the fact that you may have different adder architectures, but each block in them will certainly have XOR and AND-OR gates. Which style to use, we have now compared by the numbers, and we feel it is always possible to optimize for whatever criteria one sets for a VLSI chip. Coming back to what I was discussing: the reason I went into this in between is that, before I go to the actual architectures, I wanted to tell you that any architecture can be built around one of these design styles.
Of course, there are a few more styles which I did look into. One is called zipper logic. Zipper has its advantages, being slightly better than dynamic or domino logic, but it requires two power supply lines, which may need a little extra area to run everywhere; also, at times there is an issue of levels not reaching their full value, because the clock is driving them, and then you may require further restoration, which may consume power. However, let us come back to the more architectural viewpoint of an adder. How do we actually implement an adder? In the normal case we first define three kinds of functions for any inputs a and b. Please remember that a and b here stand for individual bits, in the sense of a_i and b_i: a_0 b_0, a_1 b_1 and so on, and the word can be 8, 16, 32 or 64 bits. I have not written the subscript i explicitly, but you should assume that each bit of those larger numbers is being represented here. We define three functions. One we call the generate function, written as g (some books write a small g, I have written capital; whichever way it is written): the generate function is nothing but G = a.b. Then there is another function we want to define, the propagate function, and the propagate function is nothing but P = a XOR b. There are certain alternate definitions and implementations in which the propagate function is defined as a + b; for our case, as of now, we will define it as a XOR b. And then there is a delete function, sometimes called the kill function; in some of the hardware diagrams which I copied from journals and papers there may be a signal called kill, which is essentially the same as delete, and it is nothing but a-bar.b-bar.
So, now, what we are trying to say is that you can generate these functions from a and b, because if I know a and b, which are your input data, I should be able to know g and p, since a.b is known and a XOR b is known; similarly I have the kill or delete function, a-bar.b-bar. These can always be evaluated by simple blocks in the design styles which I have already explained. So I can generate g, I can have the propagate function p = a XOR b, and I can have the kill or delete function a-bar.b-bar. Once I do this, I know the output carry, which is nothing but a function of g and p. Please remember what the output carry was: c_o = a.b + (a XOR b).c_i, equivalently a.b + a.c_i + b.c_i. These terms can be represented by one simple function: c_o = g + p.c_i. The sum is nothing but a XOR b XOR c_i, and since p is nothing but a XOR b, one can write the sum function as s = p XOR c_i. So now I know that c_o is g + p.c_i and s is p XOR c_i, and since I know my function g and my function p (and, if I needed d, that function is also known to me, though it is not shown here), we can always say that, based on knowledge of the data alone, I actually know what my sum and output carry will be. This is the table which I showed you earlier, and you can see a very interesting thing from it before I go ahead. If a and b are 0, g is 0, which means c_o is only p.c_i; and if a and b are 0, a XOR b is always 0, so p is 0 as well.
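The generate/propagate/kill form can be checked directly against the plain full-adder equations in a few lines (function names are mine):

```python
def gpk(a, b):
    """The three data-only functions of the lecture."""
    g = a & b                # generate: G = a.b
    p = a ^ b                # propagate: P = a XOR b
    k = (1 - a) & (1 - b)    # kill/delete: K = a-bar.b-bar
    return g, p, k

def stage(a, b, cin):
    """One adder stage expressed through g and p:
    c_o = G + P.c_i, s = P XOR c_i."""
    g, p, _ = gpk(a, b)
    cout = g | (p & cin)
    s = p ^ cin
    return s, cout
```

Note that for every input pair exactly one of g, p, k is 1, which is precisely the kill/propagate/generate partition of the truth table.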
So whatever c_i you have, the output carry must go to 0; that is the kill: whatever carry was present, the output must become 0. If you look at the first two rows of the table, a and b are 0, and whatever the carry input is, 0 or 1, the output carry is 0; that is why I call it delete. If you look at the sum for the same case, s = P XOR c_i with P = 0, so the sum is 0 XOR c_i: if c_i is 0 the sum is 0, and if c_i is 1 the sum is 1, exactly as the truth table shows. So the truth table we wrote is essentially representing the functions written here. Now look at the other, more interesting case, the propagate function. If a is 1 and b is 0, or a is 0 and b is 1, G is still 0, so the G term is 0; but a XOR b is certainly 1, because exactly one of them is 1 and not both. So P = 1, and c_o = P·c_i = c_i. If you look at the table again, for the four rows covering (a, b) = (0, 1) and (1, 0), whatever c_i is, c_o is exactly the same: the input carry propagates to the output carry. For the sum, since P = 1, we have s = 1 XOR c_i: if c_i is 0 the sum is 1, and if c_i is 1 the sum is 0, again matching the table. So what we are essentially saying is that the truth table is much more easily represented, and the circuit much more easily built, if you can generate these functions independently. Finally, the last case: if a and b are both 1, then G = 1, and a = b = 1 means P = 0, so the P·c_i term goes away and c_o is nothing but 1.
So if a and b are both 1, c_o is 1 irrespective of what c_i you have, because P = 0 and the P·c_i term drops out. Both a and b being 1 requires your output carry to be 1, and therefore we say the carry is generated; that is what I was saying: c_o = G when P = 0. For the sum in this same case, since P = 0, s = 0 XOR c_i: if c_i is 0 the sum is 0, and if c_i is 1 the sum is 1. So the truth table of a full adder can essentially be represented by this carry-status method, and it completely describes the full adder. All the adder implementations ahead, except perhaps the initial few of the ripple-carry kind, will have logic that first generates these P and G functions; once we have those functions, s and c_o can be expressed in terms of P and G. This is the technique all adder implementers use. You can also see why people look for this. Please remember the advantage: if something has to remain 0, you do not have to do anything; if it was 0 and must remain 0, you do nothing; if it was 1 and you want to retain 1, you do not have to do anything. So operations can be minimized: depending on the functions you choose and the kind of data you receive, you know which functions will be useful and which may not be required, and the hardware can be arranged to do only the basic requirement, with lower power dissipation and lower area. The first and most common adder is the one given the name ripple carry. The word itself says the carry is rippling. It is the simplest scheme; we are not even generating P and G here; it is independent of any function generation.
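Before moving on to the ripple-carry adder, the kill/propagate/generate classification above can be checked exhaustively against the actual carry-out. A small sketch, with names of my own choosing:

```python
# Classify each (a, b) pair by its carry status and verify the claimed
# behaviour of c_o for both values of c_i.

def carry_status(a, b):
    if a == 0 and b == 0:
        return "kill"       # c_o = 0 regardless of c_i (D = a'·b' = 1)
    if a == 1 and b == 1:
        return "generate"   # c_o = 1 regardless of c_i (G = a·b = 1)
    return "propagate"      # c_o = c_i (P = a XOR b = 1)

def carry_out(a, b, ci):
    return (a & b) | ((a ^ b) & ci)

for a in (0, 1):
    for b in (0, 1):
        for ci in (0, 1):
            status = carry_status(a, b)
            co = carry_out(a, b, ci)
            if status == "kill":
                assert co == 0
            elif status == "generate":
                assert co == 1
            else:
                assert co == ci
```

This is exactly the observation that makes the status method useful: in the kill and generate cases the output carry is fixed before c_i even arrives.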
You have full adder circuits, and we have already discussed that the full adder sum is s = a XOR b XOR c_i and the carry is c_o = a·b + (a XOR b)·c_i. So each full adder gives you the two outputs s and c_o. Shown here is a 4-bit adder built from such cells; the first (LSB) pair of bits enters the first cell and the MSB pair the last. When I add the LSBs, I generate the first sum bit, the LSB of the sum, but I also create a carry, and this carry is required for the summation of the next pair of bits. So even if I apply all the a and b inputs simultaneously, the second stage's carry input is not known to it until the first stage's operation is over; only when the first stage settles do I get its carry, and only then can the second stage produce its sum and its carry, and so on; for the fourth bit, only when c_o2 is generated can I generate the final carry. One can see from here that with a larger number of bits the carry must travel through more stages: this is only 4 bits, and the carry goes through 3 stages; with 8 bits it has to go through 7 stages, being generated at the first cell and then propagated ahead stage by stage. So the larger the number of bits, the larger the delay, because the carry requires that many additional paths to go through. Typically, then, for an n-bit adder the delay is (n - 1) carry delays plus one sum delay, because all the other sums are computed while the carry is propagating.
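The rippling behaviour can be modelled directly: each stage's carry feeds the next, so the loop below walks through the same (n - 1) carry hops as the hardware. This is a behavioural sketch only; the function name and bit-list convention (LSB first) are my own:

```python
# Ripple-carry addition over bit lists, LSB first. The loop-carried
# variable `carry` models the carry rippling from cell to cell.

def ripple_add(a_bits, b_bits):
    """Add two equal-length bit lists (LSB first); return (sum_bits, carry_out)."""
    carry = 0
    s = []
    for a, b in zip(a_bits, b_bits):
        s.append(a ^ b ^ carry)                   # s = a XOR b XOR c_i
        carry = (a & b) | ((a ^ b) & carry)       # c_o = a·b + (a XOR b)·c_i
    return s, carry

# 4-bit example: 6 (0110) + 7 (0111) = 13 (1101).
s, co = ripple_add([0, 1, 1, 0], [1, 1, 1, 0])
# s == [1, 0, 1, 1] (LSB first), co == 0
```

In hardware, of course, all stages exist in parallel; the sequential loop here mirrors only the carry dependence that creates the critical path.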
Only the last stage requires its sum to be computed after the final carry arrives, because the other sums are being performed while the carry is travelling ahead. So one can see that the net adder time is t_adder = (n - 1)·t_carry + t_sum, which is the critical path time. The goal, of course, is to make the fastest possible carry path; the ripple carry shown here is the slowest and has the largest delay. Please remember that if n is very large, the sum time may be very small in comparison; it is the carry time that essentially decides the speed of this adder. Last time I only showed you the XOR circuit; before I leave this part, let me show the full adder implemented here, that is, implementing the complete sum and carry functions for the inputs a, b, c_i. This is the XOR, as you can see, and the other AND and OR gates required for carry and sum are created here. The carry, as you can see, is very easy to create, so it comes out fast, but the sum requires all these additional operations. If you count the transistors, each full adder in static CMOS requires 28.
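The critical-path formula is worth seeing with numbers. A tiny sketch, where the picosecond values are purely illustrative assumptions, not figures from the lecture:

```python
# Critical path of an n-bit ripple-carry adder:
#   t_adder = (n - 1) * t_carry + t_sum

def ripple_delay(n_bits, t_carry, t_sum):
    return (n_bits - 1) * t_carry + t_sum

# With hypothetical t_carry = 100 ps and t_sum = 150 ps:
#   4-bit adder:  3 * 100 + 150 =  450 ps
#  64-bit adder: 63 * 100 + 150 = 6450 ps
```

Notice how quickly the carry term dominates as n grows, which is exactly why the sum delay hardly matters for wide adders.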
So it is a huge area requirement. But please remember, all said and done, at the end of the day I would say: implement in static CMOS if possible, because of the other power issues we discuss in power-aware design. At very high frequency the issues are glitch power and leakage power, which can be extremely high, and with series connections like these the leakage currents are at their minimum; therefore static CMOS gives the lower-power circuits, at the cost of area, which may not seem obvious when I show you this kind of block. Before I quit this, I want to mention a very interesting property the logic allows. If I have a full adder that receives inputs a and b and a carry c_i, it generates a sum and a carry. Identically, if I feed the full adder the complemented inputs a', b' and the complemented input carry c_i', and then complement the outputs once again, I get back the same sum and c_o. This can be proved by simple logic: complement everything, complement again, and you are back where you started. This is called the inversion property. So you do not have to build such functions twice; normally every logic stage will anyway have a, a', b, b', c, c' available, true and complement, and if they are already available you can use them to advantage. So before we go ahead, we can say: minimize the critical path by reducing inverting stages; please remember the inverting delay is the largest and the non-inverting delay is the smallest. So you use blocks called odd cells and even cells: the first cell, instead of producing c_o, produces c_o', and the next cell can then work on the complemented signals. An odd cell should receive complemented inputs and an even cell true inputs.
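The inversion property is easy to verify exhaustively. A minimal sketch (the function name is mine):

```python
# Inversion property of the full adder:
#   FA(a', b', ci') = (s', co')  where  FA(a, b, ci) = (s, co)

def full_adder(a, b, ci):
    s = a ^ b ^ ci
    co = (a & b) | ((a ^ b) & ci)
    return s, co

# Complementing all three inputs complements both outputs, for all cases.
for a in (0, 1):
    for b in (0, 1):
        for ci in (0, 1):
            s, co = full_adder(a, b, ci)
            s_n, co_n = full_adder(1 - a, 1 - b, 1 - ci)
            assert s_n == 1 - s
            assert co_n == 1 - co
```

This is the property that lets alternating (odd/even) cells drop the output inverters on the carry path.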
If you can do this, the delay will be less than before, and therefore you can have an overall improvement by using the inversion property. Of course, we will come back tomorrow, but maybe we can talk quickly about another possibility, called the mirror adder. This is the first circuit that actually implements the functions we generated: c_o = G + P·c_i for the output carry and s = P XOR c_i for the sum, both functions of G and P. In the circuit shown, the c_i input sits in some kind of series gate, with a and b applied to p-channel and n-channel devices carrying the same inputs; across this you have a and b in series (the generate path) and a and b in parallel, in series with c_i (the propagate path). The idea is this: if a = 0 and b = 0, we want the carry to go to 0 independent of c_i. With a and b both 0, the series p-channel devices conduct and the internal node goes to VDD; that node is the complement of c_o, so c_o itself is forced to 0; this is the kill. If a and b are both 1, the series n-channel devices turn on, the internal node goes to 0, and a carry is generated: c_o = 1. If a = 0 and b = 1, or the reverse, either series chain is off, and the output then depends on c_i: the parallel a, b devices together with the c_i device form the propagate path, so the internal node becomes the complement of c_i and c_o follows c_i.
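The carry stage just described can be sketched as Boolean pull-up/pull-down conditions. This is my own behavioural model of the standard mirror-adder carry network, assuming the internal node holds the complemented carry:

```python
# Mirror-adder carry stage: the NMOS pull-down conducts for
# (a AND b) OR (c_i AND (a OR b)); the PMOS pull-up is its mirror.
# The internal node is the complement of c_o.

def mirror_carry_bar(a, b, ci):
    pull_down = (a & b) | (ci & (a | b))
    return 0 if pull_down else 1      # node low when pull-down conducts

# Exhaustive check: the node is always the complement of G + P·c_i.
for a in (0, 1):
    for b in (0, 1):
        for ci in (0, 1):
            co = (a & b) | ((a ^ b) & ci)
            assert mirror_carry_bar(a, b, ci) == 1 - co
```

The check confirms that a·b + c_i·(a + b) equals G + P·c_i, which is why the mirror network needs only a two-transistor series stack in the carry path.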
So I can propagate a 0, I can propagate a 1, I can kill, or I can generate. With this carry I can then attach the standard logic once again, a similar cell with a, b, c_i, feeding the newly generated carry forward, and keep generating the sums in that manner. As you can see, this is still a ripple-style arrangement: you keep feeding a and b stage by stage, and at the end you get c_o and s' coming out. The total number of transistors in a mirror adder is 24, compared to the 28 I showed earlier. The stick diagram shows the VDD line, the ground line, and the poly lines common to a, b, and c_i; using standard techniques we can always turn this stick diagram into a good-looking layout. Before we end, what is the advantage of a mirror adder? Look at the NMOS and PMOS chains: whatever is on one side is exactly mirrored on the other, with each p-channel device replaced by an n-channel device. The NMOS and PMOS chains are completely symmetrical, and that is the biggest advantage in any implementation. A maximum of two series transistors is observed in the carry-generation circuitry. When laying out the cell, the most critical issue is the minimization of the capacitance at the carry node; please remember this node is the most crucial one and has to be fast, because the carry time is the major part of the delay, and the capacitances hanging on this node come from several devices.
So the node capacitance and the wiring at this point decide the propagation time in a great way. When laying out the cell, the most critical issue is the minimization of the capacitance at the carry node, and the reduction of the diffusion capacitance there is particularly important; as I already said, the capacitance at this node is composed of the diffusion capacitances, the internal gate capacitances, and six gate capacitances of the connecting adder cell. In connecting the adder cell, the transistors connected to c_i are placed closest to the output. This has to be understood: wherever c_i appears, it should always be closest to the output. We know very well from our design practice that the signal arriving last in the evaluation should come last in the stack, so that the internal nodes can charge or discharge before it switches; otherwise those nodes will take extra time to fully charge or discharge. Placing the latest-arriving signal closest to the output is the standard technique we use here. Only the transistors in the carry stage have to be optimized; the sum stage does not matter very much, so there all transistors are of small or minimum size. The only larger transistors you require are in the carry stage, for delay optimization. We will come back a little later to the other adder architectures we can implement and actually see whether they turn out better or worse than the ones we have discussed. Thanks for the day.