So, what I will do in the course of this lecture is describe Handel-C and the basis for it, give you some examples, and show you how hardware engineering has been turned into software: from the language level, the Handel-C level, you can have very tight control on the kind of hardware you lay down on silicon. You do not have to know anything about electronics; just by playing around with the language you are actually tinkering with your hardware architecture. Your program is not interpreted on hardware; your program denotes hardware, your program is hardware: it describes the structures which exist in your hardware. So by playing around with the structure of your program you are playing around with the structure that you create in hardware. So nobody should ask me at the end of the class what OS is used in this. There is no OS: you take a Handel-C program and compile it, using a set of tools, into something which either programs an FPGA or gives you a netlist with which one can build an integrated circuit, send it off to a fab and make a chip out of it. And it does not need you to have any knowledge of electronics, flip-flops and things like that. The same kind of argument applies as to whether you need to know assembly language in order to build embedded systems; I would say now maybe not, it would perhaps even be a liability. We will also work through a slightly larger example, how to build a RISC processor, which is the basis for the lab exercise you will do this afternoon, and we will talk about some advantages of reconfigurable hardware. So what is Handel-C? It is a programming language, a programming language which helps you take your algorithms into hardware. We have used it to great effect for building useful IP in our labs for BARC and people like that: we built a reactivity meter, which is essentially a big Kalman filter in hardware. We have used it to build encryption engines, elliptic curve (ECC) encryption engines. We have used it to build inertial navigation system IP, video codecs, and all sorts of useful stuff. The reason why people like BARC were interested is that typically they have modules, like the reactivity meter's Kalman filter, running on Windows-based machines, and they are naturally not very comfortable with proprietary operating systems. First, there is too much infrastructure there, and you do not know exactly what is inside it. It is very difficult to certify any application that you run on it, because to certify a given application you have to certify the entire infrastructure it runs on, and to certify Windows is a pain. So we thought, why not just certify the algorithm and run the algorithm on generic hardware, which is an FPGA. An FPGA is very easy to certify because it is generic, like memory: you put whatever program you like on it and it becomes that hardware. Then the burden of certification lies only on the program; you can prove it correct, test it to glory, do whatever you like. And this has been a very successful project.
Our goal was to make it at least as fast as the Windows deployment, because the Windows deployment of this algorithm runs on a Pentium processor, highly optimized silicon with floating point hardware inside it, which handles numerical applications quite well. And we were able to get it to work on an FPGA clocked at a fraction of the speed: the clock speed on a Pentium, as you know, is a couple of gigahertz, and we have been able to get this going at, I think, less than 100 megahertz on an FPGA, which shows you the amount of parallelism we can harness. Typically, at 10 megahertz on an FPGA you can do what would need well above a gigahertz on sequential hardware, just by exploiting the parallelism. We have done this and found it very useful; it runs a few times as fast as the PC implementation, which is quite promising. So Handel-C is a programming language; it is not a hardware description language, it does not talk in terms of flip-flops and things like that. You use it to compile high-level algorithms into gate-level hardware. The syntax is loosely based on C, so it is using C as a Trojan horse, if you like: if I invented a brand new language and gave it to you, you would have an immediate reaction against it. These people have been smart; they have embedded the concepts in ANSI C and made ANSI C plus this extra bit compilable into hardware. It is to hardware what C is to microprocessor assembly code. It was invented by Ian Page, and it is based on Hoare's Communicating Sequential Processes (CSP) model, so there is a good theoretical background to it; Occam, which I have mentioned, comes from the same family. It is being used by people like Ericsson, BAE (British Aerospace) and Creative Labs, and they typically find that with this kind of paradigm their design cycle times come down to about half. The hardware design is exactly the hardware specified in the source; there is no intermediate interpreting layer, unlike assembly languages which target general-purpose microprocessors. Logic gates are your assembly instructions. And the most important thing, again, is time to market: time to market nowadays gives you no choice but to move to these kinds of tools, because you cannot get the productivity you need using plain C; on the software side nobody writes that much raw C anyway, it is UML and similar higher-level tools that people use. So here you can design, redesign and optimize at a software level even before you have deployed anything on hardware. Parallelism is true parallelism; I will give you examples of what that means. Your key construct here is the par construct, where you have two or more blocks separated by semicolons, and what it means is that the instructions of block A and block B are executed in parallel, at the same instant, in lockstep, by two separate pieces of hardware. If I have n such blocks, n pieces of hardware are created, all of them stepping in parallel, and branches that complete early are forced to wait, by the hardware which is produced, until the longest block completes, and then you fall out of it.
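To make that concrete, here is a minimal sketch of a par block; the variable names are mine, purely for illustration, not taken from the slides:

```
// Minimal sketch of the par construct (illustrative names).
// Both assignments below become separate pieces of hardware and execute
// in the same clock tick; written sequentially they would take two ticks.
unsigned int 8 x, y;

par
{
    x = x + 1;    // block A
    y = y << 1;   // block B, in lockstep with block A
}
```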
So this is all you need to know about the semantic model: assignment takes one clock tick. It is synchronous hardware we are talking about; synchronous means there is a global clock. The only thing you need to know is that an assignment statement in this language takes one clock tick and everything else is free, so to speak. That has consequences, which you will see, but the nice property that falls out of this axiom is that you can look at a program and know what its timing behaviour is going to be. In a typical C program you never really know what is going on; if you define various threads you will not know which thread is racing ahead of the other. In this, you will know at every clock tick what is happening, so you have cycle-accurate control over the hardware that you produce. The meaning of this will come out as we go along. Now, if you compare it with C, it is similar in the sense that your program is inherently sequential, with similar control flow constructs: if-then-else, switch, while, for and so on. It is dissimilar in the sense that it has no malloc and no dynamic store allocation, no recursion, no nested procedures, no stdin or stdout; main is void because there are no command line arguments. But it has variable-width words, which are very significant and which I will talk about, and it has par and such constructs. Can you tell me why it does not have malloc? Exactly: this is hardware we are talking about. Malloc needs dynamically managed memory, virtual memory and things like that; no recursion and no nested procedures for the same reason. But having said this, it is almost a lie, because no language is frozen at any given point in time; this is just the point the technology has reached now. Much of this divide between what should be in hardware and what should be in software exists because of the specific technology at a given point in time. Memory was expensive in the past, which is what has put a lot of boundaries in our heads as to what should be in memory and what should not. But memory is becoming so plentiful that you do not even need to think of the hard disk any more; everything can be in the hardware itself. So why rubbish the claim: why not have malloc, why not implement virtual memory in hardware too? As memory becomes cheaper and you no longer need to think of staged memories, cache, RAM and then disk, why not put everything in hardware? This might change tomorrow; you might get recursion and all these things as well. Nothing here is patthar ki lakeer, a line carved in stone; the space out there is very exciting and things are changing very fast. And typically, with many embedded applications you do not want a hard disk anyway, because a hard disk leads to unpredictability, and normally you are talking of safety-critical applications and things like that.
You look at worst-case delays in your tasks, as you would have encountered yesterday, and the problem with putting virtual memory in a system is that you cannot always predict your worst case. Your worst case is that you ask for a piece of information, it is not in the cache, it might not be in memory, it might be on the hard disk, and then you have seek time and all those kinds of things; it might take too long. So you do not want a hard disk in a safety-critical application. This space is changing; even disks might not exist tomorrow, so things are changing quite dramatically. Handel-C has the usual standard ANSI C constructs, but it does not have things like printf and puts. Unfortunately it does not yet have library functions like sine, cosine and square root either; maybe they exist now, but a couple of years ago they did not, and we had to build a number of these functions ourselves. It supports the usual things: variables, arrays, the switch statement, the for loop and so on. And you can see what is going on: if you were a language designer and you wanted to take C to hardware, you can think of hardware analogues of all these things. What is a variable? A register, a memory location. What is an array? A RAM. The switch statement and if-then-else are a multiplexer kind of thing, and a for loop is basically a guard with which you can loop. So you can already see, if you were a language designer, how you would compile your program into hardware itself, which is quite exciting. The way you do your scheduling, or rather your control flow, is by having a notion of tokens inside your hardware: a token is the point where a handshake happens, where control is passed from one part of your hardware to another, and that is how you control the flow of the system. An important construct they have, which is not in C, is the notion of a channel. A channel, if you like, is like a shared variable used for synchronization and communication. The channel is the basis on which you would implement things like pipelines: in a codec with different stages, you would implement the communication between the modules in terms of channels. Here is an example of a channel: link ! v and link ? v, where link is the channel. link ! v means you put the value v on the channel, and link ? v means you take a value from the channel into v. It is used as a means of synchronization between hardware blocks in your system, where communication is an event which happens when one side is willing to put a value on the channel and the other is willing to read, in which case the event happens instantaneously. If there is no value on the channel and one side is waiting for input from the channel, it is blocked until the point that it gets a value. That is how synchronization happens. It provides a link between parallel hardware branches, where one branch outputs data onto the channel and the other reads, and data transfers only when both processes are ready. Any questions? Yes. See, with messages you have a queue, while this is instantaneous.
This is like a handshake: one variable, just one, a queue of length one if you like. With a message queue you can post the message and continue working; here you are blocked until the other side can take it, essentially. Pardon? Like a semaphore, in a way; it may be implemented using something like that, but let us not talk about semaphores, you do not even need them at this level. A semaphore is an implementation mechanism, and we need not know about it here. So here is a fragment of a program: unsigned int, 8 bits wide, a; a channel of unsigned int, 8 bits wide, c. You can say, on channel c I will write, bang, the value 5, and in another part of the program I will read from channel c into a variable called a. What happens is that you have two parallel bits of hardware: one is writing the value 5 onto the channel, the other wants to read a value into a, and the event is instantaneous. When one is willing to transmit and the other is willing to receive, that event happens and both of them continue. If one side is ready and the other is not, the ready one is blocked, waiting here, until the other side arrives. So it is a point of synchronization: you bring both sides up to this point, the event happens, and they continue. Is that clear? Yes, and you do not even think about commit points; you are thinking in terms of hardware now, and that is up to the implementors. Do not take too much of that headache on yourself. As a language designer I am giving you a construct: if one side does channel bang value and someone else does channel query value, that communication will happen, and it is a blocking construct; when both sides are ready, the event happens. You do not need to know anything more. It is up to me as a compiler writer, as a compiler designer, to take care of all these details when I map this into hardware: how to commit, how to manage the signals and all that. It is a very good question, in the sense that it is the first time I have been asked it, and it shows you what the language designer has done: he has insulated you from the hardware. If you just accept this paradigm, this axiom, that this is the channel construct and this is how it works, you do not need to care about how it is actually built in hardware. It will be built in a way which is consistent with the behaviour I have explained to you. This also shows how, by layering the problem, a lot of detailed work has been taken away from you as to how these constructs are actually implemented, so you can play around at a much higher level of abstraction and be much more productive. You will see more examples of this, and another construct in a moment. Note what is happening: these people are giving you a new language, a new vocabulary with which to talk hardware, which has nothing to do with flip-flops and all that. This vocabulary is very expressive, it is very high level, and you can be much more productive in it. It is up to the language designers to think about what is useful, and on that basis they come up with these constructs; you can come up with more.
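Coming back to that fragment with channel c, here is a minimal sketch of it written out (reconstructed from the description, so the slide code may differ slightly):

```
// Two parallel branches synchronizing over a channel.
unsigned int 8 a;
chan unsigned int 8 c;

par
{
    c ! 5;    // one branch offers the value 5 on channel c
    c ? a;    // the other reads from c into a; both block until the
              // transfer happens, then both continue
}
```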
In fact, Handel-C is not the be-all and end-all of languages anyway; it is one step in the transition between different ways of building hardware. One might even say it is a bit low level itself, because you have cycle-accurate control of the hardware and so on. You might not want that; you might want to just specify something like a MATLAB or Scilab program and then not care one bit about how the hardware is created. But this is still, if you like, one step above VHDL and those kinds of things, which need a knowledge of electronics. Here is another construct: prialt is a statement which is like a case statement over a series of communication events, each with a statement attached to it. So if you have four different events, say you want to read an input on channel A, read on channel B, write on D and write on C, whichever of these happens first, you pick up that event and execute the statement next to it. Now I will give you an example of a small program. void main is this: you have an unsigned integer 16 bits wide, sum, an unsigned integer 8 bits wide, data, and channels: chanin is input, chanout is output. input and output are default channels which, when you are simulating your hardware in this tool, are mapped to the keyboard and the screen respectively, which just makes it easier to simulate. sum is initialized, and then you read data from input and say sum is equal to sum plus data, while data is not equal to zero. So the first zero it finds in the input stream it takes as a terminator, and then it just spits the sum out on the output channel. It is quite a simple program, but I have introduced two concepts here. One is the variable-width word. This is something hardware people are very conscious of; those of us who normally write software do not even think about word width, because whatever we declare is short, long or the default, and it does not really affect our program much, since it all gets mapped onto a 16-bit or 32-bit or 64-bit architecture which is quite well optimized. But in hardware, when I say a word has a certain width, it is actually laid down as a register of that width, and any assignment into or out of that register requires a wire, a bus if you like, that wide. So if you say an 8-bit wide word, you are laying down an 8-bit wide bus to move values in and out of the variable. If you have a large program and most of the time you are just doing Boolean operations, say a traffic signal where a light can be on or off, then you just want to use one wire: you declare a one-bit wide variable and you will find the hardware you lay down reduces dramatically. Or you can have 8-bit or 16-bit or whatever width words, or, in the case of what we have been doing with elliptic curve encryption, a 500-bit wide word to manage your encryption arithmetic. This is the power of it: you lay down exactly as much hardware as you need. And then you just have to do one small trick, which you can see in the sketch below.
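Here is the whole summing example written out as a sketch, reconstructed from the description above, so the exact slide code may differ (in particular the widths on the chanin and chanout declarations are my assumption):

```
void main(void)
{
    unsigned int 16 sum;             // 16-bit accumulator
    unsigned int 8  data;            // 8-bit input words
    chanin  unsigned int 8  input;   // default simulator input channel (keyboard)
    chanout unsigned int 16 output;  // default simulator output channel (screen)

    sum = 0;
    do
    {
        input ? data;                // read one value from the input channel
        sum = sum + (0 @ data);      // pad data with zeros in the MSBs to match sum's width
    } while (data != 0);             // the first zero terminates the stream
    output ! sum;                    // write the total out
}
```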
That 0 @ data is the concatenation operator at work: sum and data have different widths, so you have to pad data out with zeros in the MSBs, the most significant bits, in order to do the sum. Is this small program reasonably clear? Okay. Now a slightly more ambitious program. This is a divider, where result is the integer part of a divided by b. Again we go: #define DATAWIDTH, 16 bits is the default width now. Unsigned int, DATAWIDTH wide: a, mult and result. b is DATAWIDTH times 2 minus 1 bits wide, if you like. chanin is input, chanout is output, and then: while true, do this. Here is another point. Any embedded system, unlike a typical software system, never terminates; for software you would like it to terminate at some point, but embedded systems are different. These while-true loops would typically never exist in software, but they exist in embedded systems. How many of you would like your phone to just stop working one of these days? It has an infinite loop which, as soon as you switch it on, is constantly looking at what input it is getting from the various buttons. So, non-termination. You take the operands from your input channels, you initialize b, and you say mult is equal to 1 shifted left by DATAWIDTH minus 1, so you are multiplying it up by that factor; shifting left by one is multiplying by two, as you know. You initialize result to 0, you enter the main loop which computes the result, and then you put the result onto the output channel. If you cannot follow the algorithm here, do not worry; it is a standard algorithm for divide. That is not the point I am trying to make. The point is this inner loop: while mult is not equal to zero, if this condition holds then do these two things, else do these other two things. This is the innermost core of the loop, and as you know, 99% of the time of any program is usually spent in its inner loops. So this is where most of the time goes: you have a comparison, after which one pair of statements or the other is executed. But I have been a bit clever here: I have two statements in the first limb and I have put a par block around them, and two statements in the other limb with a par block around them as well. So tell me, how many clock cycles does one iteration take? One volunteer says one. Any others? Two clock cycles? The answer is one. If I did not have the par blocks you would be right: first assignment, second assignment, two clock cycles, and the same in the other limb. But as soon as I put the par block in, both assignments execute in the same clock tick, so it takes one clock cycle. As you can see, by using this very simple construct you have doubled the speed of your hardware, not your program, your hardware. Now, if you had three statements in each of these limbs, or five, or ten, you would naturally use a bit more hardware, but you would get ten times the speed.
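To make the clock counting concrete, here is a sketch of that inner loop as I have described it; it is my reconstruction of a standard shift-and-subtract divide, not the exact slide code, and the width padding you would need between a and b in real Handel-C is elided for readability:

```
// Inner loop of the divider: each iteration takes exactly one clock tick,
// because the two assignments in each limb sit inside a par block.
while (mult != 0)
{
    if (b <= a)                       // (in real code a would be zero-padded to b's width)
        par
        {
            a = a - b;                // both of these happen in the
            result = result + mult;   // same clock tick
        }
    else
        par
        {
            b = b >> 1;               // again a single clock tick,
            mult = mult >> 1;         // not two
        }
}
```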
There is no way you could even think of this kind of optimization on a sequential processor, a Pentium and the like, where whatever code you write finally has to be executed by a single processor stepping through it. A Core Duo means you have two processors; I am saying that here you can have as many processors as you want, and you are limited only by the number of gates available to you, which means you are limited only by the amount of silicon you have. You can have parallelism to any extent you want as a designer. That is power; you cannot do that on a sequential processor. That is why, with a 10 megahertz clock on an FPGA, you can beat the hell out of a 2 gigahertz Pentium, by doing a lot of tricks like this, running things in parallel. A C program runs in a sequential paradigm on a von Neumann machine, which is basically a long sequence of things: it goes tick, tick, tick, if not then go here, if so then go there, or keep looping here, and so on; the basic paradigm is sequential. This is disturbed a bit when you go to the specialized hardware you have for, say, graphics visualization, where a lot of pipelining and parallelism is exploited to speed up your graphics through the OpenGL standard and so on, but there it is available only for graphics. This totally generalizes the concept: here you can do it for anything. So this is the source of your power, if you like. I will give you another example, of a pipeline. Here is a pipeline where you put inputs in at one end, you get outputs out at the other, and there are three stages working in parallel. So what you have is chan unsigned int undefined link[2], your internal channels: link 0 and link 1, two instances, an array of channels if you like. Then chanin unsigned int, 8-bit wide, input; 8-bit wide output as chanout; and unsigned int undefined state, where undefined means I do not want to bother specifying widths, let the compiler figure out for me how wide it should be. state is state 0, state 1 and state 2, three register locations reserved as an array. Then you say par, put these in parallel: three pieces of hardware being laid down. The first piece is: while 1, take a value from the input channel, store it in the register state[0] and write it out onto link 0. The second says: read from link 0 into state[1] and write onto link 1. And the third says: read from link 1 into state[2] and write to output. That is your pipeline. You could be passing any kind of stuff down this pipeline. It could be the stages of an encryption engine; we used this for building video codecs, where each stage does a different kind of processing; you can use it for many, many applications, signal processing and so on, where the first stage does conditioning of the signal, the next does a Fourier transform and the next does various other things. So this is basically how it works.
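Here is that three-stage pipeline written out as a sketch (reconstructed from the description; the exact declarations on the slide may differ):

```
// Three pipeline stages laid down in par, connected by an array of channels.
// Widths declared "undefined" are left for the compiler to infer.
chan unsigned int undefined link[2];   // internal links between the stages
chanin  unsigned int 8 input;          // data enters here
chanout unsigned int 8 output;         // results leave here
unsigned int undefined state[3];       // one holding register per stage

void main(void)
{
    par
    {
        while (1) { input   ? state[0]; link[0] ! state[0]; }   // stage 0
        while (1) { link[0] ? state[1]; link[1] ! state[1]; }   // stage 1
        while (1) { link[1] ? state[2]; output  ! state[2]; }   // stage 2
    }
}
```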
So essentially, when you start putting things on the input, how many clock ticks later will you get the result on the output channel? Three clock ticks later, and after that you keep getting a result at every clock tick; it is like a usual pipeline. The nice thing is that you now have three pieces of hardware working in parallel. For any value to transit through the system, it takes one clock tick from here to here, one clock tick from here to here, and one clock tick from here to here, exactly. What you are saying is that you are putting these three blocks in parallel, and all of them are acting in parallel. When the first input arrives, the first stage takes it and at the next clock tick puts it out here; until that time the downstream stages are blocked. And note what is happening: I do not need to think about how to implement this in hardware, how to effect, as you said, the blocking and the commits and so on. That is taken care of in the implementation; the compiler writer has taken the burden of building hardware away from me. He is the person who knows hardware, and he has done it. What you do is take your Handel-C program and play around with it in the DK4 environment. It is just like writing a Borland C program or something: you have a development suite, you can run simulations, you can see hotspots, the parts of the code being exercised a lot. It is just like doing software development; you can single-step and see where all the points of control are in the code, and so on. And when you are happy that the code you have created is logically correct, then you say, okay, fine, give me hardware. What it does is give you either a VHDL output or a netlist output. VHDL, as you know, can be translated into hardware; a netlist, if you like, is an abstract characterization of a circuit. It can give you both of these outputs. Then you take your FPGA tools, because every FPGA manufacturer ships tools with its devices, like the Alliance tool set from Xilinx, and those let you take your VHDL or your netlist and actually write it to a particular device. So you push it through that tool set and you get, if you like, an image, with which your hardware is programmed, or reprogrammed if you like. It is quite straightforward: you write Handel-C, simulate, simulate, simulate, press a button, you get VHDL code, you push that through the tool set and you get your hardware running on an FPGA. And the way we work with hardware design teams, if you like, is that we say, look, we are responsible only for the FPGA part of the design; the rest of it, you take care of, and this is the interface protocol between your part of the circuit and ours. So at least we know how to interface with them and how much memory is kept as a buffer in between and so on. Then we write the Handel-C code, and maybe, if we have to, a little bit of VHDL which acts as the harness between this and the rest of the circuit. But look at the wondrous thing: using this, you can bring the design time of a system right down.
We have been working like this with a number of agencies, and it works; we simulate the code and it is quite useful. So, timing: as I said, an assignment statement takes one clock cycle and everything else is free. The important thing is that there is one clock source for the entire program, and besides assignment there is another statement, called delay, which also takes one clock tick. delay is a statement the compiler writers have provided because, if you find a conflict over some resource between two parallel blocks, which can happen, you just push one block down by one clock tick by putting in a delay statement, and that solves the resource contention. And you will see it is designed so that an experienced programmer can immediately tell which instructions execute on which clock cycle. But then you have artefacts like both of these statements executing in one clock tick. There are no assembly instructions; this is translated straight into gates. But as you know, an expression gets translated into a structure, a tree of logic with values coming into it, and any logic with some depth needs some delta of time for values to propagate up the tree. Which means that, in order to respect the axiom I have told you, that every assignment takes one clock tick, I have to give the program a clock period which is long enough for the deepest expression to evaluate, plus some delta. What this means is that the speed of your clock gets fixed by the longest expression in your program, which is rather unfortunate in a way: one horrible statement slows down the entire clock of your hardware. You get that? Now you say, see, nothing is for free, it is not that great really, caught you there. But there are ways around it. Essentially we are saying that we have to clock at the rate of the longest logic depth, so you reduce the depth of the logic to speed up the design. What you do is, if you have a long statement like that, you break it up into parts, at a plus sign maybe, execute them as separate statements, and then add the results in another statement. Instead of one long statement you have three short statements, but your clock has speeded up dramatically. The development environment points out all these places to you, where you can speed up your code, and this becomes second nature once you do this kind of stuff.
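A minimal sketch of what that splitting looks like; the expression and the names are mine, purely illustrative:

```
// One deep expression forces a slow clock for the whole design:
unsigned int 8 a, b, c, d, e, f, g, h, t1, t2, y;

y = ((a * b) + (c * d)) + ((e * f) + (g * h));   // one tick, but a deep logic tree

// Split into shallower statements: more ticks for this computation, but the
// whole chip can now run with a much faster clock. The two halves can even
// be computed in par in the same tick.
par
{
    t1 = (a * b) + (c * d);
    t2 = (e * f) + (g * h);
}
y = t1 + t2;
```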
So here is how you actually build hardware. When we want to build something, the first thing we do is search the net for a program that does the right thing: open source, OpenCores, this, that and the other. When we get a program that does what we want, we start doing our engineering: we take it and translate it to Handel-C, first in raw form, with as few changes as possible, without parallelism or anything, and we see whether it compiles straight into a usable form. If it compiles and gives us the speed we want, our job is done. If it compiles but does not give the speed you want, or it creates too much hardware, then you start doing engineering on it. What this tool does is that when you compile a program it tells you how many gates it needs, how much silicon if you like, and it tells you the clock speed you are getting at the moment, how many flip-flops are used, and so on. That is the basic loop you work in: if you are within your bounds, your job is done; if not, you have got to start doing some work. The compiler gives you an approximate idea of the gate count, because the exact figure is device dependent. FPGAs come in various families: you can have plain vanilla FPGAs, or ones with a certain amount of multiplier and adder resources on them, and nowadays you have systems on a chip, with on-chip memory, on-chip signal processing, on-chip lots of things. So the whole space you are compiling to is changing a lot; FPGAs are becoming very rich. Five years ago the biggest FPGA we had in the lab was 1 million gates; now we have one with 6 million gates and you can get 10-million-gate FPGAs. With 1 million gates you can take something like a Pentium II, write it as a program and create your own processor, if you like, using this. With 6 million gates you can be a bit more ambitious. Or take a small company which has been incubated here, Powai Labs at IIT Bombay, which has built an FPGA-based engine for simulating hardware. It has many banks of 6-million-gate FPGAs, and it takes a chip design, a billion-gate design from the likes of Intel or Motorola, and actually simulates it on the FPGAs. If you have a huge chip design, which is a large netlist if you like, the engine indicates where you can cut that netlist, and the cut fragments are hosted on different FPGAs which all talk over a backplane bus. So you have a hardware simulation of hardware, which is quite exciting, and some very interesting research work has happened here as a result. They are also offering this for climate simulation and all sorts of high-bandwidth processing applications. That is Professor Madhav Desai's work here. So here is how you take your C to Handel-C. You decide how your software maps to your hardware platform and partition the algorithm between multiple FPGAs if necessary. You port the C to Handel-C and use the simulator to check correctness. You modify the code to take advantage of the extra operators you have in Handel-C, like par, and simulate again to ensure correctness. You add fine-grained parallelism through par and parallel assignments, or parallelize the algorithm itself, and simulate again to see that it still works. Then, when you are happy that the logic is correct, you start adding the hardware interfaces for the target architecture, map the input and output channels onto those interfaces, and simulate again.
And then finally you generate your VHDL code or your netlist, and you use the FPGA tools to actually put it onto the FPGA. You will be doing this whole thing this afternoon: Rajul Patkar, who is the lab in-charge, will speak after me, lead you through the DK4 tool set and let you loose on it in the afternoon. We have various FPGA kits based on the Spartan-3 FPGA and the like, so you can actually try it out. Here is a visual analogue of what I have said: algorithm to Handel-C, compile, simulate and debug; you might need to keep going round and round here. Once you are happy, you add the interfaces to external hardware, use the compiler to generate a netlist, use the FPGA tools to place and route, and program your FPGA. Simple. Essentially what we are saying is that this software approach lets us rapidly prototype applications, and Handel-C gives you a seamless route to a fast implementation. Our real problem nowadays, because the cost of silicon is falling, is the high cost of programmer time, so you need a higher level of abstraction with which to build your systems, which means software-based, high-level approaches are much more useful in solving problems. Here is a small recap of these points; I do not want to belabour them, I have done it enough now. Concurrency: here is the example I told you about, of splitting a deep statement and adding the results, so you can speed up your clock. Here is another example of concurrency; I will not belabour it. This slide illustrates the use of the delay statement: you are assigning a value into x1 in parallel with, say, assigning a value into x at the same clock tick, which creates a conflict, a contention for a resource, and if you put a delay statement here it pushes one assignment down a tick and resolves the contention. This is just a toy example, but you might have something much more critical going on in a real application, in which case you want a way to avoid this kind of situation. Actually, we have gone beyond this in a way. I have a research student working on this who asked: why should I have to do such low-level tinkering with my code? Why cannot the compiler itself figure all this out, take my inherently sequential code and parallelize the heck out of it? Why should I have to give a cycle-accurate description of my program? So he is working on another approach where you take a C program and generate hardware straight out of the C. And it is quite interesting that we are standing on the shoulders of compiler writers to do it, because what compiler writers do, in the process of translating the language into machine code, is build an abstraction, a graph-based abstraction of the logic in the program, in which a lot of the parallelism has been extracted and brought out. And then, unfortunately, because the target is a sequential processor, they map that parallelism back into sequential code. What we have done is take that abstract characterization of parallelism, do some more work on it, and map it into hardware. So we have defined a kind of intermediate language for describing hardware, where the hardware is seen as a Petri net for control and a data-flow style graph for data, with an interface that lets the two talk to each other.
And using this paradigm you can take many of these components, combine them, and map the same architecture onto various kinds of targets: you can interpret it on a processor, lay it down on an FPGA, generate VHDL code out of it, or what have you. So we thought, why go through the additional step of taking C to some other language, when there is a lot of IP sitting in C and nowadays we do not have time? Would it not be nice to take the C code itself and make hardware? I will not go into macros, but I will give you a brief description of the sorts of operators you have available. You have the usual bit-manipulation operators: x <- 2 is an operator which says take the least significant 2 bits of x, while y \\ 2 says drop the least significant 2 bits of y; so take and drop. x @ y is concatenation: take x and concatenate it with y. What will the width of the result be? The width of x plus the width of y. Then there is a bit-select operator: x[3] takes bit 3 of x, and x[3:2] is like a bus select, it takes a subset of the bits of a variable; width(x) gives you the width of the variable. If you write y[m:n], the order is MSB colon LSB. So if y is equal to 4, which in binary is 100, then y[0] is 0 and y[2] is 1, as you know. I will pull these operators together in a small sketch in a moment. Then you can have external RAM and ROM; you will get more examples of this in the next part of the class, so I will not belabour it, but this is how you address actual hardware from inside Handel-C. You can say: ram, unsigned int 8, of 8 elements, declared off-chip, so offchip = 1; the data is on these pins of the FPGA, P01, P02, P03, P04; the address is on these pins; write enable on this one; output enable here; chip select here; and so on. This is where the hardware shows through in Handel-C, but it is quite easy to manage, as you will see this afternoon. This slide gives you a complete map of all the statements that are available and what the timing of each statement is; I will leave you to look at that, because I would like to go ahead.
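Here is the promised sketch of those bit-manipulation operators; the widths and values are mine, chosen only to illustrate:

```
// Handel-C bit-manipulation operators (illustrative declarations).
unsigned int 8  x, y;
unsigned int 2  lo;
unsigned int 6  hi;
unsigned int 16 z;
unsigned int 1  b;
unsigned int 3  v;

lo = x <- 2;     // take: the least significant 2 bits of x
hi = y \\ 2;     // drop: y with its least significant 2 bits removed
z  = x @ y;      // concatenate: width(z) == width(x) + width(y), i.e. 16 here
b  = x[3];       // bit select: bit 3 of x
lo = x[3:2];     // range select: bits 3 down to 2, written MSB:LSB

v  = 4;          // binary 100
b  = v[0];       // gives 0
b  = v[2];       // gives 1
```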
Now, a RISC processor: let us build a RISC processor, and you will see it is quite an interesting exercise. RISC just means reduced instruction set; you can take liberties with the word. We want to design a processor with a 4-bit opcode, so room for 16 instructions, 4-bit I/O ports, one accumulator, program memory which is a 16-by-8 ROM, and data memory which is a 16-by-4 RAM. The problem is to execute a program stored in ROM that calculates the first few members of the Fibonacci sequence. The Fibonacci sequence, as you know, is 1, 1, 2, 3, 5 and so on, where each number is the sum of the previous two, and the mathematical characterization is fib(n) = 1 if n = 0 or n = 1, and fib(n) = fib(n-1) + fib(n-2) otherwise. Simple; that is your requirements specification. Now we have to build a processor for it in hardware. So we figure out that we can make do with these ten instructions. The first is halt, which means stop processing. load loads a value from RAM into X, your accumulator, and loadi loads a constant into X. store stores the value in the accumulator into RAM. add adds a value from RAM to X, and sub subtracts a value from RAM from X. jump is an unconditional jump to a ROM location, and jnz is jump if not zero. input reads a word from the user into the accumulator X, and output writes the value in the accumulator out to the screen, to the user. That is the instruction set you can dream up. Now we start writing the program. We define our input and output channels. We say the data width is 32, the opcode width is 4 bits, the operand width is 4 bits, the ROM address width is 4 and the RAM address width is 4. Then we define our opcodes, which we just define as numbers: halt is 0, load is 1, loadi is 2, and so on for all the opcodes. Then we have an assembler macro which takes an opcode and an operand and assembles them into a single instruction word: the opcode sits in the low 4 bits and the operand in the high 4 bits, so you shift the operand left by the opcode width and or it into the opcode. Straightforward enough. Then you write your program; this is the program that computes the Fibonacci sequence. I will not go into its details here, but you can convince yourself offline that it really does compute Fibonacci. The code fragment is: rom, unsigned int undefined, because we let the compiler figure out how wide it should be, program, and we initialize the program with these values, using the assembler macro to make up the instruction words: loadi 1, store 3, and so on, right up to here. So your program sits in ROM. Then you define the RAM for the processor: ram, unsigned int, data-width wide, data, with 1 shifted left by the RAM address width elements. How much is that? The RAM address width is 4, and one shifted left by 4 is 16, so your data memory is 16 locations, each one data-width wide. Then you have your processor registers: the program counter, which is ROM-address-width wide; the instruction register, which is opcode width plus operand width wide; and the accumulator, which is data-width wide, as you would anticipate. And there are macros to extract the opcode and operand from the instruction register: IR take opcode-width gives you the opcode, and IR drop opcode-width gives you the operand. Here is the main program. You initialize the program counter to 0, and then you have the processor loop: fetch, decode, execute, which is what any CPU does. In parallel you can do the fetch and the increment of the program counter, because they do not conflict: you put program indexed by PC into the instruction register, and you advance the PC, and you keep doing this while the opcode is not equal to halt. That is the main structure of your processor. Then you have the decode-execute part, where you actually read the instruction register and do what it asks. This is the core: you switch on the opcode. When you see load, you put into the accumulator the data value indexed by the operand. loadi means you take the operand value itself, pad it with zeros obviously, and put it into the accumulator. store puts the value in X into the data RAM at the operand address. add adds into the accumulator the value at the memory location the operand refers to, so you index data with it, and sub does the same with a subtraction. jump just changes the program counter to the value in the operand. jnz means that if the accumulator is not zero you jump, that is, you change the PC, otherwise you leave it alone. input reads a value from the input channel into the accumulator, and output puts the value of X onto the output channel. And the default case is a design decision of mine: if you see an instruction you are not expecting, one that is not among your opcodes, you just go into an infinite loop, because I do not know what to do with it and I do not want to compute.
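Pulling those pieces together, here is a skeleton of the processor in Handel-C. It is a sketch following the description above, not the slide code: I have shrunk the data width to 4 bits throughout to keep it short, abbreviated the Fibonacci program, and shown only a few of the opcode cases.

```
#define OPCODE_WIDTH  4
#define OPERAND_WIDTH 4

#define HALT  0      // a few of the ten opcodes, numbered as described
#define LOAD  1
#define LOADI 2
#define STORE 3

// Assemble an instruction word: operand in the high 4 bits, opcode in the low 4 bits.
macro expr assemble(opc, arg) = (((arg) << OPCODE_WIDTH) | (opc));

rom unsigned int 8 program[16] = { assemble(LOADI, 1), assemble(STORE, 3) /* ...rest of the Fibonacci program... */ };
ram unsigned int 4 data[16];     // 1 << RAM address width = 16 locations

unsigned int 4 pc;               // program counter
unsigned int 8 ir;               // instruction register: opcode + operand
unsigned int 4 x;                // accumulator (the full design uses a wider data word)

macro expr opcode  = (ir <- OPCODE_WIDTH);   // low 4 bits
macro expr operand = (ir \\ OPCODE_WIDTH);   // high 4 bits

void main(void)
{
    pc = 0;
    do
    {
        par { ir = program[pc]; pc = pc + 1; }   // fetch and advance the PC in one tick
        switch (opcode)                          // decode and execute
        {
            case LOAD:  x = data[operand]; break;
            case LOADI: x = operand;       break;
            case STORE: data[operand] = x; break;
            /* ... ADD, SUB, JUMP, JNZ, INPUT, OUTPUT ... */
            default:    while (1) delay;   break;   // unknown opcode: lock up deliberately
        }
    } while (opcode != HALT);
}
```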
So that is your program, and you are not meant to read the slide in detail, but this, if you like, is your RISC processor: in one page you have defined a RISC processor. Now, if you start playing around with this code a little, you will come up with an 8051; play around a little more, do some more clever tricks, and you will come up with a Pentium. Basically this is a starting point. We have had student projects on this course where people took this language and created an 8051 core, which we have run on the FPGA, and then wrote little programs which run on that core. I think it is pretty elegant that you can do this kind of thing, and this is hardware. You can write pipelined processors, you can create as many processor cores as you like, your own core duos, and you can write a little RTOS to put onto the FPGA: a little scheduler which takes your program, does some scheduling, and puts it onto whichever core you want. So this is hardware, and I have not needed to speak about flip-flops and gates and buses and so on. And the wonderful thing is that hardware means hardware speeds, which is all we really want as software people: we want things to go fast while taking as few resources as possible. Let me give you another example. Say you want to write a small video-processing security application, where a camera is looking at some view and, as soon as something of a certain size moves in the room, it triggers an alarm. If one had done this in the old days, even on a Pentium processor, it would have been quite difficult, because just look at the amount of work going on: even with a 1000 by 1000 grayscale image, that is about a million bytes per image, which at 8 bits per byte is 8 million bits, times at least 25 frames per second, which means 200 million bits per second coming onto the processor. One or two gigahertz is all very well, but that is a fairly high data rate, and nowadays you have even higher resolution images; you can forget about even a 2 or 4 gigahertz Pentium. So when you are dealing with this volume of data, in the old days you could only build hardware, and hardware meant getting into VHDL, taking a very long time, designing boards and so on. Whereas here, an undergraduate student can come in, learn Handel-C in one afternoon, write an application and deploy it on the prototyping board that we have.
In fact, this is something we are planning to build into a robot, the lab-in-a-box, so you can do these kinds of tricks on it too: you can have little robots for which you write the vision algorithms in Handel-C and put them on the FPGA, so you can turn your small 1000-rupee video camera into a visual sensor which you can do interesting things with. We find this pretty exciting. There is a simulator built into the compiler here. This is a graphics pipeline; I will give you an instance of where I have seen this used. Some friends at Imperial College in London had students build a 3D visualization engine based on OpenGL: they took the OpenGL pipeline and ported it onto an FPGA using Handel-C. What they found is that when a 3D model is far away from the eye, a lot of work goes into hidden surface elimination and so on, and when it is close to the eye, a lot of work goes into ray-tracing-style computation to get the pixels onto the screen. So they said, would it not be nice if we could morph this hardware into different things depending on where the model is: when it goes far away, use more hardware for hidden surface elimination; when it comes near, use more hardware for ray tracing. Based on that they built this kind of engine, where you have a geometry subsystem and rasterization, with transformation and lighting algorithms here, and then blocks for texture mapping, shading, hidden surface elimination and all that, and you can throw new hardware onto the FPGA as your requirements change. And the cost of throwing new hardware onto the FPGA, the reprogramming time, is of the order of microseconds: all you need to do is stop the FPGA, load some bits from RAM, and start it going again. This makes a lot of sense, because now you can think of a big piece of hardware which is not used all the time, and keep reusing the FPGA depending on where in the program you are. So I think there is a lot of scope for innovation now which never used to exist. In fact, when the head of research at Intel came here a couple of years ago, he mentioned that Intel is also thinking of having an FPGA resource on the processor itself, on the chip itself. At the moment they have the Core Duo, which is just two processors, so you can scale up by two, while having an FPGA resource means that any specialized algorithm you want, you can offload onto the FPGA. And if it is on-chip, that is great, because having it off-chip is very expensive. I will give you an example. We tried to do some work on taking a scheduler and putting it onto an FPGA a few years ago. What had happened is that Professor Krithi's team at UMass took a scheduler and put it into VLSI. It was a dynamic scheduler, called Spring, and it was doing all sorts of heuristics on the basis of which it would schedule real-time processes. They felt they were limited by the time it takes to compute the schedule based on all these heuristics, so they asked: what if we put that into VLSI? They did a lot of work and came up with a chip design, so that your processor would have a way to offload computing the schedule onto a specialized chip, get back the schedule, and dispatch the tasks.
All of this was great, because as a result the scheduling time was almost zero; it was of the order of nanoseconds, thanks to the specialized chip. But they got hit very badly by the DMA needed to ship the data out to the other chip through memory and get it back again; that was so expensive that it defeated the entire purpose of having a specialized chip do the work. Now, with the new option of having an FPGA core on the main die of the processor itself, the offloading cost is nil. So I am saying that this whole space is getting very exciting: new ways of thinking are becoming possible which were never available before. Plus, as FPGAs become bigger and bigger you can put larger and larger problems onto them, and they are becoming cheaper, because FPGAs are based on memory technology, SRAM technology or what have you. So what kind of problems should you put on this kind of device? I will give you an example which is a few years old. For a geometric visualization problem, software on a PC with a clock rate of about 400 megahertz, an old Pentium, gave a frame rate of 24 and cost about 1,000 dollars. A Xilinx XCV1000, which was a 1-million-gate FPGA at the time, had a clock rate of 40 and would give you a higher frame rate, because of the parallelism it could exploit, at 4,000 dollars. An NVIDIA part, a specialized chipset for graphics visualization, would give you a clock rate of 170 and a frame rate of 55, higher still, at just 200 dollars, because it is a specialized chipset. So the moral of the story is that for things you can buy off the shelf, you do not want to use an FPGA; it is a more expensive way of doing things. But consider this problem, a defense application, where you have a defense infrared camera capturing information and you want to process infrared images. Then you do not have specialized hardware available. Software on a PC gives you a frame rate of 96 at a clock rate of 400 and a cost of 1,000 dollars. The Xilinx FPGA, clock rate 40, gives a frame rate of 330 at 4,000 dollars. And an SGI Onyx, the Silicon Graphics Onyx Reality Engine, would give you a clock rate of 180 and a frame rate of 2,750, at close to 200,000 dollars. So this shows you where FPGAs start to become useful: when you have boutique algorithms, specialized processing, or the need to build a very fast prototype. Typically you find these in radar applications, in specialized encryption engines, in anything which has not yet become a commodity item, where you cannot crack the problem just by picking up a DSP chip and doing the job with that; FPGAs are very good where you want to do prototyping work. And increasingly what I am finding, especially with the defense requirements now: the government has mandated that 30% of defense spending has to come from Indian contractors, and the defense budget in India is about 1 lakh crore. Even if a Raytheon or someone wins the contract, that 30% means something like 30,000 crores worth of projects are going to be executed in India, and a lot of that is going to be electronics: defense-related applications, radar, all sorts of interesting detection problems where you want to consider FPGAs and things like this. So the FPGA is a low-cost platform for custom graphics.
And the development time of a customized FPGA renderer is comparable to that of optimized software, so it is quite effective to use a reconfigurable platform: good for designs where an ASIC is not available or is too expensive, as I said, and good for exploring the algorithms and architectures you may eventually want in an ASIC. So this is a hardware design; this was the 1-million-gate Virtex board that we use in the lab, and we have a 6-million-gate board as well; let me see if there is a picture of it. Essentially you will see we just have a PC with this board sitting on the PCI bus, and it has 4 banks of RAM of 8 megabytes each, and you have total control over the board through software. You do not even have to open the PC: you can put any IP on it, ship data from the processor onto the FPGA, get the data back, and so on. It is a wonderful prototyping board, and the nice thing is that you do not have to tinker with hardware, yet you get hardware speeds out of your software-level description. So the whole perception from which we are coming at this course is essentially that the cost of silicon is falling, and that is bringing a real change in our thinking. Products are getting very, very complex, time to market is becoming very short, and there is a shortage of trained microelectronics engineers, which means the cost of programmer time is a major constraint, as you all know. So software-based approaches become more and more attractive; that is where we are pitching this course, and this new generation of languages lets us build these systems at a high level of abstraction, as I have been saying. Essentially this means we can start with the naive attitude that, fine, you do not need to know any hardware; as you progress you will find it useful to start learning a bit more of the details of the hardware, and you will do that, but the bulk of the work you actually do will be with high-level tools. And if you are thinking to yourself, where do I position myself in my career in the future and what skills should I pick up, you should be looking at developing algorithmic skills and domain-specific skills, by anticipating where the markets are going. One sector which is growing at a fantastic speed is the automotive sector, components as well as the automobile space itself; as I said, something like 40% of the cost of a modern automobile is in the electronics, and there I see mostly software. There are lots of opportunities there, especially because foreign companies are opening research labs here and moving development activity here, and much of this work is software design work. A lot of defense applications are coming here, and there are lots of embedded systems in health and medical electronics; all this work is happening in India, and when you look around you there is very little that is not embedded. One major application area is mobile phones: there is so much work going on in the mobile phone space, with indigenous development happening not only in the core IP of the phones, the network protocol stacks and so on, but even in applications.
What is happening is that the mobile phone is increasingly replacing the PC in many applications, because it is a very low-cost entry into the computation space. So these are the kinds of places one can be looking at in the future, and these are the kinds of skills one will increasingly need: MATLAB and Simulink, higher levels of programming, SystemC, all these kinds of things. Any questions?