My name is Biris Hed. In my daily life I do a lot of embedded systems programming, which means I get confronted with all kinds of strange architectures, architectures different from the ones people are generally used to. That is why portability, the ability to get code to run correctly on more than one platform, is important to me. The talk will basically consist of two parts. The first part is about practical things: common problems many of us have seen in real programs, topics which concern all programming work in Debian. I will talk about those common problems and give tips on how to solve them. Then I will go to a more abstract level and look at how a system is conceived in general, what the bottlenecks are, and how people try to solve them, because that also explains why the hardware behaves the way it does. I'm not sure how far we'll get in the second part; the first part is probably the most useful for most people, so I basically want to be sure we cover that, see how far we get, and perhaps cut the second part short. Why is portability important? One reason for me is correctness: I think we should try to get our packages as good as possible, and that also means writing code which conforms to the standards. It's mostly not about the letter of the standard; it is about not assuming things which are not guaranteed by the standard. Then, in terms of the universal operating system: I don't think you can really claim that title by running on only one architecture. And here is something I'm quite sure most people are not aware of: there are these embedded devices out there running Debian, which arguably makes Debian the most used embedded distribution.
Now, I think we should take that with a grain of salt, in the sense that it's probably "most used" as a distribution you can take and build your own embedded system from; I'm quite sure most embedded systems out there are not actually running Debian. But it is still a base that a lot of people build their work on, so I think it's important that we care about this. And "most used" actually means it's bigger than any of the commercial embedded distributions which are available. [Audience: among Linux distributions?] Among Linux distributions, yes; it's only within the Linux space. Also, hardware advances may make Debian feasible on new platforms. Consider that we now have tiny hard disks, with even bigger ones coming up, so mobile phones and other portable devices will come with real storage. They already come with 64, 128 and more megabytes of RAM and processors running at something like 600 megahertz nowadays. It's getting to the point where you can run a full Debian machine on them, and yes, you can; the restrictions are moving quite fast, so it may become feasible on more and more portable devices and nice toys. I think that's an important thing to keep in mind: it's interesting to play with machines which were conceived by people who had something quite different in mind. And I think every Debian developer should actually be concerned with portability, in the sense that he should be aware of these problems, because it's not something a single person or a small team can do for all packages: it sometimes requires intricate knowledge of the package to actually solve. Take the case of a package implementing a virtual machine for some language: if there are portability variations buried in that code, it can turn out to be very complicated to fix if you don't know the internals.
It matters how many people know how, say, a virtual machine stores its data internally; that kind of knowledge is what it takes to fix this stuff. And if everyone just takes care of the packages on their own plate, portability gets handled for most packages, and the overall quality of the code improves, even for platforms on which it apparently already works. Now, on to some practical things: C types. There are a number of things the C standard guarantees you about the sizes of the basic types. They are ordered: char is never bigger than short, short is never bigger than int, and so on up to long. You can rely on short and int being at least 16 bits, and on Unix machines you can in practice assume int is at least 32 bits. long is at least 32 bits, but on 64-bit platforms long is in general 64 bits. One big trap is pointers: pointers are not necessarily the same size as int. This is typically visible on all 64-bit machines: pointers are 64 bits, but int is not the same size. Now, we are lucky that GCC 4 warns about some of this stuff; I think more and more of these bugs will become visible. Another trap: the signedness of char is implementation-dependent. If you define a plain char variable, you don't actually know whether you get a signed or an unsigned char. This can be important if you abuse char for things it was never meant for. If you use a char as a loop variable, which I have seen multiple times, and you write something like "I'm going to be efficient, I'll count the loop down: c = 10, while c is bigger than or equal to 0, do the loop", then if your char is unsigned it will never go below 0, so you have effectively created an infinite loop. And if that loop variable is used as an index into an array, then yes, you will overwrite memory at some point and get segfaults. So, a few tips. First: always use int.
Use int unless you have a specific requirement, like: it must be a floating-point value, or it is a value I have to send to some other device or over the network, and that device or packet format requires me to send entities of a certain size. Otherwise, use int. The reason is that compilers are optimized to make operations on int as fast as possible; int is basically the native type of the CPU. So if you have a loop variable, don't think you're being clever by using a char to save three bytes, because in most cases you actually don't save anything, and there are a number of cases where the code will be a lot slower, because the CPU has to execute extra instructions to do something with 16-bit or 8-bit entities. If you see this happening in a package, just replace the char by int; that will work in most cases, and complain upstream that they shouldn't do this. Now, there are obviously cases where you have to use specific sizes, for example if you have to implement a binary protocol on top of TCP or UDP, and the protocol specification says this field has to be 8 bits, that field has to be 16 bits, and so on. Then you have to comply. Luckily, since GCC has had C99 support for quite a while now, we have standard types which guarantee sizes: this one is an unsigned integer of exactly 8 bits, that one of exactly 16 bits. Use those as much as possible. The implementation will make sure that when you write uint8_t you actually get an 8-bit entity and not something else, so you don't have to worry about that piece of portability anymore; that is now the library's problem. And like I said before, don't use char just to save memory.
I've seen people do that multiple times, and it generally doesn't work: if you actually start measuring, you will quickly notice you are not saving memory, because you sometimes need more instructions to do operations on those small types. It's okay if you use char in a string, but not as a standalone variable. Next tip: use the latest GCC version to compile your package, even if it's perhaps not yet the default in the archive. It's worthwhile to just compile and look at the warnings and possible errors so you can fix them. One reason is that new GCC versions tend to find more bugs; they warn about things which are questionable: you assign an unsigned value to a signed one, that's not a good idea; or you assign a pointer to an int, be aware that the sizes might be different. An older GCC version doesn't warn about that, and if you don't compile with a new one, you won't notice; most people don't. Nowadays the compilers are getting better, and obviously it's useful to compile your program with the latest version even if it's not yet the default, because at some point the default compiler will be upgraded anyway, and if you fix your build failures before that, you will probably save a lot of work for other people. Then, be aware of bitfields. Some people use bitfields; the idea is that you can have an entity which is smaller than a single byte. You can say: this storage unit is 8 bits and I have two fields in it, bitfield A is 3 bits and bitfield B is 5 bits. Now, most people assume a particular layout in memory. They assume that the three lowest bits will be bitfield A, so that if you write something to bitfield A, it will actually change the three lower bits, and if you access bitfield B, it will touch the five upper bits.
Unfortunately, none of that is guaranteed by the C specification. Intel does it one way, and other platforms do it the other way; I'll get to that in a second. [Audience: shouldn't the second line be 3 and 5?] No, the number is the number of bits; the other part is just the name of the field, bitfield A and bitfield B here. The declaration order doesn't tell you where the bits start. And there is a third, slightly confusing thing: it would actually have been better if the standard had simply fixed the layout, either way, but no choice was made, so it's left to the implementation. The point is that Intel, for example, puts the first field in the lower bits, while PowerPC does the opposite. So if you rely on the field being in the lower bits, it will completely fall apart when you move to the other platform. And the nasty part is that if this is, say, a structure describing a network packet, the problem won't show up at compile time. It compiles fine, you have a working program, and suddenly the other side doesn't understand what you're saying; the other side just sees garbage and throws the packet away, because you set the wrong bits. Then you sit there with network traces and the source, the other end telling you "no valid packets", until you realize what happened. Endianness is another, probably better-known issue. Consider a 32-bit value whose bytes are 12 34 56 78: if you store it on a little-endian machine, memory will contain first the byte 78, then 56, and so on, in reversed order, while a big-endian machine stores 12 34 56 78 in that order. Which order is better is an old religious war; the point is simply that both exist.
[Audience exchange about which order is which.] The exact details don't matter so much here; the point is that on a little-endian machine the bytes in memory are reversed with respect to what you might naively expect. And this happens not only in C or C++ programs: every program which has some sort of external interface is affected. By external interface I mean it writes files, it sends network packets, it talks to USB devices; every time you send data to some outside entity which is not your own process, endianness potentially comes into play. If your external format is plain text or XML or something like that, this doesn't concern you, but in all binary cases it can be a problem, even if you're writing in Java or whatever. So the trick is: never directly send your internal data structures to the outside world. Always have functions, or at least macros, doing the conversion. You say: I want to build a packet in memory, and I want to put this 32-bit entity into it as little-endian, whatever CPU I'm on. Use macros which convert between CPU byte order and a fixed byte order; how they are defined depends on which platform you are on, but the interface is the same everywhere. The Linux kernel, for example, has cpu_to_le16, cpu_to_le32, cpu_to_le64 and friends: you just write "field = cpu_to_le32(value)" and it will always give you the value as little-endian as a result, whatever the CPU is. The same exists for big-endian. Alignment is the next problem. It is one most x86 people never see, because on x86 you only notice it in special cases, but most RISC processors require aligned accesses, or are very much slower if you don't use aligned accesses.
What I mean by an aligned access: if you have a four-byte entity, you always have to access it on a memory boundary of four bytes. So an aligned four-byte access would be at address 0, 4, 8, 12, 16 and so on. If you access a four-byte entity at address 3, you have an unaligned access, and the same goes for eight-byte entities: an eight-byte access at address 5, for example, is unaligned. The reason comes from how memory is accessed internally: the path to memory is four or eight bytes wide, or in some cases wider, and the lowest address bits simply don't go to the memory; the hardware looks only at the more significant bits and ignores the lower two or three. Which means that for an aligned access, the CPU can put the address on the bus in a single cycle and get the aligned data back. For an unaligned access it can't do that anymore, because the boundaries don't match, so it would actually have to do two accesses and combine the right bytes into the value you want. That is why on most such architectures unaligned accesses trap: you get an exception, which the kernel can catch and fix up, doing the whole work of performing two accesses and combining the right bytes. Unfortunately this is obviously slow, because you have the trap plus the two accesses, costing who knows how many cycles, and it is also potentially non-atomic, because it is no longer a single access. And it doesn't work in kernel code: in the kernel you're not allowed to do that sort of unaligned access at all, but that's not so much of an issue for most people. So the best thing to do is: don't do it. That's not always easy, because you can easily get packets from the network which have four-byte entities at unaligned offsets.
TCP/IP itself is probably the most famous source of potentially unaligned fields. The best way to cope with this is to copy the data into a proper variable, with something like a memcpy of the right size. Don't worry too much about the cost of copying: in the worst case the compiler will generate byte fetches, which looks strange when you read the assembly, but in many cases, if you do it right, the compiler can actually see what is going on, generate smart code, and turn it into a single access where that is safe. That doesn't always work, though: if you start playing with casts early on, the compiler can no longer know what the alignment is, so the access might still turn out unaligned. Memory organization is not so much of a problem, except if you're into writing things like stack-scanning garbage collectors or virtual machines. The point is that your program stack does not necessarily grow downwards: I think HP PA-RISC is the major machine which has an upward-growing stack. So if you think "I can just scan the stack by taking my current stack pointer and walking up", then no, you're wrong. It's similarly tricky if you play games with variable argument lists. If you think "I know where the next argument will be, because the previous argument was four bytes, so I just take its address and step past it", it won't work, because on some machines most arguments are kept in registers, at least the first four, or six, or eight, and only after that do they spill to the stack. Your stack pointer might point past all of them. If you have three arguments, you can't just take the address of the second argument and find the next one, because it might be in a register and not in memory at all.
Which is why, if you try to play these games, you will break on a lot of machines; use the standard varargs macros instead, anything else just doesn't work. Now, the next part; am I still within time? The next part is more abstract, in the sense that it will try to explain why we have caches, why we have latency, why we have all these problems with DMA and cache-coherency issues. They are quite annoying, and you might start wondering why we have all these features if they are so annoying, but there are very good reasons, which I will try to explain. To do that, I first have to draw a very basic and schematic diagram of how a system might look. It's not exactly how your machine really looks, but it's a starting point. On a lot of these slides you have two processors; there can be one, or obviously more than two, but I will generally just say CPU, even when it can stand for more than one processor; if I really want to talk about a specific number, I will tell you. The point is that all these processors have the same view of main memory: if CPU one reads address zero, it sees the same value as CPU two would. [Audience: but that's the model from the sixties.] Yes, and this diagram no longer holds for AMD64; I will get to that in a minute. So this is the classical picture: processors on a shared bus which goes to the northbridge, which connects to the memory on one side and to the peripheral bus on the other, generally PCI, or nowadays PCI Express. Now, this is only one way to do it. The AMD64 systems, for example, don't do it that way, and I will explain in a while why, because there is a very good reason why they changed this architecture: they moved the memory controller into the processor. Every processor in an AMD64 system has its own memory subsystem, its own memory controller.
I'm not sure whether they have multiple channels per controller these days, but in any case they have a memory controller per processor. Can you still access the memory attached to the other processor? Yes, you can, because there are links between the processors: if you access a memory location which lives on the other processor, your processor sends a message over the link and asks the other processor to fetch the data and send it back. So it is really a NUMA system, a non-uniform memory architecture, the kind of thing that was used up to now in big SGI and Digital systems; that's what AMD does. And they don't really have the classical northbridge anymore: the processors have HyperTransport links, and there is a bridge from HyperTransport to, I think, PCI Express for the outside world. The remaining limitation is that you still have one common interface for all the buses and other peripherals in the computer, even though every processor has its own memory. [Audience: and they presumably don't all access PCI directly? I would have expected every processor to also have its own PCI.] No, each has only its own memory; PCI and the other peripherals are shared. [Audience: I thought on AMD64 the northbridge is basically a HyperTransport-to-PCI-Express bridge.] Yes, but you have a HyperTransport link from every processor; all processors link to the one bridge which talks PCI. The second thing is the peripheral side: you might think this is a bus, but actually the devices are connected to the bridge by individual point-to-point links, and the bridge acts as a switch; SGI did this in the XIO architecture, for example.
So basically you have a link from the bridge to peripheral one, a link to peripheral two, a link to peripheral three, and if they want to talk to each other they go via the bridge, which also does the switching. So it's really more like a network than a bus, a very fast one, but still more like a network than an actual bus system. Okay, why do we have all these complicated things? One of the main reasons is that processors became much faster than memory. You probably recall the 486 times, when the processor we had then seemed blazingly fast; within a few years they were ten times faster, and now we have Pentium 4 and AMD processors at around 3.2 gigahertz or so. Memory, on the other hand: in the 486 days the memory could more or less keep up with the processor. Now we have double-data-rate memory at 400 megahertz, and you would think that's reasonably fast. Well, there's one catch: even 400 megahertz compared to 3.2 gigahertz is a factor of eight. That would mean, even if you could launch a new request to the memory controller every memory cycle, which you can't, that without a cache every instruction would take something like ten cycles just to fetch. That's obviously not very useful on a multi-gigahertz processor. The real problem with memory is that the latency didn't improve as much. Bandwidth got a lot better, but the time you need to complete a single memory access hasn't improved that much: it still takes something like 20 or 40 CPU cycles, or more, just to get the access started.
Once the access is started, the data streams out at the full 400 megahertz, so the memory gives you a nice burst rate; it's the start of the access that is very slow. The conclusion is obvious: single transfers are expensive to the point of being useless. Don't ever do single-word transfers if you can help it, because the setup costs of the first transfer are so high that it only pays off if you manage to transfer a lot of data in one go. And how do you do that? For the processor you use caches: the CPU prefetches whole cache lines, say 16 or 32 bytes at a time, and you hope you can actually use the rest of what you fetched; you can't do better than that from the CPU side. For I/O it is generally useful to use DMA. Why? Because with DMA you program the controller once: "transfer 8 kilobytes from this memory to that device". The DMA engine knows it has 8 kilobytes to move, so it can issue large bursts, 64 bytes at a time, making good use of the memory. You should also be aware that this latency issue applies not only to memory but also to PCI: a single PCI I/O access costs you thousands of CPU cycles, because you have to go through the bridge, across the PCI bus to the device, and all the way back, and all that time the processor is stalled; it's a really expensive operation. [Audience: but it's a single instruction.] It's a single instruction, yes, but the processor can do nothing else while it waits. Well, with hyperthreading the other thread can keep executing, that's a bonus, but the thread doing the I/O is stuck; the hardware just sees that you're waiting for I/O and uses the other thread to go faster, which only helps if you have one.
So the basic rule for I/O is: avoid programmed I/O as much as possible. Put your data in packets in memory, describe the whole protocol and everything you want to do there, and let the device do the work via DMA, because then your processor can do much more useful stuff while the device is busy. I think I'm almost out of time, so: "don't use programmed I/O" is one of the most important points here. One of the issues, obviously, once you start using DMA, is coherency: you have all these entities looking at memory, the processor with its caches and the devices, and you have to be sure they all see the same thing. Caches copy data, and if someone else changes the underlying memory, you have to be sure the copy is still up to date. Some systems handle this in hardware; most of the larger systems do, and the smaller systems are starting to, but some don't, which means the operating system actually has to realize: "wait, I'm going to DMA into this memory region, so I should tell the processor to forget everything it has cached from that part of memory, because it's about to be overwritten by somebody the processor doesn't know about". There are all kinds of strategies to cope with this problem, but in the end it turns out that if you want to use DMA, it's rather complicated, and the only way to handle it properly is to have properly coordinated access to devices. Which means that poking at devices directly from user space on Unix is really asking for trouble: you can't rely on things staying coherent unless all accesses are coordinated in one place which always does the right thing. Okay, I think that's enough. No questions? Okay, then on to the next talk.