We can get real benefits from vectorization of code. Why haven't we used it so far? I'm going to claim that it's one of the best things the PowerPC has. Why hasn't it been used more? If it can be brought into Linux, what are the ways to do it? What is Debian's role in this effort? I mean, OK, this is DebConf and I'm giving a presentation here, so there must be a connection. How can we use Debian, and how can Debian benefit from it? And some code samples, actually; I'm going to skip that part of the presentation and show you the code in the benchmark directory.

So, basically, that was the introduction. Is there anyone who doesn't know what AltiVec is, or what SIMD is? OK, I'll say a little about it. Basically, SIMD units don't really enable you to do things faster; they enable you to do more things in the same amount of time. They process more data with the same instruction. SIMD stands for Single Instruction, Multiple Data. In theory it's the same kind of unit as MMX, SSE2, or 3DNow!: they don't make the processor run faster, but they enable the processor to process more data in the same amount of time.

AltiVec is widely acknowledged to be the most complete SIMD implementation out there, and it is consistent. It enables the programmer to write code, actual code, in C, and not just in the assembly language needed by other SIMD units like SSE or MMX. One of the nice features that I really like, and I insist on actually stressing this point: compared to the other SIMD implementations, AltiVec gives three operands to every instruction, and the result is kept in another register. This means you basically don't need to reuse registers all the time as you do in the other implementations. For example, to do an addition in MMX or SSE, you have to place the result of the addition in one of the operand registers, which means you lose the first operand and have to reload it the next time.

So where can it be used? Power consumption is something of ever more importance the further we go into the embedded market. Personally, I would like to see the G4, well, not really the G5, but the G4, go more into embedded markets, because it has comparatively low consumption: just about 10 to 15 watts. To find a similar-performance CPU that offers a vector unit like AltiVec, you would have to go to a Pentium 4 or an AMD64 CPU, which is just out of the question for the embedded market. And there are cases where it is not really enough to just add more processing power in terms of higher clock speeds; you have to find another way to do it. For example, you cannot really depend on higher clock speeds to do encoding, video encoding, especially real-time video encoding; otherwise you have to do it with custom hardware.

Security is one of the areas where you can use vectorization. I have written a paper about this, not exactly this talk, but the whole security area: it's possible to vectorize some hashing algorithms and some encryption algorithms, as others have shown before me, with quite good performance. I'm going to give some numbers afterwards. Codecs and linear algebra computations are among Apple's favorite benchmarks to show AltiVec's power. And what is not really apparent is that we can use AltiVec for generic computing. I mean, mostly AltiVec, and for that matter all the other SIMD implementations, have been used for video and similar applications.
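To make the three-operand point concrete, here is a minimal sketch using GCC's AltiVec C extensions; the variable names are mine, not from the talk:

    #include <altivec.h>

    void three_operand_demo(void)
    {
        vector signed int va = (vector signed int){1, 2, 3, 4};
        vector signed int vb = (vector signed int){10, 20, 30, 40};

        /* AltiVec: the result goes to a third register; va and vb stay
           intact and can be reused immediately. */
        vector signed int vc = vec_add(va, vb);

        /* In MMX/SSE-style two-operand code this would instead be
           "va = va + vb", clobbering va and forcing a reload later. */
        vector signed int vd = vec_sub(va, vb);  /* va is still available */
        (void)vc; (void)vd;
    }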
They have not really been used for generic computing. It's not obvious that a vector unit can be used for generic computing, but it can, as we will show. So this is a small, nice illustration I found on Ars Technica. It basically explains how the SIMD approach works. The first method is scalar code: it just processes one unit at a time. You will immediately recognize that this is a simple for loop: you have an array of N objects, you just loop over every object and process it. This is the SIMD approach: you do the same operation, but on multiple sets of data at once.

And this is one way to do it. Right, is there a pointer or a stick or something I can use? Basically, you have two arrays, A and B, of size N, and you want to put the result of the addition in a third array. So, using plain C, what do you do? You just loop over the elements and add them. Now, to do that with any other SIMD implementation, you would have to meddle with assembly code. But with AltiVec, it's not really necessary; actually, it's not even encouraged, unless you're after 100% of the performance. You just write the equivalent C code using the AltiVec extensions, which are included in GCC courtesy of Apple and Freescale. So, just as before, you have two arrays, A and B, and you want to put the result in the result vector. You just use the corresponding intrinsic, vec_add, on a vector of A and a vector of B, and you loop over the elements. But there's a difference: the first loop does N iterations; the second loop does N over 4 iterations. That's when you're dealing with 32-bit integers. If you're dealing with 8-bit integers, with characters, that number is 16: you process 16 bytes at a time, which, as you will see, offers great performance.

Right, why hasn't it been used so far? It is actually used. In Mac OS X, Apple has done quite a good job of writing specially optimized routines. But the thing is that it needs quite extensive knowledge of the subject to really benefit from it in open source software. There are some projects that have really good implementations of very specific routines: MPlayer does, FFmpeg does, MEncoder has been optimized too, giving really good performance. But these are all specific applications.

Last September, I think it was, I watched the talk by Sergio Larkin at the Freescale SNDF, yeah, SNDF, right, in Frankfurt. And I noticed how easy it is to actually program using AltiVec. This was the initial kickstart, so to speak, that got me working on it. And I was really surprised by how quickly I could get fast performance out of just a small CPU like the G4, which by today's standards is slow. But it's not slow if you actually learn how to program it the proper way. So I raised the question there: why can't we use AltiVec to enable the whole system, the whole operating system, to benefit from it? Right now, AltiVec is underused in Linux. It's practically unused, except in very specific applications. Why not benefit? And I asked Genesi, the company that produces the Pegasos, and Freescale, and they decided to fund this effort. So far I have quite a big list of applications and libraries that can be AltiVec-optimized.

I'm going to start with glibc. Basically, I've released a beta version of a library of AltiVec-optimized common routines: memcpy, strlen, swab, block moves, and so on.
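As a minimal sketch of the two loops just described, assuming GCC's AltiVec extensions and, for simplicity, that N is a multiple of 4 and the arrays are 16-byte aligned:

    #include <altivec.h>

    #define N 1024

    int a[N]      __attribute__((aligned(16)));
    int b[N]      __attribute__((aligned(16)));
    int result[N] __attribute__((aligned(16)));

    /* Scalar version: N iterations, one element per iteration. */
    void add_scalar(void)
    {
        for (int i = 0; i < N; i++)
            result[i] = a[i] + b[i];
    }

    /* AltiVec version: N/4 iterations, four 32-bit ints per iteration. */
    void add_altivec(void)
    {
        vector signed int *va = (vector signed int *)a;
        vector signed int *vb = (vector signed int *)b;
        vector signed int *vr = (vector signed int *)result;

        for (int i = 0; i < N / 4; i++)
            vr[i] = vec_add(va[i], vb[i]);
    }

With characters instead of 32-bit integers, the same loop would use vector unsigned char and run N/16 iterations.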
You are going to see the results soon. This library is eventually going to be integrated, well, hopefully at least, if we can convince the glibc maintainers to accept our patches, into glibc, which means that every PowerPC G4 and G5 user will directly benefit from it, just by updating their packages. I have also done the Adler-32 hashing algorithm, which is used in deflate encoding and in zlib, the Berkeley DB hashing functions, and sorting functions, as you will see: insertion sort and merge sort. These two, the first two, are already finished, and I'm now working on quicksort; the results are really impressive, as you will see soon. And on the to-do list, this is just a small list of what we could do. Basically, I encourage anyone who has some software they think would benefit from AltiVec to just send me an email or tell me on IRC. And eventually, maybe not me, but maybe someone else, will work on it.

So how can Debian help? I mean, initially this is not really Debian's issue; it's really more an upstream thing. But I found that Debian enabled me to get to know the proper people responsible for each task. For example, Debian has very knowledgeable glibc people, and they were very friendly. They basically told me: yes, if you can write something like that, and if we manage to resolve the licensing issues, which are really no problem, because there is no company involved that holds copyrights or trademarks, then it's quite easy to just transfer the copyright, transfer the IP, to the FSF and include the patches in glibc itself. Also, Debian, I think, in my opinion, has one of the strongest PowerPC communities. They will directly give feedback about these optimizations: whether they actually work, whether they are buggy, anything. And yeah, it will also make Debian the de facto distribution for PowerPC. Of course, since the patches are going to go upstream, eventually they will be found in other distributions as well. But yeah, I'm OK with Gentoo growing popular; I mostly care about Debian.

Right, so these are just a few numbers; I'm going to show you the actual results running on the system. memcpy is almost four times faster. How is that possible? The AltiVec unit has four times the bandwidth of the integer unit to the cache on the CPU, not to the actual memory bus. The actual memory bus is still 32-bit.

Yes? In this example, what's the size? The size of the buffer, right? Yeah. Basically, you are restricted by the cache size. Because, my question is, the processor has a cache, so if the cache is spilled or has insufficient data, you need to bring data from memory, so whether it's four times faster or n times faster depends on the copy size. Exactly, yeah. Can you show the source code? Yes, yes, I will show it later. memcpy, yes.

I should clarify this a little bit more. About the memory copying, there was actually a lot of debate about this in some IRC channels. Basically, we try to benefit from AltiVec's prefetching, well, cache-fetching mechanisms. Before we try to copy the actual buffers, we prefetch: while copying one chunk, the previous command has already prefetched the next buffer to be copied. This gives us a substantial benefit. Even for cold caches you benefit from it, because by the time you try to copy the buffer, the small chunk of data, you already have it in the cache.
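A minimal sketch of the prefetch idea with AltiVec's dst (data stream touch) instruction; the control-word macro and the block layout are my own illustration of the encoding, not the actual glibc patch, and it assumes 16-byte-aligned pointers and a size that is a multiple of 16:

    #include <altivec.h>
    #include <stddef.h>

    /* Build a dst control word: block size (in 16-byte vectors),
       block count, and stride, as the dst instruction expects. */
    #define DST_CTRL(size, count, stride) \
        (((size) << 24) | ((count) << 16) | ((stride) & 0xffff))

    void memcpy_prefetch(unsigned char *dst, const unsigned char *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            /* Touch the next chunk so it is already in cache by the time
               we copy it; the dst hint never faults, so touching past the
               end of the buffer is harmless. */
            vec_dst(src + i + 16, DST_CTRL(1, 1, 16), 0);

            vector unsigned char v = vec_ld((int)i, src);
            vec_st(v, (int)i, dst);
        }
        vec_dss(0);  /* stop the prefetch stream on channel 0 */
    }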
Even for a cold cache, you have at least a 20%, 25% speed increase. For a hot cache, for data that is already in the cache, we have pretty much this four-times-faster figure. Of course, I should mention that AltiVec is really useless on small amounts of data. I mean, there's no point in using AltiVec to copy just eight bytes, right? But when you work on, say, one kilobyte or one megabyte, then it's really worth looking at.

Isn't that check very expensive? Not really. What's the break-even point where it stops being worth it? It depends on the case. For memcpy, I should say around 90 to 100 bytes, maybe even less; 64 bytes would be OK.

Now, memchr and strlen: if you actually think about it, these instances just check small chunks of data for a particular byte. memchr and strlen are basically the same: one searches for an arbitrary byte, and strlen checks for byte 0. AltiVec offers one very convenient instruction that just returns a boolean: does byte X exist in this 16-byte vector, yes or no? Which is very convenient. You just take 16 bytes at a time, look at the data, ask AltiVec whether byte 0 is there, and if it's there, you find the exact position with scalar means. But it is still faster than going the scalar way from the start. That said, it's about 10 times faster for small sizes. I should clarify this: if you go outside the cache, the speed gains drop, of course. Well, it depends on the algorithm. For example, with sorting it really doesn't matter if you have one megabyte or 100 bytes, because there the algorithmic modification is much more fundamental than the actual vectorization. But this algorithmic optimization is not possible without AltiVec, so I consider these two very strongly correlated.

I don't know how many of you know the memfrob function. It's a C function, basically, that does an XOR: you give it a buffer, it XORs the buffer with the value 42. Just that. It was a joke function. And I tried to see if it could actually be vectorized and what the gain would be. The original C function, the scalar code, works on a byte basis: it takes a byte, XORs it with 42, and puts the byte back, every time. With AltiVec, it was about 24 times faster.

swab. swab is very important for byte swapping. It can be used, for example, for moving data from one architecture to another, say from PowerPC to Intel or AMD. It's also one of the functions you use when writing out audio CDs.

Hashing algorithms. The particular ones that I vectorized are exactly the hashing functions that are used in the Berkeley DB library. Because of the nature of the Berkeley DB library, I wasn't actually really able to benchmark them in context, so I don't know the end result, but I know how fast the algorithms themselves are. I should add that I'm also working on vectorizing the MySQL hashing algorithm. It's a little more tricky, but I think it will work in the end. I mean, all these functions, all these algorithms, are basically part of one family, one whole family of hashing functions. If you vectorize one of the family, then you can basically adapt anything in it. I will show you in a few moments.
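Minimal sketches of the two routines just described, assuming GCC's AltiVec extensions; the names vec_strlen and vec_memfrob are mine, and the alignment handling is simplified:

    #include <altivec.h>
    #include <stdint.h>
    #include <stddef.h>

    size_t vec_strlen(const char *s)
    {
        const char *p = s;
        /* Scalar until 16-byte alignment; an aligned 16-byte load never
           crosses a page boundary, so reading past the terminator is safe. */
        while (((uintptr_t)p & 15) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }
        const vector unsigned char vzero = vec_splat_u8(0);
        for (;;) {
            vector unsigned char v = vec_ld(0, (const unsigned char *)p);
            if (vec_any_eq(v, vzero)) {       /* boolean: is byte 0 in here? */
                for (int i = 0; i < 16; i++)  /* exact position, scalar-wise */
                    if (p[i] == '\0')
                        return (size_t)(p + i - s);
            }
            p += 16;
        }
    }

    void vec_memfrob(unsigned char *p, size_t n)
    {
        /* vec_splat_u8 only takes -16..15, so build 42 as 14+14+14. */
        const vector unsigned char v14 = vec_splat_u8(14);
        const vector unsigned char v42 = vec_add(vec_add(v14, v14), v14);

        while (n > 0 && ((uintptr_t)p & 15) != 0) { *p++ ^= 42; n--; }
        while (n >= 16) {                 /* XOR 16 bytes per iteration */
            vec_st(vec_xor(vec_ld(0, p), v42), 0, p);
            p += 16; n -= 16;
        }
        while (n-- > 0) *p++ ^= 42;       /* scalar tail */
    }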
Adler-32 is the hashing algorithm used in zlib to compute checksums of chunks of data. You give it a chunk of data, it produces a hash, and then the hash is checked against the one stored with the data. If it's the same, OK, go on fetching the next chunk of data. The AltiVec version is 2.5 times faster, and that is, I would say, just the first implementation. I suppose I could make it even faster, but I felt, OK, 2.5 is OK for the first try. Maybe afterwards someone else will see it and find, OK, you can do this, and here is a more clever way.

For zlib itself: after profiling the library, or actually a utility of the library, basically its gzip replacement, I found about six functions that were very time-consuming and would benefit from vectorization. I quite easily vectorized two of these; the others were more tricky to do. The benefit was about 25% faster overall. It was not a huge difference, but 25% is 25%.

And this is my favorite one. I think this is probably one of the largest performance gains of all. I'm going to show you the exact algorithm later, and the actual benchmarks running on this system.

So, I don't know if you can actually see this. This is the actual memcpy benchmark; this is the scalar version. This is not very visible, is it? Right. Supposedly there's a red line coming here, going like that; it's approximately 1 gigabyte per second, the bandwidth of the system. With AltiVec it goes like that. At about 256 kilobytes here, the level 2 cache size becomes smaller than the actual size that I'm trying to copy, so you see the performance going like that, asymptotically, towards the scalar version. But for small sizes, up to the 16 kilobytes that is the size of the level 1 cache, you have four times the speed. After that, the performance drops, but it's still faster.

Very recently, Marcin Kurek, a guy from MorphOS, asked me about using these functions in MorphOS. MorphOS is another operating system that follows the AmigaOS direction; I mean, it is compatible in some ways with AmigaOS. And I don't know exactly how MorphOS works, but I know that he used these functions, and he found that they actually perform as advertised. So even for small projects where you don't really want to work on vectorizing, where you don't care about vectorization, you just want your code to go faster, you could use some of these ready-made functions and plug them into your code; just compile and link, and you use it.

So this is the insertion sort. Is it visible? Yeah. Basically, sorry? Yes? If the size increases, will the line drop to the red line? Yes, eventually it will. But even so, because of the cache prefetching mechanism, there will be a slight benefit even then. And if we introduce the prefetching into the normal scalar memcpy, can we achieve more? Not really, because if the data is already in the cache, you don't benefit: if you try to prefetch it, the operation is basically a no-operation. It does nothing; you just waste one cycle. On the whole, if you know that your data will be in the cache, you don't need to do prefetching; you waste one instruction. But for generic use, you don't know that your data will stay.
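A minimal sketch of one way to vectorize Adler-32 with vec_msum; this is my illustration, not necessarily the speaker's implementation. It assumes a 16-byte-aligned buffer and takes the modulo on every block (zlib defers the modulo across many blocks, which is faster, but this keeps the vector idea visible):

    #include <altivec.h>
    #include <stdint.h>
    #include <stddef.h>

    #define MOD_ADLER 65521

    uint32_t adler32_altivec(uint32_t adler, const unsigned char *buf, size_t len)
    {
        uint32_t a = adler & 0xffff;
        uint32_t b = (adler >> 16) & 0xffff;

        const vector unsigned char vones = vec_splat_u8(1);
        const vector unsigned int  vzero = vec_splat_u32(0);
        /* After 16 bytes: b grows by 16*a + 16*d0 + 15*d1 + ... + 1*d15. */
        static const unsigned char wtab[16] __attribute__((aligned(16))) =
            { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
        const vector unsigned char vweights = vec_ld(0, wtab);

        while (len >= 16) {
            vector unsigned char d = vec_ld(0, buf);

            /* Four partial word sums of the bytes / the weighted bytes. */
            vector unsigned int vs  = vec_msum(d, vones, vzero);
            vector unsigned int vws = vec_msum(d, vweights, vzero);

            /* Fold the partial sums across the vector (result in element 3). */
            union { vector signed int v; int32_t e[4]; } s, ws;
            s.v  = vec_sums((vector signed int)vs,  (vector signed int)vzero);
            ws.v = vec_sums((vector signed int)vws, (vector signed int)vzero);

            b = (b + 16 * a + (uint32_t)ws.e[3]) % MOD_ADLER;
            a = (a + (uint32_t)s.e[3]) % MOD_ADLER;

            buf += 16;
            len -= 16;
        }
        while (len--) {                      /* scalar tail */
            a = (a + *buf++) % MOD_ADLER;
            b = (b + a) % MOD_ADLER;
        }
        return (b << 16) | a;
    }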
There is no guarantee that your data will stay in the cache, so it's wise to actually use prefetching, in most cases at least.

Yes? So the speed gain is actually thanks to the cache? Yes, that's right. That's true for big sizes, but for sizes that stay in the cache you have much more speed, because you spend all your time in the cache. The interesting point is why the glibc memcpy function is not able to utilize the cache, no? It does utilize the cache; but it doesn't utilize AltiVec. That's the point. I'm going to show you benchmarks.

Yes? Another point is that AltiVec has the possibility to bypass the cache, right? Yes, but that is slower. And that means that if you do a memcpy of a big stream with the normal CPU instructions, you empty the cache of any useful data from the program, while using AltiVec you keep the useful data in the cache. Yes. Cache control is one of the more advanced techniques for gaining more performance, but it's quite tricky to do in a program.

To understand the huge rise around 15,000 bytes: 16 kilobytes is the size of the L1 cache, the level 1 cache. Yeah, but that's an AltiVec feature, then. Yes, it is: the AltiVec unit has a 128-bit bus to the level 1 cache, whereas the integer unit has a 32-bit bus to the cache. So this is literally a four-times speed increase; this is precisely the reason. OK, but why does it drop again later, then? Because of the memory bus; sorry, the 128-bit bus is not to the memory but to the cache. If the data is not in the cache, it has to be fetched, and you lose cycles that way. So whether you like it or not, the performance drops. Then the performance increase is only in theory? No, it's not only in theory. As I said, most of the time the data that you're trying to work on is partly in the cache, and the prefetching is done, but not on all the data.

Basically, I think the test case at hand is copying the same area of memory to another place over and over again. Yes. The benchmarks, well, I'm going to explain later how the benchmarks work, but since you've brought this subject up: I tried two ways of benchmarking this code. One was to work on data that is in the cache the whole time; that's the four-times increase. The other method was to take a huge set of data and then randomly pick buffers and copy one to another, randomly, so I knew they were not in the cache. Even then, the performance increase was substantial, more than two times.

I'm just wondering what the point is in those low areas where everything fits in the cache; I think you are basically missing the actual writing out of the cache. No, actually, you don't. Once you store the data back to the destination address, it's the responsibility of the CPU to actually move the data from the cache to the memory bus. Yeah, but that will happen at some unspecified time. It will happen during the next run, maybe; you don't know. And basically, since you don't know when it will happen, you don't see it in the benchmark. And you overwrite the same area of memory again and again and again. Yes, yes. So basically the big speed increase doesn't include actually writing the data out from the memory copy.
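On the cache-bypass point just discussed: AltiVec has "least recently used" variants of its load and store instructions (lvxl/stvxl) for transient data. A minimal sketch, assuming 16-byte-aligned pointers and a size that is a multiple of 16; the function name is mine:

    #include <altivec.h>
    #include <stddef.h>

    /* Streaming copy that marks its cache lines least-recently-used, so a
       big one-off copy does not evict the program's useful data. */
    void copy_transient(unsigned char *dst, const unsigned char *src, size_t n)
    {
        for (int i = 0; i < (int)n; i += 16) {
            vector unsigned char v = vec_ldl(i, src);  /* LRU-marked load  */
            vec_stl(v, i, dst);                        /* LRU-marked store */
        }
    }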
As I said, if you work on small sets of data, consider this example: you're working with fixed block sizes, and you copy one block from memory, maybe process it, and then write it back, and then you do it all over again. Of course, you have to load the data into the cache, process it, then store it back. The CPU will find out that this data has been stored and is not really useful anymore to the main loop, so it has to free it to load another block. This kind of process really benefits from AltiVec. I don't know if you know Marcin from MorphOS, Melko, well, OK, Melko is his nickname. He uses this code in the file system he's writing for MorphOS, and it proved to be actually worth it; I mean, he really did see a performance increase. I don't use MorphOS myself, so I don't know the details. But anyway, shall we move to the next one?

Right, the insertion sort algorithm. I don't want to go into the details of the algorithm; I'll basically just say that it's a big old O(n²) algorithm. So, say this is a buffer, a sequence, and you load it. If you were working on it in a scalar way, you would just see this as one complete sequence, and you would place bytes accordingly across the whole sequence. Here we have 4 times 16, 64 bytes, so you would have up to 64² operations to perform to sort everything using this algorithm. Using AltiVec, I load this thing into registers. A vector is 16 bytes, so I just load 16 bytes at a time, and I have 4 AltiVec vectors. I treat each byte position in the vector as a column, and I sort everything in parallel: 16 columns at once. This is quite easily done using AltiVec, and the code is quite fast. I'm going to show it to you, and of course the code will be free software, GPL probably, I don't know yet. A sketch of the column idea follows below.

After you have sorted all the columns, you still have the burden of merging everything back together into one single piece of sorted data. So the next step was to take all these sorted columns and put them back together into one single sequence. What was the problem? All the merge algorithms I could find using Google were about merging two sets of sorted data. I couldn't find a generic merge algorithm that could take any number of sets of sorted data, so I had to write my own; let's say it was an extension of the existing one. I've just finished a paper on this insertion sort and merge sort, where I explain the exact algorithm of the generic merge sort.

The insertion sort is up to 54 times faster, for 138 character keys. Of course, the situation is slightly different if you have short integers or 32-bit integers. But if you have to sort 100,000 characters, then this algorithm is... I'm going to show you the results.

But isn't it... most cases where you use insertion sort are as part of a quicksort, when the partitions are really small, and if the data has to be set up for other operations, that would defeat the whole purpose. At first I thought so myself, but I've tested it. Even for the smallest case where I could use AltiVec, which was 32 bytes, just two vectors: you sort 16 columns of two elements, two rows basically, and then use the merge sort on that to produce the actual sorted sequence.
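The column trick can be sketched like this. Note that this minimal illustration uses a small compare-exchange (sorting) network on four rows rather than the talk's actual insertion sort, but it shows how all 16 byte-columns get sorted at once:

    #include <altivec.h>

    /* Compare-exchange two rows: per column, put the smaller byte in a
       and the larger in b, for all 16 columns simultaneously. */
    #define CMPX(a, b) do {                          \
            vector unsigned char t_ = vec_min(a, b); \
            (b) = vec_max(a, b);                     \
            (a) = t_;                                \
        } while (0)

    /* Sort each of the 16 byte-columns of a 4x16 block independently. */
    void sort_columns_4x16(vector unsigned char r[4])
    {
        CMPX(r[0], r[1]);
        CMPX(r[2], r[3]);
        CMPX(r[0], r[2]);
        CMPX(r[1], r[3]);
        CMPX(r[1], r[2]);
    }

After this, the 16 sorted columns still need the generic merge step described above to become one sorted sequence.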
Even this one was actually faster. It executes more or less the same number of instructions, but at those sizes you don't really care if you lose some microseconds. I mean, yes, AltiVec does have a higher overhead than plain scalar code, because you need to prepare the registers and masks and whatnot. But for sizes that are bigger than, say, 256 bytes, 256² is quite a significant number.

Yeah, you mean in general. There are cases where insertion sort is faster than quicksort. Basically this whole thing started when I saw an article on Slashdot about using the GPU to sort elements, a GPU version of quicksort. And I thought, heck, why not try AltiVec? Would we get any benefit from it? And then I looked at which sorting algorithms are the most common and most used. One is quicksort, but quicksort has a disadvantage: it is not a stable sorting algorithm. Insertion sort is one of the most common, and it is stable. And it is actually very useful when you have almost-sorted data: I mean, you have some data that was already sorted and you insert some other data in the middle. Quicksort on that is very slow; I tried it. That depends on what partitioning you use; it is a basic algorithm, and with a good partitioning algorithm it would have really good speed on that kind of data as well. True, but you will have noticed that the partitioning algorithms that offer good performance are, how to say, focused on specific kinds of data: maybe totally random data that needs to be sorted, or already-sorted data. If you want to replace a very common routine in the kernel or in glibc, you have to be generic. And quicksort is not considered the fastest routine in every case; it is considered the most efficient in the generic case, which is why it is so popular. But there are cases where another algorithm is faster. And anyway, the next step is to vectorize the quicksort itself: I mean, just do a parallel quicksort using AltiVec, then again use the merge sort to put everything back together. By my estimate, that would be about a 400% speed increase, because the partition sizes would be over four times smaller.

Right. In a few months we expect to have AltiVec code in glibc for people to test, at least a few common functions, maybe not everything; so far we've done about 15. The next thing is zlib, which, as I said, I've already done some initial work on. libmcrypt, which is mostly used for cryptographic code. STL: I don't know how many of you know the Mac port of the STL, MacSTL, which is non-free software, unfortunately. One idea is to try to convince the author to release it as free software and integrate it in some way back into the standard library. It offers similar results; for example, its sort is about 10 times faster. And it would really be fun to see how normal applications that use C++ would benefit from it. libdv, which is the codec library for digital video. The list is really long; I have it on my other computer.

And yes? I was going to ask about context switches. Well, to tell you the truth, no one has actually measured that. I've asked so many times on the PowerPC, well, kernel teams, and they said: yeah, we don't know. Basically, they think it's higher; it's supposed to be higher. But the benefit, in my opinion, would be worth it. We have the VRSAVE register, which... Linux doesn't use that. Yeah, I know that.
I think only Mac OS does. They use it per process? I think they do. The VRSAVE register allows you to say: we modified these vector registers, so save only those. Basically, so far they haven't used it at all, because, well, no one actually uses AltiVec extensively enough to care much about context switching. But once you start using it in a system-wide way, it's probably going to be significant work, at least worth looking at.

And I have some links here. Ars Technica has a very, very good article about how the G4 works in general. This is a Freescale presentation, a tutorial on AltiVec basically, which is really what got me started; it's really very nice, very well done. And, of course, PenguinPPC has very good links. PPC Zone is a very active forum for PowerPC users and fans. I also have some papers with some AltiVec example code in my account, at the end.

And that's the end of it. I think we have barely enough time. Yes? Have you seen that Linus has a version control system called Git, and that somebody contributed some PPC AltiVec code for computing SHA-1 sums? They seem to be very enthusiastic about making it very fast. No, I hadn't heard that. It's in the source tree of Git, and it seems to be very optimized; it looks like it's probably AltiVec and assembly. Wow. OK. I'll probably ask you for the link afterwards.

A similar question: could you use AltiVec in user-land to kernel-land copies? A lot of time is spent in user-to-kernel copying, for example in disk access, data reads and writes, and so on. So using AltiVec in that communication, we could achieve more performance. Exactly. There have been many articles, especially by Freescale, on this particular subject, because with AltiVec you can combine the checksum code and the actual copying. I've seen many documents and many papers on this, and I think it's very well established that, yes, it can be beneficial. And you could actually use the G4 CPU for, say, even gigabit switches. As I understand it, although I'm not an expert in the field, the actual Linux kernel right now is not really the most efficient operating system for very fast network packet switching. But I think this puts the PowerPC in a very different perspective, especially with companies who work on and produce network switches and network gear.

Right. I think we barely have enough time for the demo. Is the font visible? OK. Let's try this one. Right. So this is the case with very small sizes, with cold caches. And this is with different alignments, the alignment of the source and destination buffers. I know this doesn't really look impressive, but for sizes of 13 bytes it's very difficult to get an actual benefit from AltiVec. Once the sizes start to grow, this will start to get higher. And right now, well, I'm not doing a very good job of it. I think that small size is still pretty hard for AltiVec. Yes, because the sizes are really small right now. But it starts to get higher. Let me try to make this a little better. This is cold cache. Yeah, it's a little worse; it's still pretty much... Can you try a bigger size? Is there one where we see a different alignment of the source? Yeah. Sorry about that. We have to finish? OK. So afterwards, if anyone wants to discuss it, I'll show you the actual code. Thank you very much.