Many years ago, as we all remember, hackers would solder wires directly into a PC and use the PC to control all kinds of things. Nowadays, personal computers only have very high-speed interfaces, like PCI Express or USB, and you cannot just attach a cable to your computer anymore. What I'm working on is software-defined radio, wireless data transmission, and data processing: building something like an FPGA co-processor that accelerates data processing far beyond what the CPU can do. So I need a way to transfer data really, really fast. That is the topic I'm working on now: I use the PCI Express bus to directly access main memory and transfer data really fast.

Here is a brief introduction. For newbies, I'll explain PCI Express in a quick way, so let me make sure everyone understands this. PCI Express is a serial link, so it cannot be directly converted into I/O pins. It's not a parallel bus, and it's not like a printer port or a serial port. It's high-speed serial, so you need some fairly complicated circuitry to talk to it, but that's not important; we can buy that or use a converter. So the first thing to know is that PCI Express is a somewhat mysterious, very high-speed data link. Nowadays we have PCIe Gen 5, which goes up to about 64 gigabytes per second on a 16-lane slot. That's blazing fast.

As for the slot, the mysterious slot that seems so fragile with all that high-speed data flowing through it, it's not a very complex structure. The slot itself is basically a clock, a reset signal, and the differential pairs that we call the serial link, plus the power supply. That's all there is in the slot. That is what happens on the motherboard.

Why do we use PCI Express and not Ethernet or USB? Because the latency of Ethernet or USB is 100 or 1,000 times higher than PCI Express, since PCI Express is directly connected to the CPU. That's this.
Logically, you can think of PCI Express as wires soldered directly onto all the pins of the main DDR memory. Logically, it's like tapping directly into main memory. So the latency is low, and it doesn't require the CPU to take part in the data transfer. You can think of the PCI Express core in the CPU as a kind of switch, something like an Ethernet switch. Take an Ethernet card, for example. We don't need the CPU to receive the Ethernet packets. The Ethernet card can write the received packets directly into RAM through the PCIe bus without the CPU's knowledge.

That's why PCI Express is the best way to do data transmission or co-processing. You can treat your PCI Express device as another CPU that can access main memory or other cards. Remember, PCI Express is a memory bus. It's not like sending a packet to the CPU, where the CPU takes an interrupt and pulls the data out. Actually, the CPU doesn't know what you are doing on PCI Express. For data processing, you can read all the data directly out of main memory, process it on your card, and write the result directly back into memory without the CPU's knowledge. The CPU just sits there and checks whether you have processed all the data.

Or you can build a direct link: you can use PCI Express to connect two PCs together and transfer RAM contents directly from one PC to the other. That's very practical, and the latency is ten times lower than Ethernet because there's no IP protocol stack or anything like that.

It's not secure, actually, because anyone who can install a PCI Express card in your computer has total access to your computer.
So game cheaters now use PCI Express for cheating, because a hardware cheat can read memory contents directly, and the CPU cannot detect it. Yes, that's the perfect way to cheat in games. You can even write to memory: you can change your mouse position so that it points straight at the enemy's head. And the CPU knows nothing about it. It's practically undetectable.

I mentioned this just now, but what is DMA? DMA, direct memory access, means you can use main memory as you wish, reading or writing it directly without the CPU's knowledge. That's a huge advantage over USB, because with USB, when you transfer a data block of something like 4 KB, you need an interrupt, or the CPU needs to grab the data block and process it. With PCI Express, you can take a big server with one terabyte of RAM and use an FPGA to fill up all of that DDR RAM, the whole terabyte. That could take several tens of seconds, and meanwhile your CPU can do other things without any problem. You don't need to interrupt or process every single data block. That's a huge advantage when you're handling a huge amount of data.

PCI Express is also very low latency. I'm actually using a PC as a kind of real-time controller, because a PC is much, much faster than a microcontroller. It runs at a four or five gigahertz clock, and you can do tons of floating-point math on it. Then you can control a machine, something like a radio or a quantum physics device, that needs very fast reactions. You can use a PC to run something like 15 million cycles per second for very fast-responding control.

Also, for robotics, the PCI Express bus is much more reliable than USB. On a server platform, it just doesn't fail. So when you're controlling something like a car or a robot from a computer, the best way is to attach it over PCI Express.
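To put rough numbers on the one-terabyte example, here is a back-of-the-envelope sketch. It is only an illustration: the function names are made up for this example, "one terabyte" is taken as 10^12 bytes, and the two rates are the Gen 5 figure quoted earlier and the 3.6 gigabytes per second measured on the FPGA board.

```python
# Back-of-the-envelope numbers for a bulk DMA fill (illustrative only).

def fill_time_seconds(total_bytes: int, bytes_per_second: float) -> float:
    """Time needed to stream total_bytes at a sustained rate."""
    return total_bytes / bytes_per_second


def blocks_needed(total_bytes: int, block_bytes: int) -> int:
    """Number of fixed-size blocks needed to cover total_bytes (rounded up)."""
    return -(-total_bytes // block_bytes)


ONE_TERABYTE = 10**12  # assumption: decimal terabyte

# Fill time at the Gen 5 x16 figure (64 GB/s) and at the measured 3.6 GB/s.
print(fill_time_seconds(ONE_TERABYTE, 64e9))   # 15.625 seconds
print(fill_time_seconds(ONE_TERABYTE, 3.6e9))  # roughly 278 seconds

# If every 4 KB block needed CPU attention, this is how many events that is.
print(blocks_needed(ONE_TERABYTE, 4096))       # 244140625
```

The last number is the point of the comparison: with per-block handling, the CPU would have to react hundreds of millions of times, while with DMA it only needs to check once at the end.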
The last one is high-speed interconnect. Since PCI Express is a kind of memory bus, you can write data into memory or write data into another card, and the data will flow through the CPU like this. You can see that when the data flows through the CPU, it's fast, and the CPU doesn't need to spend any time processing it. So it can be used as something like an Ethernet switch or packet switch, but it's much faster than Ethernet, with lower latency. That's why the new AMD Ryzen and EPYC platforms have so many PCIe lanes: server users are using PCI Express as something you can think of as an equivalent to Ethernet, but with much, much lower latency. The logic is similar, too.

To connect something to PCI Express, I use this one, the LT5031. It's available on Taobao in China, and you can also email me if you want something like this. I think this is the lowest-cost high-end Chinese FPGA board. The FPGA chip, from the 7K series, is listed at something like $5,000 on Digi-Key, but when you buy it in massive quantities, you can get a price of something like $100. That's a 50 times difference. And the real transfer speed we measured here is 3.6 gigabytes per second.

That's because FPGAs generally cannot handle the latest PCI Express version. They can do something like PCIe Gen 2 or Gen 3, but not Gen 5 currently. If you want an FPGA that does PCI Express Gen 4 or 5, that gets really, really expensive. Those chips are hard to obtain, and the price goes up to several thousand dollars. So why did I choose PCIe Gen 2? Gen 2 is old, but the chip is low cost. This board can be bought for less than $200.

So, PCIe Gen 2, does that work? Yes, yes.
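As a rough check on these numbers, the raw bandwidth of a PCIe link can be estimated from the per-lane signalling rate, the line coding, and the number of lanes. The sketch below is an illustration and not something shown in the talk. The lane count of the board is not stated, so the x8 width in the example is an assumption, and the result ignores packet header overhead.

```python
# Per-lane raw rate in GT/s and line-code efficiency for each PCIe generation.
# Gen 1 and 2 use 8b/10b coding; Gen 3 and later use 128b/130b.
GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def link_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """Raw one-direction link bandwidth in GB/s, before packet overhead."""
    gt_per_s, efficiency = GENERATIONS[gen]
    return gt_per_s * efficiency * lanes / 8  # 8 bits per byte


# Assumed x8 Gen 2 link: 4.0 GB/s raw, so 3.6 GB/s would be 90% of raw.
print(link_bandwidth_gbytes(2, 8))   # 4.0
# Gen 5 on a 16-lane slot comes out at about 63 GB/s.
print(link_bandwidth_gbytes(5, 16))
```

If the board really is a Gen 2 x8 design, the measured 3.6 gigabytes per second is close to what the link can physically carry once packet headers are taken into account.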
This one cannot do Gen 4 or 5 because the internal digital circuits in the FPGA are not fast enough.

Is it backward compatible with newer motherboards? Yes, all PCI Express is backward compatible. You can plug something like a Gen 1 card into your latest, very expensive gaming rig.

What does that mean for the system? You have all these components, and one of them is much slower. How does that affect the rest of the system? Physically, PCI Express is a point-to-point link, just like Ethernet, and each link runs at its own speed. The fastest component in your system is the CPU, so the CPU won't be the bottleneck unless you use a very old CPU, and a single very slow device doesn't affect the total system performance. You can attach something like 20 cards, from Gen 1 to Gen 5, to a Gen 5 server, and the server will still run at Gen 5 speed. It's packet-based, just like Ethernet, so a very slow device doesn't affect the fast ones.

How does PCI Express transfer data? It uses TLPs, transaction layer packets. You can think of a TLP as something like UDP, but much simpler than Ethernet: basically you can only do reads and writes. If you want to write main memory, say to change some data in your game, you just send the write TLP to the computer, and it changes the data in main memory. That's it. It doesn't require any acknowledgement; you just send the write request. For reading, you send a read request, and a very short time later you get the read reply with the data.

PCI Express can also do interrupts, just like interrupts in microcontrollers. You can interrupt the system kernel, and that can be very low latency. Nowadays, modern CPUs don't use wires or pins for interrupts. They use messages. When you want to interrupt the CPU, you send a packet over PCI Express, or internally, it's a packet inside the CPU.
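To make the write and read requests concrete, here is a minimal sketch that packs the three-word header of a memory request with a 32-bit address. It is a simplified illustration, not code from the talk: the function names are made up, traffic class and attribute bits are left at zero, and the framing, sequence number, and CRC added by the lower layers are omitted. A message-signalled interrupt is, on the wire, a memory write to a special address, so it would use the same write format.

```python
import struct


def tlp_header_32(is_write: bool, requester_id: int, tag: int,
                  address: int, length_dw: int) -> bytes:
    """Pack a 3-DW memory request header with a 32-bit address."""
    if address % 4 or not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must be a DW-aligned 32-bit value")
    if not 1 <= length_dw <= 1024:
        raise ValueError("length must be 1..1024 DWs")
    if not 0 <= requester_id <= 0xFFFF or not 0 <= tag <= 0xFF:
        raise ValueError("requester_id is 16 bits, tag is 8 bits")
    fmt_type = 0x40 if is_write else 0x00    # memory write vs memory read
    first_be = 0xF                           # all bytes of the first DW enabled
    last_be = 0xF if length_dw > 1 else 0x0  # single-DW requests use 0 here
    dw0 = (fmt_type << 24) | (length_dw & 0x3FF)  # 1024 DWs is encoded as 0
    dw1 = (requester_id << 16) | (tag << 8) | (last_be << 4) | first_be
    dw2 = address
    return struct.pack(">III", dw0, dw1, dw2)


def memory_write_tlp(requester_id: int, tag: int, address: int,
                     payload: bytes) -> bytes:
    """Write request: header followed by the data, no reply expected."""
    if not payload or len(payload) % 4:
        raise ValueError("payload must be a non-empty multiple of 4 bytes")
    header = tlp_header_32(True, requester_id, tag, address, len(payload) // 4)
    return header + payload


def memory_read_tlp(requester_id: int, tag: int, address: int,
                    length_dw: int) -> bytes:
    """Read request: header only; the data comes back in a completion."""
    return tlp_header_32(False, requester_id, tag, address, length_dw)


print(memory_write_tlp(0x0100, 1, 0x1000, b"\xde\xad\xbe\xef").hex())
# 400000010100010f00001000deadbeef
print(memory_read_tlp(0x0100, 2, 0x1000, 1).hex())
# 000000010100020f00001000
```

The write carries its data with it and needs no reply at this layer, while the read carries only a tag, which the completion uses to match the returned data to the request.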
The CPU receives the packet and treats it as an event. A packet is an event. So when the CPU gets this packet, it knows that an interrupt has happened and that it can be processed later.

Here is the software and code I use to do the data transfer. It's called RIFFA. RIFFA was originally written at UCSD, an American university, but the original version has some bugs and is missing some important features. Originally, RIFFA can only do direct memory access, that is, reading or writing your main memory, but I added AXI4 logic. What is AXI4 here? It's the reverse direction: the CPU accessing the PCI Express device. DMA is the device accessing main memory, but it can be reversed. You can use the CPU to access registers on your FPGA card. That enables something like a bus interface or various user-defined logic. So I added an AXI4 interface, which is basically registers. With this modification, the CPU can change settings on the card or change register values on the card. I also added IRQ functions, so the FPGA can send interrupts to the CPU.

Here are some demonstrations of the latency. This is how the FPGA card is attached to my test computer. The laptop on the left is attached to our logic analyzer, because I need to do some low-level debugging. PCI Express is actually hard to debug, so modifying the software took me a very long time. But now it's working perfectly.

It also requires special tools if you want to do very low-level physical debugging, because the data is really fast. Even if you have a high-end oscilloscope with tens of gigahertz of bandwidth, you cannot record for long enough to find the failure. If something fails once a day, you need a huge amount of memory to record it. So I got this one, a Keysight logic analyzer dedicated to PCI Express.
When new, one of these costs something like $150,000. But I managed to collect this thing from junk on eBay. After I collected something like tens of components, I put them all together, and now it's working. It's for very difficult low-level bugs: if something fails in the very bottom layer and cannot be analyzed with the FPGA debugging tools, I need this thing.

Setting it up is also time-consuming, and you need another computer to run it. So I needed three computers: one for compiling the FPGA firmware, another for running Linux and testing the PCI Express card, and a third for running the logic analyzer. The logic analyzer itself also uses PCI Express; the green cable you see is attached to a PCI Express slot on a desktop computer.

This is what the logic analyzer looks like. Basically, these are the packets running through the PCI Express lanes, and with the debugging tool from Keysight you can see everything that happens. In 99% of my working time I don't need this, but if I run into some corner case, for example someone reports a bug and sends me an issue, I need to debug it with this thing.

This is the latency measurement. It runs at 200 hertz: once every five milliseconds, the FPGA sends an interrupt to the Linux operating system, and the operating system records a timestamp. With the timestamp values, I can check the control latency of the PCI Express interrupts. The test rig is something like an i3, with all CPU cores running at 100%. I reach 100% by running the 7-Zip benchmark to saturate the CPU, and then I measure the latency. The result is actually very impressive: the nominal interval should be five milliseconds, and the actual measured value is within plus or minus one millisecond of that.
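The measurement just described boils down to comparing the gaps between recorded timestamps with the nominal period. A minimal sketch of that calculation follows. The function name is made up for illustration, and the timestamps in the example are invented values in microseconds, not data from the talk.

```python
def interval_jitter(timestamps_us, nominal_period_us):
    """Return (min, max) deviation of the interrupt intervals from nominal."""
    if len(timestamps_us) < 2:
        raise ValueError("need at least two timestamps")
    # Gap between each pair of consecutive interrupt timestamps.
    intervals = [b - a for a, b in zip(timestamps_us, timestamps_us[1:])]
    # How far each gap is from the period the FPGA is supposed to produce.
    deviations = [i - nominal_period_us for i in intervals]
    return min(deviations), max(deviations)


# Invented timestamps for a 200 Hz source (nominal period 5000 us).
example = [0, 5000, 10100, 14950, 20000]
print(interval_jitter(example, 5000))  # (-150, 100)
```

A result of (-150, 100) would mean the intervals stayed within minus 150 and plus 100 microseconds of the nominal five milliseconds.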
So basically, that means the CPU is guaranteed to answer the interrupt in less than one millisecond, even with all the CPU cores saturated. That means if you control a robot at something like 200 hertz, it can be very stable. If I lower the CPU usage to 17%, the latency variation goes down to plus or minus 100 microseconds, 10 times lower than with the saturated CPU.

These peaks are actually caused by loading DLLs into the application. This is very interesting. If you keep everything static in your application, say an application that receives interrupts from PCI Express and does some data processing, without any dynamic memory allocation or dynamic module loading or unloading, the latency can be very, very stable. But if you load another application or load a DLL into your system, that can make the latency go up. It's not very high, though: something like three or four hundred microseconds. It's not as fast as an embedded controller, since we are running desktop Linux, but I think this is good enough to build something like a motor controller running at one kilohertz.

The reason I use a PC to control a motor is that I can run various experiments, check the waveforms, and do all kinds of fancy stuff on the PC. It's much faster and easier to change a PC application than microcontroller firmware. If you want to debug, you can record everything into memory and check the status.

So basically, that's all. Any questions?

How long did it take to put this together? Your logic analyzer?

My logic analyzer took six months, and this one, the PCI Express code, took me another six months. But since it was during COVID, I had plenty of time for debugging at home.

Is that your DLL?

No, no, no, it's not my DLL. I want to do signal processing, SDR acceleration, and the CPU is not good at that kind of work. I want to process all the data in the FPGA.
At the very beginning, I tried to use Xilinx's PCI Express DMA, XDMA, but it's full of bugs. Anything commercial without open source is a disaster. So I went for the open-source solution. But there was a hidden bug, buried deep inside it, and it could interrupt your data transfer. It could break the data transfer at random. It took me something like two months to hunt it down. But now it's very stable. I can write at something like 3.6 gigabytes per second, 24 hours a day, 7 days a week, for one month, without any errors in the data stream.

You mean the PCI interface? The main one?

For the logic analyzer, that one is called a PCI Express interposer. We use it to amplify the very weak, high-speed signal so the logic analyzer can pick it up. If I don't need the logic analyzer, I can remove the bottom part and plug the FPGA board directly into the computer.

This one just has some minor issues. It's not a standard PCI Express card form factor, because I drew the PCB design really, really fast and forgot to check the outline of a standard PCI Express card. So I left it as it is. Maybe in several months I will rework it into a standard PCI Express card so I can sell it. But this one is really good for development. Because this card already uses all of its high-speed transceivers, I cannot attach something like 10G Ethernet to it. If you have a better FPGA, you can try to do something like real-time graphics on the FPGA.