record on your... okay, there we go. So let's look at something fairly basic. If we create a structure — we'll call this struct demo — we're going to say that this structure has a bunch of elements. We're going to give it... actually, I don't need the header file for this. I'm going to re-sync this with a remote host, because I'm deploying on a remote host, so it'll find all my header files. Or it should, anyway. It's not wonderful. Okay, I'm going to close this and just go with another project that actually works, because it's easier. So we're going to take a very basic structure and add a bunch of variables to it. We'll put in a 32-bit that we'll call our marker, another 32-bit that we'll call our value, then a char string of 20 just to add some padding, and then another 64-bit value. So you've got a structure with a bunch of elements in it. Now we create a further structure that we'll call our base structure, and we'll give it struct demo ds as an array of, I don't know, call it 10 million elements. The first thing to notice here is that my demo struct array is a static allocation. I'm not making it a pointer — though it does collapse to a pointer, because that's what arrays do in C — and I'm not going to allocate 10 million separate copies of demo, because allocation can actually be pretty slow. In a situation where I'm just going to access my base element once and do one calloc, it probably wouldn't make a difference, but for now this will work just fine. So I've got my two structures, and struct base *bs — bs for base struct — one calloc of sizeof(struct base). I need to include some things here. Thank you — that's a good point, somebody should have told me that I was not sharing anymore. Let me fix that. Okay, there we go. Clay, just do me a favor — can you see the waiting room or not? I don't think I can quite yet. Not a problem, I've got another screen, so I'll keep an eye on it in case people are trying to join. Can you guys see my screen at the moment? Yes, we can see now. Okay, that helps. So I've got my demo structure here, and then this base structure that contains my demo structure. The logic behind this will become clear in a bit, and I allocate for my base structure. Now, the first thing I want to do: I've got this table, ds, which is a table of demo structs. I'm going to want to search for something in that table. I'm also going to want to add and remove entries, and I'm going to need to check which ones are allocated, whether something's been removed, et cetera. So I've got a marker value in there — it's a 32-bit, and yes, it uses a bit of memory to have it there. But if I take that marker and set it: for int i equals zero — and let me make this a define just to keep it clean, DEMO_T, 10 million — so i less than DEMO_T, i plus plus, bs->ds[i].marker equals the ones' complement of zero, ~0. That sets my marker so that all the bits in it are set.
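As a reference, here's a minimal sketch of what's on screen so far — the fields, the 20-byte pad, and DEMO_T follow the spoken description, but the exact on-screen layout is an assumption:

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_T 10000000UL          /* 10 million elements */

struct demo {
    uint32_t marker;               /* allocation/empty marker */
    uint32_t value;
    char     str[20];              /* just some padding, as in the demo */
    uint64_t value64;
};

struct base {
    struct demo ds[DEMO_T];        /* static array, not 10M separate allocs */
};

int main(void)
{
    struct base *bs = calloc(1, sizeof(struct base));  /* one allocation */
    if (!bs) return 1;

    /* The naive version: iterate 10 million times setting the marker. */
    for (uint64_t i = 0; i < DEMO_T; i++)
        bs->ds[i].marker = ~(uint32_t)0;   /* ones' complement of zero */

    free(bs);
    return 0;
}
```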
My marker is set, and I'm going to use that to say this is an empty value — so, find empty: marker equals ~0. Now, if you look at this piece of code, it's going to iterate 10 million times through the demo structure to set that marker. That's not efficient; it's a pretty slow way to do things. So — Andrew, no comments on this, because I know you've done this — anybody got any ideas on how I can optimize that? Keeping in mind that the marker is a 32-bit value, not a simple 8-bit value, and I don't want to modify the rest of the struct, so I can't use a simple memset. Ideas? Thoughts? Bit-shifting is not going to work — you've still got to iterate 10 million times. How are you going to avoid that level of iteration? Any ideas? Come on, guys, speak up. Clay — yes, you could slice it. You could slice it and thread it. That's one option, which effectively means I start a bunch of threads and fill the array in parallel across a number of CPU cores. The problem with doing that, when I'm only dealing with 10 million, is that I'm going to take a huge hit on starting those threads and a huge hit on joining them. (I think I've let everybody in at the moment, Andrew — everybody's admitted.) So that's going to take a hit, and with an array that's only 10 million long, you're probably not going to see any benefit. So we need another way. Any other ideas? Okay, I just want to pull something up on my other screen. The other reason you use a live project to write demo code in is that it gives you working code you can refer to off the top of your head. Okay, so that for loop is the baseline, not the optimization. A straight for loop is nice, it's easy to read, everybody understands what it does. As I said, it's slow. So now we're going to get a little bit fancier. First thing, I'm going to create a static constant array of eight u64 values, sized off struct demo — just because it makes the code slightly cleaner — and we'll call it demo_offset. Sorry, I'm being such an idiot tonight. Okay, so what I've done here is create an array where each entry is an offset, a multiple of the size of the structure I'm going to be walking through. What that means is — actually, I'll show you in the next function, it'll be easier — but effectively that's going to be used as a bunch of offsets. What am I going to use those offsets for? The trick is that I need to be able to load my value at a particular point. So now I'm going to create another function: static inline scatter_marker. It takes a start pointer, probably a size_t for the element count — you'll see in a second if I've forgotten anything I need. Oh, and it's showing me that it helps if the structure is declared above my constant. And what else do I want — actually, I can scatter with just a u32 for the value. So the first thing I'm going to do in it is set markers, and because I've got a fancy CPU, I want to do this eight at a time.
So if I do something along these lines — and I'm going to need an include here, immintrin.h, the intrinsics header, very useful when you're doing optimization — we'll call this our scatter value. And we're going to use this really strange thing, which I'll explain in a second. This is going to be kind of a strange cast, just to keep it safe when I'm using ones' complements, et cetera. What this does is create a vector — it uses a 512-bit wide register on a modern CPU that basically contains, no, sorry, sixteen 32-bit elements. I only actually need eight, though. And the reason I need eight is that I can't scatter more than eight values at a time, because I can't load more than eight pointers at a time: I can't go wider than 512 bits in the register, and pointers are 64-bit. Now, what I've just said probably makes absolutely no sense to a lot of you. So if it makes no sense, say so now and I will explain it. Anybody? Guys, I need some interaction here — it's very hard to teach if I'm not getting any feedback. Yeah, I think basically it's a very new concept to us, so we're just sort of watching to see how it goes. Okay, so what I'll do then is expand on this function and then take you through it line by line. Does that work for everyone? Okay. But effectively, the important thing to know going in is that on a modern CPU you have multiple cores and you can thread across them — you can split your data up amongst threads. But there is something else, called SIMD: single instruction, multiple data. These instructions use a vector, and you can't work on individual elements of a vector one at a time; an operation always works on all the elements of the vector at once. But those vector instructions give you parallel processing within a single CPU core: you're operating on multiple elements — effectively an array that you load into a register — at the same time. Now, in this case, I'm going to be setting a bunch of memory. So what I've done with that last line is effectively create an array in a hardware register of eight 32-bit values with all the bits set — that's the ones' complement. Now I'm going to work on multiple 32-bit values at the same time without actually threading it; you'll see in a second. The next thing I need to make this work — because now I've created the data I want to spew out across memory — is to know where I'm going to be throwing it. The answer is: at my pointer, with each store offset from it. So I need to create an offset vector for my pointers: offset vector equals _mm512_loadu of demo_offset. That loads the values from my offset array into a 512-bit wide vector. So I've now got eight slots in this kind of hardware array holding all the offsets. Now I need to throw my value at the right pointers. So: for i equals 0, i less than — this should actually be called elements — i less than elements, i plus equals 8. Notice that I'm incrementing here not by one but by eight, because I'm working with eight values at a time. I then do something to the effect of — let me just check that it is mm512 — _mm512_i64scatter_epi32, that's it. Now I'll explain that in a second.
Start — which is the start pointer plus i times the size of my structure — then my offset vector to offset the eight pointers, then the scatter value to store, at a scale of 1. And this should be static inline void. And, just as a safety in case the count isn't a multiple of eight: if i plus 8 is greater than elements, don't touch the last few — just for safety. So what is this doing? Imagine you have eight pointers pointing to eight different locations, each offset from the previous one by the size of the structure, and you throw data at all eight pointers at the same time. How do you call this? Look at this for loop over here, where I refer to bs->ds[i].marker. I replace that with: scatter_marker, the pointer to ds[0].marker, my elements count is DEMO_T, and my offset array is demo_offset. Let me just check — this needs a cast, because it's a define... and let's see what it's whining about here. The IDE is just whining at me because the value never changes; it's a static value. I also want to change this so that I'm loading from what I'm actually passing in, to make the function more generic. There we go. So this here now runs through that 10-million-element array, but it does it in 10 million divided by eight iterations.
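Assembled from the walkthrough above, a hedged sketch of the scatter routine — names like scatter_marker and demo_offset follow the spoken description, it assumes the struct demo from earlier, and it needs AVX-512F (compile with -mavx512f):

```c
#include <immintrin.h>   /* AVX-512 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Byte offsets of the same field in eight consecutive array elements. */
static const uint64_t demo_offset[8] = {
    0 * sizeof(struct demo), 1 * sizeof(struct demo),
    2 * sizeof(struct demo), 3 * sizeof(struct demo),
    4 * sizeof(struct demo), 5 * sizeof(struct demo),
    6 * sizeof(struct demo), 7 * sizeof(struct demo),
};

static inline void scatter_marker(void *start, size_t elements,
                                  const uint64_t *off_array, uint32_t value)
{
    /* Eight copies of the value in a 256-bit vector (8 x 32-bit lanes). */
    const __m256i scatter_val = _mm256_set1_epi32((int32_t)value);
    /* Eight 64-bit byte offsets loaded into a 512-bit vector. */
    const __m512i off_vec = _mm512_loadu_si512(off_array);
    const size_t stride = 8 * sizeof(struct demo);
    char *p = start;

    for (size_t i = 0; i + 8 <= elements; i += 8, p += stride)
        /* One instruction stores all eight markers: base pointer plus
         * eight different offsets, scale 1 because offsets are bytes. */
        _mm512_i64scatter_epi32(p, off_vec, scatter_val, 1);
    /* The tail (elements % 8) would be finished with a plain loop. */
}

/* e.g. scatter_marker(&bs->ds[0].marker, DEMO_T, demo_offset, ~(uint32_t)0); */
```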
How does this work? Look: your pointer to marker, which lives in this demo structure over here, plus the size of the demo structure, takes you to the marker in the next array element. In the same way, a pointer to value in the first element of the array plus the size of the structure takes you to value in the next element. You're pointing at one field of your structure and then jumping by the size of the entire structure each time, so you always land on the same field within the next element. Does that make sense? A lot of this is intrinsics-based, yes, but what I just said about the pointers is where it starts. Effectively — I'm trying to think of an easier way to explain this — if I have ds[0].marker and take a pointer to it, u32 *x, then x is now pointing at my first marker, right? If I do something to the effect of y equals — and I'm going to cast this deliberately, a fairly fancy cast through void *, just to be 100% safe — the value of x plus sizeof(struct demo)... I'm missing something, somewhere my cast isn't right... ah, I don't want to dereference that. What y now is, when I do that, is the equivalent of a pointer to bs->ds[1].marker, because I've added the size of the entire structure. Do you guys follow how that works? Thoughts? Talk to me, guys, because this stuff gets complicated — a lot more complicated than this — and if nobody's talking, I don't know what you're getting and not getting, and you'll get lost fast, so somebody speak up. Ahmed, what sort of questions have you got? You said a bit that implies there are elements that don't make too much sense. Okay — to be honest, intrinsics isn't a much-explored area for me, so I'm trying to keep up. Okay, but forget the intrinsics for a second. Look at the pointers first, because the pointers are where this starts; we'll go into the intrinsics in more detail. What I need to be sure of is: are you understanding the alignment of those pointers, and how I'm incrementing them to get to the next value in the array? Yeah — basically what you're doing is taking the size of the structure and, sort of, pointing to the next u32 block. Yes — and it's the same if I do this with value: a points at ds[0].value, and after adding the size of the structure it now points at ds[1].value, not marker, because I'm incrementing by the size of the entire structure. That makes sense — you're sort of calculating the next pointer reference, something like that. Correct. Clay, am I making sense to you? Yes. Okay, that's good, because Clay is not a C programmer, so if I'm making sense to him, the rest of you should be following this. Sorry, Clay. Oh no, none taken. Okay, so that's what I'm doing with this offset array: it effectively gives me an eight-wide set of offsets to the same field in eight consecutive elements of the struct array — eight at a time, right? Sorry, I have a question, if it's okay, before we proceed. What is the advantage of making the calculation instead of directly pointing to the next element of the structs? Okay — remember when I said I can't work on single elements of a vector; I have to work across the entire vector? I've got to load the vector eight values at a time, and I've got to act on it eight elements at a time. You can't say: give me vector element one, vector element two, vector element three, et cetera. Any intrinsic you see that appears to do that is actually just an abstraction over a memory load: it sets up an array and pulls in the whole array. Okay, so it's like you're avoiding being repetitive with the code. You'll understand this better in a second. Once I've set up these eight offsets and loaded the offset array over here, look at what that scatter instruction does. It takes the start pointer — which I'm incrementing by eight times the size of the structure, so I'm jumping eight entire structures along the array per iteration; the first time through, this is start plus zero, because the multiplication comes out as zero. It takes that pointer, applies the offset vector to give it eight separate pointers, and then takes the values out of my scatter vector: lane one ends up at the first pointer, lane two at the second, lane three at the third, et cetera. And it does this in a single operation. Okay, I think I understand — so the calculation is doing something you typically can't do with a normal structure reference. Exactly, because normally you'd do this one element at a time; now I'm working across eight elements at a time. So that's theoretically eight times faster than the original loop? It's massively faster, because I can work eight at a time. And this is the simple one — this is where you just set something, and you know you're spewing it all the way down the array, nice and simple, right? Okay, yeah.
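A small recap sketch of that pointer striding, assuming the earlier struct definitions:

```c
#include <stdint.h>

/* Advancing a field pointer by sizeof(struct demo) lands on the SAME
 * field in the NEXT array element — this is what the offsets rely on. */
static void stride_demo(struct base *bs)
{
    uint32_t *x = &bs->ds[0].marker;
    uint32_t *y = (uint32_t *)((char *)x + sizeof(struct demo));
    /* y == &bs->ds[1].marker */

    uint32_t *a = &bs->ds[0].value;
    uint32_t *b = (uint32_t *)((char *)a + sizeof(struct demo));
    /* b == &bs->ds[1].value — same field, next element, not marker */
    (void)y; (void)b;
}
```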
But before I move to the next step, I want to make sure: has anybody got questions about this that they're not quite grasping? Because it's going to start getting more complicated from here, believe it or not. So, questions — speak now or forever hold your peace. Yeah, I get more of the performance thing now, because I mentioned slicing before — slicing it up into eight different threads or forking into eight processes. With the vector math, when you say one step, does that mean it's only one operation, and you don't have to worry about dealing with threads and making sure everything comes back in order properly? Exactly. It's a single instruction. This _mm512_loadu_epi64 is a straight mapping to an assembler instruction — which, if memory serves me correctly... in fact, I can tell you exactly what the instruction is, give me a second, I'm just pulling up the intrinsic. It's an instruction on the newer Intel CPUs called vmovdqu64: vector move, double quadword, unaligned, 64-bit values. It's a single instruction telling the CPU to act on eight elements of a vector at a time. And that's a heck of a lot simpler. So it's like, instead of having eight separate buckets digging out sand at the beach, you have one bucket made of eight small buckets glued together — it acts as one operation. Exactly. You don't end up with locking issues, you don't end up with thread-join issues, and it all happens at exactly the same time on one core. So if you've got 24 cores and you thread this as well, you're no longer operating on 24 elements at a time as you would be with threads alone — you're now operating on 24 times 8, 192 elements at a time. It's a whole other ballgame when you start working with vectors. But it gets tricky. And I have other questions about the practicality, and why you'd choose certain types of hardware, but I don't want to take away from the coding demo — I can ask at the end. Okay. Yeah, we'll talk about that, because one of the things to know about optimization is that when you're writing code like this, you are optimizing for specific hardware platforms. It's an important point to make at the start: this code requires a fairly modern CPU, because it uses the AVX-512 instruction set. You could do a smaller version of this using AVX2, which works on 256-bit vectors, and most CPUs will at the very minimum do 128-bit vectorization. But in this case we're working with 512, purely because some of these instructions — the scatters in particular — are part of the AVX-512 instruction set. But as I said, this is the simple version: how to scatter. Now let's say I want to start searching an array for an element — I'm going to have to get a little bit more fancy. One other thing to note about these functions: you'll notice I declare them static inline. What that does is, when it compiles, the compiler takes this code and literally inserts it inline into the calling function. The reason is that when you're trying to optimize, you really, really, really do not want to be making jumps out to other functions in the code — a call is an expensive operation. If you inline it, there's no jump. So static inline says: take that function at compile time and shove it straight inline into the code wherever I call it. That makes sense?
So — to answer your question in the chat — that depends on the type of code you're writing for the everyday user. What you would very often do is handle this with preprocessor directives. Effectively, when you compile, the directives test for certain instruction sets and pull in either the optimized version or the non-optimized version. So when you give someone the software and they compile it, it compiles based on what hardware they have, detected at compile time. It just means you're writing a lot more code, because you're writing both the optimized and the non-optimized paths.
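A minimal sketch of that compile-time dispatch, assuming the scatter_marker and struct demo from the earlier sketches; __AVX512F__ is the feature macro GCC and Clang define under -mavx512f:

```c
#include <stdint.h>
#include <stddef.h>

/* One public name, two implementations chosen when the code is built. */
#ifdef __AVX512F__                 /* set by GCC/Clang under -mavx512f */
static inline void set_markers(void *start, size_t elements,
                               const uint64_t *off, uint32_t v)
{
    scatter_marker(start, elements, off, v);   /* vector path from earlier */
}
#else
static inline void set_markers(void *start, size_t elements,
                               const uint64_t *off, uint32_t v)
{
    (void)off;                                 /* offsets unused here */
    char *p = start;
    for (size_t i = 0; i < elements; i++, p += sizeof(struct demo))
        *(uint32_t *)p = v;                    /* plain fallback loop */
}
#endif
```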
But now let's look at how we go about doing a search. Static inline — we'll fix the return type in a second — and we'll call this table_search: a void *start, our elements count again, a const u64 off_array, and a value to search for, so u32 needle. We're going to search our haystack for a needle. First thing, again, I need a vector of the needle value, because it's that vector I'll be matching against. So we'll call this our needle vector: _mm256_set1_epi32 — and the IDE says the vector types are incompatible; let me fix the cast. That creates my vector of the thing I'm going to match. Then I load my offset vector again — the same code as the previous function, loading from off_array. Then: for int i equals 0, i less than elements, i plus equals 8 — again, we're iterating eight at a time. And I then do something along these lines — we'll call this our element vector: _mm512_i64gather_epi32, from our offset vector, using our start pointer, at a scale of 1. What this instruction does is the exact opposite of the scatter I used up there: instead of putting stuff into memory, it pulls from memory, eight values at a time, into this element vector. Then — and this is where things get a little funky — I create what's called a mask. It's basically a binary result: result equals _mm256_cmpeq_epi32_mask of element vec and needle vec. What this does: imagine you've got eight lanes in that vector — it's a 256-bit vector, because it's eight 32-bit values — and you compare it to the eight lanes of the other vector, the needle vector. Every lane that matches sets one bit in an 8-bit mask, one bit per lane. So if you've got eight 32s at the top and eight 32s at the bottom and they all match, every bit in that mask is set. If nothing matches, none of the bits are set. Make sense? Okay, so in essence it's mostly used for tracking what you're trying to do? No — it's used for telling me which elements matched. Okay, great. Now, the other thing to know about masks, by the way, is that you cannot operate on them directly either. So we create an unsigned 32-bit value called mask_result, and we say something to the effect of: if mask_result — which is convert-mask8-to-u32 of result — is not equal to zero, that means something in those eight values matched. I've actually found a match for my needle in that particular eight-element segment of the array. I can then get clever, because keep in mind I don't want to iterate through every bit to find which of those eight actually matched — I want to do this in a single step. So I return — and this should be an integer return — i plus __builtin_ctz of the mask, and I'll explain that in a second. And another thing: you don't want to be calling sizeof constantly either. So we create a loop increment — it'll be a long long, since we're doing 64-bit pointer math — loop_inc equals 8 times sizeof(struct demo). That's my actual loop increment: start plus equals loop_inc. And just checking what it's whining about here — this mask_result is a 32-bit unsigned. There we go. And otherwise I return minus one. So what have I got here? This grabs eight values at a time, the same way I scattered them in the previous function. Then it does a comparison on all eight values it's grabbed, which produces a bit mask. I take that bit mask, convert it to a u32 — because I can't operate on masks directly — and I return i plus the number of zero bits below the first set bit. So effectively, i gives you the offset to the nearest eight, and counting the unset low bits of the mask gives you the position within those eight. This function returns the offset into my array using one eighth of the instructions you'd use iterating normally. It's eight times faster. Make sense? Basically you're using bit-level arithmetic to make it much shorter. Yes, but I'm also operating on eight elements at a time, because I've loaded all eight. So let's look at it this way. Say I have u32 demo[8] equals one, two, three, four, five, six, seven, eight. And a const u32 we'll call comp, for comparison — our needle, just as an example — zero, four, eight, four, five, six, seven, eight. When I compare these two vectors — which happens in a single operation — what I end up with, writing it as const u8, treating it as bytes just for this demo (it actually happens in bits, because the result is a binary mask), is: zero there, zero there, zero there, one there because it's a match, then one, one, one, one. So I've got eight binary bits, and my first match is that first one bit — the first match is what I'm looking for. So when I'm iterating eight at a time and I take i plus the count of those zero bits before the first match, my offset into the array comes out as i plus three. Make sense? Clay, thoughts? I don't think it hurt too much. No, it makes sense. And as I said, where this gets so tricky is that you can't operate on these things directly — you're never operating on your vectors as single elements. That's why you've got to start doing things like using instructions to count your zeros, et cetera. But the simple fact is, when you look at code like this, you've optimized searching an array for a particular value from 10 million or 100 million iterations down to whatever that is divided by eight.
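Here's the search assembled the same way — a sketch under the same assumptions (AVX-512F plus AVX-512VL for the 256-bit compare mask, so -mavx512f -mavx512vl); the tail for the last elements % 8 entries is left out, as above:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Gather 8 markers at a time, compare all 8 against the needle in one
 * instruction, and use the mask plus __builtin_ctz for the exact index. */
static inline int64_t table_search(const void *start, size_t elements,
                                   const uint64_t *off_array, uint32_t needle)
{
    const __m256i needle_vec = _mm256_set1_epi32((int32_t)needle);
    const __m512i off_vec = _mm512_loadu_si512(off_array);
    const size_t loop_inc = 8 * sizeof(struct demo);  /* sizeof hoisted out */
    const char *p = start;

    for (size_t i = 0; i + 8 <= elements; i += 8, p += loop_inc) {
        /* Pull 8 markers from memory into one vector. */
        __m256i elem_vec = _mm512_i64gather_epi32(off_vec, p, 1);
        /* Compare all 8 lanes at once; one mask bit per matching lane. */
        __mmask8 result = _mm256_cmpeq_epi32_mask(elem_vec, needle_vec);
        uint32_t mask_result = (uint32_t)result;
        if (mask_result != 0)
            /* ctz counts the zero bits below the first set bit. */
            return (int64_t)i + __builtin_ctz(mask_result);
    }
    return -1;   /* not found (scalar tail handling omitted) */
}
```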
That's a game changer in speed. It's an absolute game changer. And I'm going to get a little bit more fancy, believe it or not, because this is where a lot of optimization starts. This is effectively loop unrolling: you're taking a loop and compacting it — cutting the number of iterations and the number of instructions, doing less work within the loop. Right? Once you... no, no, go ahead. So, for this to make a lot more sense to us, what would be a typical use case for such an operation? Okay. Take something like an ARP table in networking — let me pull up some actual real code here. If I'm inserting an ARP entry into an ARP table, because I've just had an ARP packet, I need to get done with it as fast as possible and go back to processing normal packets. So I'd use that search to find the first free entry in my ARP table really quickly. If I need to find an entry in the ARP table, and I need to do it really quickly, I'd do a very, very similar thing. That's what you typically do in really high-speed, real-time operations. That makes sense. (I just want to paste this link for somebody who's asking me for it.) So it's used for really, really high-speed matching in tables. You could ask why I'm not hashing, because very often this type of high-speed table matching is done with hashes. The problem is that if your table is only 65,000 entries wide, hashing takes time. Hashing can be fairly quick, but it's not free — it uses more processing — and on certain tables this approach is going to be faster. The other thing about hashing is collisions: collisions in a hash table mean you've got to start creating linked lists at your hash offsets, and that typically means you're into a world of dynamic memory allocation. And you really don't want to be in the world of dynamic memory allocation when you're trying to write really high-speed code, because allocating memory on the heap is extremely slow. Extremely slow. You want a table that sits on the stack, or on the heap allocated only once, where you're not dealing with dynamic memory allocation — which, as Clay points out in the chat, can also lead to memory leaks and all sorts of other security issues. If you've got one static block of memory that you're always operating on and just searching, you've got fewer security issues and fewer leaks as well. So that's one example. But we can take this a step further. Imagine we were doing packet matching. If you look at your standard IP packet, you've got an Ethernet header, an IPv4 header, then a protocol header, and then data. And you want to match against the entire header stack in one go. I don't want to have to say: what's my source MAC? Okay, source MAC passed, so what's my source address? What's my destination address? What's my source port? What's my destination port? — a whole bunch of comparisons. Because that is slow: a lot of operations, a lot of fields to compare. So what if I do something like this. Let me include some stuff here, just so I've got easy structures. Say I have a buffer, and we'll make that buffer 64 bytes wide.
So, 64 bytes — the top part of the packet. Which, by the way, at 512 bits is enough to contain those headers, provided you don't have any extra option headers, et cetera. For this demo we'll assume you don't; you do other things to deal with it if you do. You're taking away my first question — how are you going to deal with IPv4, since those headers are variable length? How are you dealing with the options? Dang it, you took that away from me. I'll get to that, I'll get to that — there's a trick to it. But if we say my buffer is going to contain all my packet headers, you can effectively align the header structures over that buffer. Something like this: struct rte_ether_hdr *ether equals buffer — the Ethernet header at the start of the buffer. Then struct rte_ipv4_hdr *ip equals (void *)(ether + 1) — increment over the Ethernet header. Then the TCP header: (void *)(ip + 1). We'll assume this is a TCP packet; normally you'd check whether it's TCP or UDP or whatever and align the right structure there. This gives us, effectively, most of our packet mapped by structures over this array. And I can set fields through them. Let's say I want to set a MAC address of all Fs: if I do a memset of ether's address bytes, with six as the length — actually, let's make it the destination address, because the destination address comes first — then the first six bytes of buffer would be equal to 0xff. So you get a whole bunch of fields all mapped by those structures. Now, once you've got your actual packet and you want to match certain elements of it, you create a mask — a binary mask — of what you want to match, using the same headers. Let's say, for example, the only thing I want to match in my IP header is the source address, and it's a 27-bit prefix within the network. So for the mask: we initialize the whole buffer to zero, and ip source address — which is what we want to match — is equal to... let me just check something, I want to make sure I've got my endianness right before I lie to you guys. Your mask would be something along the lines of ~(u32)0 shifted left by 32 minus 27 — and I'm going to byte-swap it, because this is a little-endian machine, as most Intel boxes are. That creates a binary mask with the 27 bits set, and everything else in that whole buffer stays zero. Now, that buffer is effectively a mask that will only match the one structure element I actually want to match. And I take my packet and do something along the lines of — I'm going to write this basically as pseudo-code — this here will be the loaded packet: _mm512_loadu. You'd generally load this as 32-bit lanes, but it doesn't really matter what the lane width is, because you're masking the whole damn thing and looking for an exact match. And we load packet — our original incoming packet in this case. Then the mask vector: _mm512_loadu_epi32 of buffer. So I've now got my packet and I've got my mask. I'll also have another vector of what I actually want to match — just the match values.
So, _mm512 — this one would actually be called my match vector: _mm512_loadu_epi32 of match. Match is basically my actual rule that I want to match — I haven't defined it here, but it doesn't really matter. Packet is the incoming packet; mask is the mask. I can then do something along the lines of — let me just find this, one second, so I don't lie to you guys — a __mmask16. Remember, because I'm working in 32s now, and sixteen of them, because I'm working across the entire 512 bits, I need a 16-bit mask to match this: _mm512_cmpeq_epi32_mask. But before I do that: a result vector — another 512-bit vector — result_vec equals _mm512_and_epi32, ANDing our packet vector with our mask vector. That does a logical AND of the mask against the packet. What that effectively does is eliminate everything in the packet that I didn't care about: it zeros it out and leaves only what I needed to match. At that point, I compare the match vector — the rule I actually wanted to match — with the result vector, and I can say: if _cvtmask16_u32 — converting the mask to something I can actually read — equals 0xFFFF, the full 16 bits set, return match. (It's giving me grief because packet vector isn't defined, et cetera, but you get the idea.) Basically, what that does is load the whole damn packet header structure into one vector, do a logical binary AND against a predefined mask, and check whether all the bits in the compare mask come out set, which means everything matched nicely. If it does, I've matched the rule. If it doesn't, I haven't. That allows you to process packets by comparing multiple fields across the whole header structure in effectively a single operation, instead of doing 20 different comparisons: if this, if this, if this, if this. That is what allows me to process 200 million packets a second through a firewall against a rule set 500 rules long. Does that make sense?
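Pulling those pieces into one hedged sketch — plain byte buffers stand in for the DPDK header structs here, and the source-address offset is an illustrative assumption rather than the exact on-screen code:

```c
#include <immintrin.h>   /* AVX-512F; compile with -mavx512f */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl for the little-endian byte swap */

/* Match a 64-byte header block against one rule in a single compare. */
static inline int match_packet(const uint8_t packet[64],
                               const uint8_t mask[64],
                               const uint8_t rule[64])
{
    __m512i pkt_vec   = _mm512_loadu_si512(packet);
    __m512i mask_vec  = _mm512_loadu_si512(mask);
    __m512i match_vec = _mm512_loadu_si512(rule);

    /* AND away every field the rule doesn't care about... */
    __m512i result_vec = _mm512_and_epi32(pkt_vec, mask_vec);

    /* ...then compare all sixteen 32-bit lanes in one instruction.
     * All 16 mask bits set means the entire header stack matched. */
    __mmask16 m = _mm512_cmpeq_epi32_mask(result_vec, match_vec);
    return _cvtmask16_u32(m) == 0xFFFFu;
}

/* Building the /27 source-address mask as described: zero everything,
 * set only the 27 prefix bits, byte-swapped for a little-endian box.
 * SRC_ADDR_OFF (Ethernet 14 bytes + IPv4 src at 12) is hypothetical. */
#define SRC_ADDR_OFF 26
static void build_mask(uint8_t mask[64])
{
    memset(mask, 0, 64);
    uint32_t m = htonl(~(uint32_t)0 << (32 - 27));
    memcpy(mask + SRC_ADDR_OFF, &m, sizeof(m));
}
```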
Can you explain the rule set part — how does matching against a rule set actually go? Sorry, again, this isn't something we've worked with much. Okay, so what I'm going to do is find a packet dump quickly. While you're doing that, I can talk about firewalls. You're familiar with what a firewall is — it does packet filtering; you have one in your home router, or a corporate or enterprise firewall? Yeah. Yep. So what a firewall does is evaluate traffic based on the headers: the source IP address where the packet came from, the destination address where the packet's going, the protocol, and the ports — the source port and the destination port, which technically sit in the TCP or UDP header. So I have a question about that. If you're talking about the 500 rules, there still has to be some sort of looping involved, isn't there? You'd do one match per rule, right? Yes — so, this is Wireshark you're looking at, and it's a dump of a test packet. What you'll see in this test packet: you start over here at your Ethernet — the first six bytes of the packet are the destination MAC address, then I've got a source MAC address. You can see the highlighting happening there. Then I've got my packet version number, I've got my header length, I've got various fields here, right? And these are always, I guess, part of the rule specification? Exactly. So say I want to match just the source address. I zero out everything else in the mask and create a binary mask just for what I want to match. When I take my actual packet and AND it against that mask, I end up with something that should be an exact copy of my rule itself, because of the binary logic behind it. Instead of having to compare God knows how many fields there are in here — a lot of fields — I only have to do one comparison: the result of that AND. Yeah, and in the binary footprint you're dealing with the exact structure and all that. Yes — I'm dealing with a binary footprint, and I'm doing it very, very wide, across 512 bits, despite only having a 64-bit processor. Makes sense. Makes sense. And it's this type of stuff that is key when you want to deal with high-speed packet processing. Now, Clay asked me an interesting question about variable-length headers, and the answer is going to surprise you. This works very well with v4 and v6, except for the exceptions, which you deal with on a kind of slow path. What we typically do with v4 is, for each rule we have, we actually create multiple static masks and matches, offset for the various permutations — there are a limited number of permutations in v4 in particular. For example, your offsets change if you've got an extra VLAN header sitting between your Ethernet header and your protocol. So that's an extra rule definition in memory that I can match against. When the packet comes in, I ask: is this a VLAN packet or isn't it? If not, match against this; if it is, match against that. It just becomes a question of which rule and which mask I load into those vectors as I go. So effectively, you have multiple rules catering for those options. Does that make sense? Yeah — so you're making a decision about which vectors to look at. Yep, makes perfect sense. This is how we get the performance that we do: I've cut my matching down to very, very limited operations. And that's another key optimization principle we work with — cut those operations down. Would you apply this kind of heavy vectorized stuff to everything you're doing? No — only where it makes sense. You're not going to parse your JSON config file, which is a couple of hundred lines long, using 500 lines of vector code, because you load your config once and you don't care if it takes a while. But when you're dealing with something like real-time packet processing, you optimize that section of the code. So your optimizations are chosen when and where they're needed. If you tried to optimize absolutely everything, you'd end up with an unreadable code base that was completely unmaintainable. You optimize based on what really needs it, and that's very important. Obviously, this is really deep optimization. There are other things you can do, and I'm going to show you some of the more simplistic ones here. So let's go back to that — I'm just going to delete some of this — say you've got a bunch of values that you want to test something against. For example, something like this.
And I've got a variable here, just int y equals 10. Okay, we're just declaring y for the sake of it — I could pass it in or whatever. And I do something like: if y equals 10, do x. If y equals 20, do y. If y equals 30, do z. Something like that, right? You could write this as a switch statement, which means it matches one case after another until something matches. And these could be different things that you're testing. You could also do if-then-else, if-then-else, if-then-else, et cetera. The advantage of a switch statement is that you can hit the first match you actually want, break out of it, and not do the rest of the tests — because the tests are expensive. The problem with a switch is that you can only match on a single element. So, in a switch like this, it says: if y is equal to 5, do this, break — stop, I'm not testing any further. If y is equal to 1 — because it wasn't equal to 5 — do this, break, stop testing. It lets you break out. The problem is, if I now also wanted to test z as well, I'd have to either put another switch statement in there or go with traditional if-and-else on both comparisons. What you've got to learn to do, though, in a scenario where you're stacking tests, is to break out of that cycle of tests as early as possible. So look at this: if y is equal to 5 and z is equal to 4, do x; else if, da da da, right? The problem here is that while those chains may seem to stop, they can actually end up continuing, et cetera. What you really want to do is: do x, then jump to below the whole if set. So you insert a goto that jumps right to the end of the function. Something like this — I'm trying to think of a good example. Okay, let's change this to x and y. If x is equal to 5 and y is equal to 4, printf first match. Else if x is equal to 5 and y is equal to 3, printf second match, et cetera, et cetera — if, if, if, if. Now, in this case, because you're using elses, it's probably going to stop at the first one. But where you don't want the elses for various reasons, and you're stacking if statements one after another, you know that once your first if statement up here matches, you do not want to handle the rest. So if I do that, this is a jump straight over the code — it jumps over all of this other stuff. You can use goto to effectively skip over code. Use it with extreme care, though. Avoiding branching with goto like this is probably a mediocre example of where to use it, but effectively your optimization often comes down to: how do I process only the code I really need to, and how do I jump over code I don't need? Now — if you're ever given a project to do for university or whatever, and I learned this way back when I was 16 at school — do not put gotos in there. Your lecturers will have heart failure, because it breaks code structure. It becomes very "I'm jumping here, I'm jumping there, I'm jumping wherever," and it gets quite hard to trace. But you can get immense performance benefits by jumping at the right point to avoid unnecessary instructions. Does that make sense?
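A tiny sketch of that early-exit pattern, with hypothetical values in the same shape as the x/y example above:

```c
#include <stdio.h>

/* Once the first test matches, jump straight past the remaining tests
 * instead of letting control fall through more comparisons. */
void classify(int x, int y)
{
    if (x == 5 && y == 4) {
        printf("first match\n");
        goto done;                 /* skip every other comparison */
    }
    if (x == 5 && y == 3) {
        printf("second match\n");
        goto done;
    }
    if (x == 5) {
        printf("third match\n");
        goto done;
    }
    /* ... more tests would go here ... */
done:
    return;
}
```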
Has anybody got any questions about that? Clay — probably, yes, gotos do tend to scare people because, as I say, they break code structure. And Brian — yes, goto skips code. It stops processing and lets you jump right over things. It's way faster, but it's also dangerous. You don't use goto unless you have good reason and you've really analyzed what happens when you jump over that code. This becomes particularly relevant when you're dealing with something that's doing locking, for example. Let me show you an actual example here — let me just pull up the code for this. If you look at this piece of code over here, what you'll see at the start of this insert-route function is that I'm locking — it's a threaded function. I'm taking a lock, which locks a part of the code so that other threads can't access it at the same time and cause corruption. And you'll see that right before every return, I have an unlock: I unlock down here because I'm returning, and I unlock over there. You cannot unlock something that isn't locked — that'll also crash things, et cetera. Here's the issue with goto: if you were to jump over a required unlock, so it never unlocked, and then you called the function again, you'd end up with a deadlock. Nothing would move — the code would just hang. That's why you've got to be so careful jumping over code. It's very, very effective at making things faster, but use it with extreme care. Does that make sense? Godfrey — you can't goto another function. You can goto a label and call the function directly after the label. Actually, I may have used this elsewhere — let me show you. Here we go. Remember I was showing you vectorization earlier? The routines I showed you relied on having chunks of eight. But what happens if your array ends with seven elements, and you can't do a full vector load because there isn't enough data? What this code does is: if there are 512 bits' worth remaining in the array, load 512 bits; if less, load less — 256 bits, then 128, then 64, et cetera. What you'll see in this if statement is a loop value over there, 512. If my remaining entries are greater than eight, don't even do another comparison down here for anything less — jump back to the top of the if statement and process again. When it gets down to four, because the first test didn't match, it falls down here, and here, and here. I'm using goto to effectively loop back inside an if statement. That's a performance optimization, and a big one. Brian — no, because inlines don't work that way at all. But yes, this can get very messy very quickly, and you use it with care: it's effective, but it's dangerous. And this is why university lecturers just about have heart failure when you try stuff like this.
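To make the locking point concrete, a hedged sketch — insert_route here is a hypothetical stand-in for the on-screen function, using pthreads. Every return path unlocks, which is exactly what a careless goto can jump over:

```c
#include <pthread.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

int insert_route(int key)
{
    pthread_mutex_lock(&table_lock);

    if (key < 0) {
        pthread_mutex_unlock(&table_lock);  /* unlock before EVERY return;
                                             * a goto skipping this line
                                             * leaves the mutex held, and
                                             * the next call deadlocks */
        return -1;
    }

    /* ... do the actual insert ... */

    pthread_mutex_unlock(&table_lock);
    return 0;
}
```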
The other thing, when you start looking at performance: I keep telling people that dynamic memory allocation is expensive. Now, say you're reading data from a stream, and you need to buffer that data and process it as fast as you can — you've got constant data coming in on a buffer, and you're processing it in a thread. This creates a problem, because while you're putting data into that buffer, you're also trying to process it, and you'll end up with corruption — so you need to keep moving that buffer along. What a lot of people do when handling something like that is: as data comes in, allocate some more memory, put the data in there, send it off for processing, and free it once it's processed. So you're constantly allocating, filling, passing, processing, freeing — constant dynamic memory allocation, allocating and freeing over and over. That's slow as all hell. What you could do instead is create a ring buffer. Let me see if I've got an example of this. So, I'm creating a ring buffer here, and effectively all it is, is a linked list. If I go to the definition of the structure: it's got a 4096-byte buffer; it's got a length, which tells me how much is actually in that buffer — having 4096 bytes doesn't mean I use all of them for every packet; and it's got a next pointer to a structure of its own type. It's effectively a linked list, right? The difference is that when I do the setup of this buffer, the final entry points back to the first entry. So I've now created a ring structure, and I pre-allocate it at program start. As data comes in, I take it, dump it into the ring, and hand that ring entry off to the processing node, while telling my reader to read into the next entry in the ring. So I'm never reading into what I'm processing. And because it's a ring, I'm just reusing the same memory over and over again in a cycle — I never allocate any more memory than the number of entries in that ring buffer. That right there is probably a ten-fold increase in performance, because I'm not allocating on the heap. That is slow — the moment you can pull out your runtime memory allocations, you stop killing performance. So you want to pre-allocate and reuse. But again, like anything with optimization, that has dangers, particularly when you're dealing with threads that could be accessing the same memory. That's where you get into things like locking your threads onto ring buffer entries, so that you're never reading and writing the same element at the same time, et cetera. But the point here is: you really don't want to be doing dynamic memory allocation. And if you speak to router vendors, they'll tell you dynamic memory allocation is death.
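A sketch of that ring buffer as described — 4096-byte entries, a length field, and a next pointer whose last link closes the ring; the names are assumptions:

```c
#include <stdint.h>
#include <stdlib.h>

struct ring_entry {
    uint8_t  buf[4096];        /* payload buffer */
    uint32_t len;              /* bytes actually used this pass */
    struct ring_entry *next;   /* next entry; the last points to the first */
};

/* Pre-allocate once at startup; no further allocation on the data path. */
static struct ring_entry *ring_setup(size_t entries)
{
    if (entries == 0) return NULL;
    struct ring_entry *ring = calloc(entries, sizeof(*ring));
    if (!ring) return NULL;

    for (size_t i = 0; i < entries - 1; i++)
        ring[i].next = &ring[i + 1];
    ring[entries - 1].next = &ring[0];   /* close the ring */
    return ring;   /* the reader fills one entry while the processor
                      consumes earlier ones, reusing memory in a cycle */
}
```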
Clay, how many bugs have we seen from memory leaks? You've got to give me a time frame — last week, last year, last decade? Exactly. With this approach, you don't run into those memory leak issues. So, Andrew, in your experience, are there techniques within C/C++ itself to implement effective locking for variables? You can go and find libraries to do it, but you never know how good those libraries are. Built into the language itself? No. I implement all my own locking, purely because I tend to avoid libraries — it's very often faster for me to write the code myself than to analyze what is in a library. And when you're writing code, particularly really performance-critical or secure code, you have a choice: write it yourself, or make sure that for every library you use, you've combed through every single line of its code and you're sure it's optimal and that you're okay with it. It's very often faster to just write it yourself than to analyze an entire library. I guess it depends on the use case too, doesn't it? Yes and no. Keep in mind that there have been some absolute disasters from people using publicly available, common libraries. Clay, would you like to share the story of log4j? Yeah — basically, a function in log4j allowed you to execute code on the local system. It was built into a lookup function that did an unsafe lookup by default. It was a bug, and that library was used, well, everywhere. Just in the network hardware and network management software we had alone, that was probably a good 12 to 15 vulnerabilities we had to address — and that's not even talking about the server infrastructure and all the software running there. I can't give a lot of details, but I can tell you we're talking about touching 100,000 to 150,000 servers. Easily. Yeah — that library was used ubiquitously. It was everywhere. The Department of Defense in the States was using log4j. Nobody was checking what was in that thing until one day somebody found the bug, and when they found it, the whole world started getting hacked. Actually, I'll clarify that: once things started getting hacked, people found old notes from people complaining that this was a bad thing and needed to be fixed. It was worse than that — it was an active exploit. And this is the danger of using somebody else's libraries when you don't actually know what the hell you're running. I will not run them. And here's the other thing I've been seeing. There are a lot of guys out there who discovered this wonderful thing they think is wonderful called ChatGPT — AI code. You're starting to see people create AI-based code and hand it in, or use it in production, et cetera. Much of the time, either they're not reading what it produces or they don't understand what it produces. There was a study done at Stanford in late December — as in this December. They took 100 programmers of different skill levels and told them to write various bits of code. Then they asked ChatGPT to produce the same code. What they found was that the ChatGPT code had something like 77% more security vulnerabilities than the manually written code. They ran the same test against the AI-assisted code writing that exists in GitHub — same thing. Because the AI-generated code was not considering things from a holistic perspective and was producing code with flaws. And this is why I say to every programmer: if you've got code in a project, you'd better be damn sure you know what it does, line for line for line. Because otherwise you've got a problem. And the more you optimize code like this, the less reliance you'll have on those libraries anyway, because most open-source libraries out there are written for very generic use. They're not speed-optimized like this. Very few are. There are exceptions — look at something like DPDK, an Intel-produced network stack that we make heavy use of. I use that because, A, I've read through all of it, all 23 million lines of it; B, it's written for extreme performance; and C, I know all the developers behind it, so I know exactly what they're capable of. But by and large, libraries aren't performance-optimized, and so we don't use libraries. Almost, almost never. It's more work — but man, does it save us in the long run.
Because it saves the work of trying to fix something after it got hacked — and the reputation damage, and the financial damage. So, I have a question; it's not very related to what we've just talked about. I see you coding a lot on Windows, but I'm guessing your stuff ends up on Linux, mostly? Okay, so here's something you're not seeing about what's happening here until I click here: effectively, every time I save a file, it's uploaded to five different Linux boxes. I compile on the Linux boxes, I run on the Linux boxes. I'm simply using an editor to edit files remotely, effectively. Yeah, that makes good sense. And the thing is, most desktop machines are never going to do exactly what we need for this stuff anyway. I'll give you an example. Is that more about hardware differentiators, like ECC and such, or instruction sets? It's core count, but instruction sets as well. So let me find it here — I think that's the right window. This is one of our development servers: cat /proc/cpuinfo. This here, flags, tells you all the instruction sets the CPU supports: avx512f, avx512dq, avx512cd, et cetera. A lot of this you're not going to find in a lot of desktop CPUs, and we're developing stuff that is server-specific. So I can write it on my Windows box, but I have to deploy it here. Why not compile on my desktop? Well, my desktop is a 32-core Threadripper from AMD, and that does not have AVX-512-wide vectors on it yet. It will with the new ones coming out, but it doesn't now. So I can't even run the code I write on my desktop CPU anyway. We simply use the Windows boxes as a development platform to do all the code remotely. And the other thing is, when you start looking at really big boxes — most people don't have desktop machines that look like that. They don't have that kind of memory count. And when you look at even bigger ones: for other stuff we do, we've got dev servers with 256 cores in them. I can't do that on a desktop. Wow. Is it Linux or BSD you guys are running? Most of what we do is under Linux at the moment. But you see, it's tricky when you start referring to Linux, and I'll show you why I say that. This here is a dump of how our hardware ports are currently mapped. What you'll see is that I've got Ethernet controllers — these three up here — and they're bound to something called vfio-pci. That effectively gives my code a binding straight to the PCI bus. At that point, I've taken control of that card, and Linux has no knowledge of it at all. It's not touched by the kernel: packets aren't processed by the kernel, scheduling isn't handled by the kernel, nothing. Why? Because the Linux kernel is developed for general use, and the Linux kernel's IP stack is fucking slow. It's really slow. It's a cluster... yeah. Those stacks are slow, and BSD has the same problem. So what's the best that's out there for networking, then? If you're doing really high-speed packets, you want only the stuff you need in your network stack, and the only way to get that is to effectively hook the hardware yourself and put the packets into a minimized, tailored stack that does exactly what you need. Generally speaking, you'd use DPDK to do that hook and then handle all the packets yourself. And a good example of that — let me just pull something else up here, just finding it — this rather long file is part of our packet processing.
And a good example of that hooking is, let me just pull something else up here. Just finding it here. So this rather long file is part of our packet processing. What happens is a packet comes in off the network card and gets shoved into a ring buffer, which in turn hands it off to this. And the first thing I do is run the whole packet through a packet parser. And there's a couple of other special things we do with this. Because I get an array of packets, the first thing I do is loop through the entire array doing what's called a prefetch. What that does is it tells the CPU: I'm going to be needing this data going forward. Don't leave it all in memory. Pull it all into the CPU cache, and do it asynchronously. So when I start processing those packets, they're all sitting in either my L1 or L2 cache on the CPU, which makes it much, much faster. So I loop through the array of packets twice, the first time to tell it to do an asynchronous load into the CPU cache, the second time to process it. Then I start matching on various things. And I set up an entire parse structure, which tells me what my IP protocols are, what my types are, my v6 bits, my ports, blah, blah, blah. So it basically sets up a very quick way of matching things. Then it goes down to the actual processor. You can see the packet parser is called from the processor. And I start iterating through this parse structure that I've built. And I say, if it's an IPv4 type, then if it's a firewall action, do something with it and break, never going past the action. Otherwise, discard it. Again, when I'm discarding, because of the way DPDK works with large pages, I actually do have to free stuff. But I don't want to be calling free, free, free, free, free. That's slow. So anything I need to discard, I drop into another array, and I handle that in bulk at the end of it. Then I check a resolve if need be, I do X, et cetera, et cetera, et cetera. At the end of it, once I've gone through the whole array, I bulk-free everything that needs to be dropped in one go. And I punt everything that needs to be punted to other functions in one go. Everything else is handled in here. So it's a series of bulk actions, but it's only ever covering the absolute bare minimum that I need to cover. With a Linux stack or a BSD stack, you've got TCP state in there. You've got UDP in there. You've got possibly QUIC in there. You've got so many things that you don't need, and that slows it way down. Yeah. So what's the role of DPDK then? DPDK basically exists to get packets off a network card and hand them to you with some classification attached, to allow for easier classification and parsing of those packets. It also allows you to do some fancy stuff with the sending of packets. And in some cases, you can actually use it to program hardware cards that support it to perform actions on packets without ever taking those packets across the PCI bus. It becomes a way of programming the network card so that traffic never hits the CPU at all. It's done on the network card. We typically don't use it for that, because what we found, with the supply chain issues at the moment, is that if you program it for a specific network card, there's no guarantee you can buy that network card tomorrow. It's a mess. So we do the processing in software because of hardware availability. We got almost the same performance. So it's a bit like what the game engines do with graphics card acceleration. It also accelerates your stuff at the network card level? Well, it's more a case of it simply gives you more access to a lot of stuff.
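The shape of that two-pass loop, as a hedged sketch: struct pkt, keep_pkt(), process_pkt() and free_bulk() here are hypothetical stand-ins (DPDK's real equivalents are things like rte_mbuf, rte_prefetch0() and rte_pktmbuf_free_bulk()), and the prefetch uses the GCC/Clang builtin:

```c
#include <stdio.h>
#include <stdlib.h>

#define BURST_MAX 64   /* typical burst size; an assumption for the sketch */

/* Hypothetical packet record standing in for DPDK's rte_mbuf. */
struct pkt { int wanted; int work; };

static int  keep_pkt(const struct pkt *p) { return p->wanted; } /* stand-in match   */
static void process_pkt(struct pkt *p)    { p->work++; }        /* stand-in handler */

static void free_bulk(struct pkt **pkts, size_t n)  /* one call, not n calls */
{
    for (size_t i = 0; i < n; i++)
        free(pkts[i]);
}

/* Caller guarantees n <= BURST_MAX. */
static void handle_burst(struct pkt **pkts, size_t n)
{
    struct pkt *drops[BURST_MAX];
    size_t ndrop = 0;

    /* Pass 1: tell the CPU to start pulling every packet into cache,
     * asynchronously, before touching any of them. */
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(pkts[i], 0, 3);

    /* Pass 2: by now the data is (mostly) sitting in L1/L2. */
    for (size_t i = 0; i < n; i++) {
        if (keep_pkt(pkts[i]))
            process_pkt(pkts[i]);
        else
            drops[ndrop++] = pkts[i];   /* defer: no free() in the hot loop */
    }

    if (ndrop)
        free_bulk(drops, ndrop);        /* one bulk action at the end */
}

int main(void)
{
    struct pkt *burst[8];
    for (size_t i = 0; i < 8; i++) {
        burst[i] = malloc(sizeof *burst[i]);
        burst[i]->wanted = (int)(i % 2);
        burst[i]->work = 0;
    }
    handle_burst(burst, 8);   /* odd-indexed packets kept (and leaked: sketch) */
    return 0;
}
```

The point is the structure: one cheap pass that only issues prefetches, one pass that does the work, and a single bulk free instead of a free() per discarded packet.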
The problem with DPDK is that it is one of the most complex pieces of software that you will ever use, even in library form. Its APIs are deep. Initializing it is a nightmare. And to give you an idea of how much of a nightmare, when I init DPDK: this entire file, 993 lines of code, every function in here, is reserved just for initializing the system. That's before it processes a single packet. Wow. It's crazy. I guess it's worth the effort, right? Especially for network companies like Liquid and all that. Oh yeah. And here's the funny thing. If you look at the vendors that are selling you software-based firewalls, Juniper, Cisco, et cetera, most of them are using DPDK. Most of them are using the DPDK back end. So basically it allows you to become a lot more optimized on a Linux or something like that? Yeah, it allows you to effectively bypass the kernel. It allows you to access stuff from userland that you normally wouldn't be able to. It's crazy. But as I said, I mean, as you can see, this is not a simple thing to use. Yeah, it's crazy, the full DPDK API. It's not the Apache server. Well, you're talking 23 million lines of code in DPDK alone. If you look at, let me pull something else up here. The one good thing about DPDK is that its documentation is amazing. Very well documented. So these are all the various sub-APIs that they have here. They've got ethernet adapters. They've got memory manipulation. They've got various layers. They've got CPU, et cetera. But to give you an idea of what I mean when I say their documentation is good: those are all their structures, all their macros. And then we get to, okay, those are their enumerators. Then the functions, and this is just in the one API out of all of them. And they've got detailed descriptions of all of them. But it's a huge amount of code, and it takes you a long, long time to really find your way around it. It's also written by some of the world's smartest developers. These guys know what they're doing. They are bloody good at what they do. But don't think that just because you've got DPDK, you'll suddenly be able to do things, because you've still got to know how to interact with DPDK. And once you've got those packets, you've still got to write the optimized code to handle them, process them, route them, et cetera, et cetera. And that's a whole other story. DPDK just gives you the first layer.
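For scale, the very first step of that init, stripped down to the two EAL calls, looks something like this. A hedged sketch, nowhere near the real 993-line bring-up, which also has to set up memory pools, ports and queues before a packet moves:

```c
#include <stdio.h>
#include <rte_eal.h>

int main(int argc, char **argv)
{
    /* EAL consumes its own command-line arguments first
     * (core mask, hugepage options, and so on). */
    int ret = rte_eal_init(argc, argv);
    if (ret < 0) {
        fprintf(stderr, "EAL init failed\n");
        return 1;
    }
    argc -= ret;   /* whatever remains belongs to the application */
    argv += ret;

    /* ... mempool, port and queue setup would all go here ... */

    rte_eal_cleanup();   /* hand back hugepages etc. on exit */
    return 0;
}
```

Everything interesting hides in that middle comment; this compiles against DPDK but does no useful work.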
I see that. So what is the reason, like, every provider sort of writes their own customized stack? Quite simply, in our view, if you look at a lot of the lower-end devices, you've got a choice out there at the moment. You can go buy a MikroTik device. Here's your problem. A, MikroTik hardware is... well, how do I put this? It doesn't scale, and it gets really dodgy once you try and push it past certain limits. It starts falling apart. Oh, true. Yeah, definitely. And the quality of software and security on those boxes has led to some of the largest DDoS attacks, against search engines and the people behind Cloudflare. Yeah, basically, you get what you pay for. And the problem is... they're a menace. You're buying cheap garbage. No, the way people rely on the MikroTik boxes is crazy. You buy cheap garbage, and eventually your cheap garbage is going to bite you. That's thing number one. Thing number two is that now you want a quality box, but you want a modern feature set on it. So you want the latest protocols. You want the telemetry. You want the new segment routing, et cetera. The only boxes the bigger vendors are going to put that functionality into are boxes that are full of 400 gig interfaces, et cetera, which are completely out of the price range of mere mortals, right? Your SMEs are not going to be able to afford that stuff. Yeah. So our view in what we are creating is: create a device that is far lower power, that actually meets the speeds and the throughputs that are actually needed on the continent, not the 400 gig that AT&T is using, but the two, three, four gig that somebody else is using, maybe a 10 gig here, et cetera, lower volume. And then add the feature set that people want. Not all the features, because there's so much garbage out there. And create a product that is tailored for the African market space, that has the right price point, but isn't a MikroTik that's going to fall over and be missing 90% of the features. And MikroTik right now is shipping $500 million of hardware onto the continent every year. Guess what? That's a market that I'm quite happy to take. Quite happy. But as I said, this type of stuff, it's not a simple game. But the other problem is that the developers on the continent nowadays, and yeah, I'm going to rant against academia again, academia is churning out a bunch of people who work at such an abstracted layer that they have no concept of how to do any of this, because they don't understand the way the machines work. They don't understand the low-level pointers. They don't understand the optimizations. And so finding development resources on the continent is extremely difficult. And that blame, you know, that blame I do put at the feet of academia. And I guess for you guys, you really just understand TCP/IP on a very, very intimate level. It's like a whole science. I mean, you can go to school for four years just for that. Which adds to the value of what you're trying to achieve yourself. Clay, how would you explain my kind of knowledge of protocols and the rest of it? Obsessive? No, I mean, you know it inside and out. But one of my other roles outside of Liquid is that I serve as an area director for the Internet Engineering Task Force. What that means is that any new protocol that the vendors want to standardize requires me to ballot on it and actually allow it to happen. And I don't ballot on things that I haven't gone through, studied, and understood. At the moment, I'm processing about 700 pages of new standards documents every two weeks. Wow. And I have to know that stuff inside out and back to front. So I would say that when it comes to network protocol stacks, yeah, there are probably fewer than a hundred people in the world that understand network protocol stacks the way that I am forced to because of the role. Wow. It's crazy, it's very crazy. And I've been in IT now for 20 years myself, and I don't have the depth that Andrew has either. Wow. It's just a function of the job. It's not because I'm smart. It's because I spend 18 hours a day doing it. Yeah, now talk to me about DDoS and I can bore you all day with that. Yeah, though I still think I'd beat you there; I wrote the code in Smurf. Yes, it was me. Yeah, that was impressive. Cisco didn't love you, but we all did. Cisco didn't love me for many things. The first ever exploit on the PIX firewall, the one that hit their state tables? They didn't love me for that either. But yeah, that's the other thing on it: there is a massively neglected field of security.
And Clay can tell you whether he agrees with me on this, but you have a lot of people who are looking at security from an application layer, at what is secure in the applications. What is not being looked at nearly as closely, or even exploited nearly as much yet, is what's happening in the base underlying protocols that are carrying the packets. It's a hugely neglected field in security. But here's the thing. If you can't write code that is optimized like this, that can handle those packets, to test those things, you can't go into that side of security, because you won't see it until you test it. And the vendors, because they all run closed source, you don't know what their stuff is doing. So the only way to test the protocols is to implement them yourself and then throw tests at your own implementations. And if you're going to do that, you need the optimized code. And this is why I teach people how to write optimized code: because it's necessary. But anybody else got thoughts and questions? I think we're going to leave it there for tonight. There are plenty more optimizations I can take you guys through at a later date, but I think I've given you enough for tonight to get started. Any other questions, et cetera, before we break? So where's the recording going to be at? I will stick it up so you can have that review again. I'll stick it up on one of my servers. It's not a problem. I'm not exactly short of bandwidth to host it, or servers. Yeah, that's true as well. But just to make people jealous, I have to leave you guys with something. And this is probably going to be really slow, because it's running directly from my desktop rather than one of my servers downstairs in the garage. But this is always fun. Yeah, my desktop's running kind of slow. I would guess your computer's the limitation, not your network. Wow. Yeah, that's part of it. Is that wired or wireless? No, that's fiber, wired. Oh no, let me see if I can share something else. Meanwhile, my lab fits in the palm of my hand. So, scale, right? If I share this: there's a magically powerful box that's sitting downstairs, and that's more like it. And this is wired as well. Yeah, what can I say? It's like the whole internet is right in your laptop or computer or something. Well, that, and the fact that what you're seeing is backed by the half a petabyte of disk arrays sitting in my garage. To keep stuff flying, just in case. Wait, you haven't gotten to 400 gig yet? How many 100 gig links do you have from your house? At the moment, four. Four 100 gig links and you're only pulling five gigabits per second down? That's because the Netflix servers that run false.com kind of saturate it. Whoops. But Clay, I am about to take those things and change the optics to go to four by 400. 1.6 terabit. Nice. Why not? I guess at this rate, your CPU is what's slow. It's not the CPU. It's the PCI bus. The problem is you can saturate it. And this is the other thing: a lot of people ask me why we don't do a lot of GPU offloading. The problem is that the PCI bus, on a PCIe 3 bus, which is still fairly common (you're getting more and more PCIe 4, but PCIe 3 is still what's most common out there), has a limit of around 80 gigabit a second in each direction. And once you hit that limit, if you're passing that much over the PCI bus, your PCI bus becomes your bottleneck. And then you've got a real problem.
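A rough back-of-envelope on that figure, hedged, since the exact usable number depends on payload sizes: PCIe 3.0 runs 8 GT/s per lane with 128b/130b encoding, so a x16 slot gives

$$ 16 \times 8\,\mathrm{GT/s} \times \tfrac{128}{130} \approx 126\ \mathrm{Gbit/s} $$

raw per direction, and once TLP and DLLP protocol overhead is subtracted, the usable payload rate lands well below that, which is the ballpark behind that roughly 80 gigabit working limit.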
And so you typically don't want to start shoving stuff across your PCI bus that you don't have to, because you can bottleneck on it. This is why, even if you look at a lot of the vendors. I mean, Clay, it wasn't that long ago that most of those boxes were running what, 400 gig per ASIC, I think? I know it depends on which box you're talking about, but yeah, it was four by 100. And we're not talking long ago. We're talking, you know, the last couple of years. It's getting a lot better. Basically pre-pandemic versus post-pandemic. Yeah, you know, I think in the middle of the pandemic everybody had to sit at home, a bunch of people got bored and decided to design some really stupid fast ASICs. They have a cost, a big cost. They're about to come out with 1.6 terabit optics. I don't want to know what that's going to cost. I can tell you now that I won't be able to afford it. Oh, I don't know, dude. 1.6 terabit. Do you know the original price of a 10 gig optic? It used to run you $10,000 an optic. Now they cost $30. So I'll wait, thanks. It's actually quite fun. I've got a bunch of optics on my desk here. So, is my camera on? If you look at that box of optics there: if I'd bought this a year and a half ago, I'd be holding probably $8,000 in my hand. Now I'm holding $800 in my hand. You never want to buy cutting-edge optics. Man, you will pay. And they come down fast. Yeah. Unless you really, really have to. And by really, really have to, you mean... you're Microsoft or Apple. Someone else is paying for it. Exactly. But yeah, anybody else got any questions? John, Godfrey, Charlie, Brian, before we break? Yes, yes, yes. Hello. Okay. First of all, I'd like to thank you for such an awesome session. You know, this is my second session with you, at such times in the night, and I admire it. I mean, it's crazy. Yeah. So I wanted to know what kind of advice you would give someone who is, let's say, around six years into IT, and they are really interested in the kind of stuff that you are doing, and would love to... Yes. Okay. So what I will say about anybody, particularly in the programming field, and I think Andrew can talk to this as well, because I've been taking him back through this as well, is that what you're taught normally in academia at the moment is about abstraction layers. It's designed so that you can walk out and produce very generic stuff for enterprise clients, et cetera. It does not teach you actual coding. It does not teach you to understand machines. It does not teach you the fundamentals. And it's the same in the networking field. You can go and get vendor cert after vendor cert. But vendor certifications are not designed to teach you networking. They're designed to teach you vendor syntax. I've met CCIEs who, if I take them off a Cisco router, do not know how to do a damn thing, because they've got no understanding of the networking beneath that syntax. Where am I going with this? If you want to develop in this field, lose the abstractions, go back to basics, and start chasing the fundamentals: how does this work? In programming, that comes down to: how does memory work on a PC? How does a CPU work? What are the layouts? All of this information is out there. I'll point out that I don't have a degree. I didn't finish high school. I left when I was 16. I'd had enough. And back then, I didn't even have Google. I had Usenet and Gopher and funny things like that.
Um, stuff that probably predates even Clay. Hey, Clay, Gopher: have you ever used it? Not in any professional sense. Probably the closest I ever got was just the BBSs. What about Archie and Veronica? Nope. I can tell you that now. The thing is, nowadays you've got it easy, because this information is available. I always tell people that C as a language is a very good place to start. Not because I love C, but because you can't work in C effectively without those fundamentals. It forces you to go back and understand how memory works, what the heap is, what the stack is, how you allocate, how you deallocate, the risks involved in threading, all of that stuff. Because it is so low-level, you're forced to get to know that. So I say to people: start with that in the programming world. Go back to C, and not C++. Please, C++ is still more abstractions, more complexity, and is not going to take you back to the level that C will. So start with C, and every day, give yourself a challenge. You pick something that you do not know how to do. There's no point in picking a challenge you already know how to do. And then you sit there and you chase it until you get it right. You Google, you ask, you chase, but no one's going to hand this to you. You have to set those challenges, and then you've got to spend the time and the effort, and man, it takes time, and man, it takes effort. But little by little, you'll be able to take on those challenges more and more and more, and start putting it together and gaining that experience with an understanding of what sits beneath it. I still do that to this day. I set myself challenges I do not know how to do, because it keeps me developing. But it all starts with going back to the basics. The basics that no university, anyway, is going to teach you these days, because they want to produce a mass of graduates that can go out there and produce Python apps with no understanding of how to actually program. You know, there's a song that said video killed the radio star. Well, guess what? Python killed the coding star, because it made sure that people didn't need to have a clue. I personally feel attacked by this, but it's not wrong. I mean, the other thing too, I'll just say in general, and this is where I'm absolutely envious of Andrew, and actually envious of all of you if you've done any serious level of development: those skills do translate out to other things. Maybe you're doing the firewall stuff with Andrew for one day, but if you go into any other job or any other company, you will find uses for your code and your skills. The whole joke about automating a person's job away with a tiny shell script, that shit is true. And I've been on both the giving and the cleaning-up end of that. If I'm hiring today, I will not hire a network engineer that can't program. He's less efficient, because the guy that can program doesn't have to log into 5,000 routers to go and do them by hand. The guy who can code can write one piece of code. And in an ideal world in the networking profession, I would rather have 10 engineers that are sitting reading Facebook all day or sitting on WhatsApp. Basically, I want those engineers doing nothing. Because if they are doing anything, something's broken. Your engineers should be there only for when something goes wrong. They're an insurance policy. That doesn't always apply for design engineers, where you're expanding and designing; that's a different field. But your operational guys, they shouldn't be busy.
If they're busy, you've got a problem. And I learned this lesson when I built the TENET network in South Africa. The entire company when I was there was 10 people, that's it, from the CEO all the way down to the secretary. That company ran a network with five-nines uptime that serviced 250 campuses with three quarters of a million students behind it. And it ran five-nines uptime every single year I was there, with barely any engineering staff. Why? Because we took the thing down to a fine art, where everything was done by code, by monitoring, by everything else. Tons of automation, tons of things that monitored it for us, et cetera. The only time you ever touched it was if something really broke. But to do that, you've got to be able to code. You can't do that manually. Not possible. Quick question: were they using C or Python? Well, I've led the development at most places that I've worked, and Python didn't exist when I started writing code. And while I've written a fair amount of Python for various things, I've been doing C now for, what, it would be 31 years. And I typically teach people C. But then again, I code in 17 languages, and for this type of stuff, it's all C. I mean, the network automation: mostly Golang, actually. We did a lot of Golang, because when you're dealing with a hell of a lot of devices, you want to be able to thread and do stuff in parallel, and Python can't thread for shit. Python threading is non-existent, because it uses something called global locking, the global interpreter lock. Also, a lot of Python automation stuff relies heavily on shit products like Ansible, which I have a hatred for, personally. It's good to get started in automation, but it's not... I would call it semi-automation. You still have to make a decision whether you're going to fire off that script or that plan. And if you need to do a lot of customizations, let's say you have one router, for example, that peers with, you know, two or three hundred different service providers, that's going to be some pretty heavy customization. And the Ansible playbooks and such just do not give you the flexibility that you need for that. And by the way, guys, I'll say this. I will let Clay decide whether or not he reveals who he used to work for and what he did. But when Clay says stuff like that, you're talking about a guy who worked on infrastructure that scales in ways no one else has ever even gotten close to. Even gotten close. So when you talk about automation of really large networks and large systems, Clay's a good guy to listen to. Clay, you are free to elaborate if you wish. If you Google my name, you'll find out who I worked for. And if you've been in the WhatsApp channel, you know that Elon is my most favorite person in the world. So I'm sure you can figure out where I worked. But you know, going back to that: right before I no longer worked at said bluebird, I was working on a Python Flask application to do all of our ACL ingestion, so that we could build templates off that. And prior to that, a batch config push that I had to write for certain networks, in particular out-of-band networks. Out-of-band is a special type of network that you use to get into the back door of systems. That's where the consoles are for your actual routers and switches and such.
So when everything goes completely pear-shaped, the world's on fire, and you can't get into your production network through your main in-band management path, you hop onto the out-of-band network. Unless you're Facebook and need an angle grinder, because you screwed up your out-of-band network, and your authentication into your cages was based on your out-of-band network, which had broken, and you couldn't open the cage. Yes. And so, no problem. Now let's say I need to allow 50 people to have access to that out-of-band network. Well, you're assuming that you're going to be using this because something's gone terribly wrong. You can't talk to your authentication databases. You can't use RADIUS to check against usernames. It needs to be self-sufficient on the box. We need to identify each user on the box. So why not use something like SSH public key authentication? We can store their public key inside of an LDAP attribute. And I have a list of all the network devices in the source of truth. So I'll write Python code that will pull all the correct users out of LDAP, grab all their SSH keys, and then transpose them into config based on platform, and then push those configs to the out-of-band network devices, so users can get in and out of those devices. Fun stuff like that. And you see, the trick here comes in with what Clay is describing. I mean, Clay, you probably didn't use massive threading and parallel operations in doing that? Well, that was actually interesting with the Python stuff. So I agree with you when we're talking about billions of devices or billions of packets: you're right, don't even think about Python there. But I could push that stuff out in about three minutes across all the devices, using a library called Nornir, which does all the parallel stuff for you automatically. It's probably just using Cython in the back end. Yes, it is. Yeah, you see, there's the trick again. It comes down to: it's not Python doing the stuff. Yeah, they've offloaded it to C to avoid the global locking. And the problem with offloading to C like that, because we used to do it in Golang as well, is that you take a tremendous hit when you start moving from one language to another like that, because it has to do all sorts of things to make that work. It's the same in Golang: you can write Go that uses DPDK. But you have to ensure that when you call C, you stay in C for long enough to be worth the hit that you take from that call, because that call is extremely slow, because of all the work it has to do to make that happen. So you will only call into C like that in long-running functions. And even then, we couldn't get the performance we needed. We got close, but not quite. So it's quite expensive, sort of, to be in between languages. I think Lua is supposed to be the best at doing that. The best at doing that is actually Rust. Yeah. A lot of people ask me why we don't use Rust. The answer is really simple. Rust is a really nice language, but when you start tweaking, et cetera, Rust can be extremely restrictive, because Rust is built on a philosophy of: everybody's an idiot and is going to screw up, so we're going to make damn sure that you can't screw up. And no, not C++. C++ is just awful. Sorry, Brian. You will not find many people more opposed to C++ than myself. It's horrible for so many reasons.
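To make the fan-out idea concrete in C terms, since that's the language of the evening: a hedged sketch with POSIX threads, one worker per device. push_config() and the device names are hypothetical stand-ins, and a real tool would cap concurrency with a worker pool rather than one thread per box:

```c
#include <pthread.h>
#include <stdio.h>

#define NDEV 8   /* small device count, just for the sketch */

typedef struct { char host[32]; } device_t;

/* Hypothetical: ssh in, apply the rendered config, report status. */
static void *push_config(void *arg)
{
    const device_t *d = arg;
    printf("pushing config to %s\n", d->host);
    return NULL;
}

int main(void)
{
    device_t  devs[NDEV];
    pthread_t tids[NDEV];

    for (int i = 0; i < NDEV; i++) {
        snprintf(devs[i].host, sizeof devs[i].host, "router%02d", i);
        pthread_create(&tids[i], NULL, push_config, &devs[i]);  /* fan out */
    }
    for (int i = 0; i < NDEV; i++)
        pthread_join(tids[i], NULL);                            /* fan in */
    return 0;
}
```

Build with -pthread. All the pushes run in parallel, and main() only returns once every device has been handled, which is the same fan-out shape Clay described getting from Nornir.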
But yeah, my advice to anybody, be it in networking, be it in programming, be it in anything: forget the abstractions and go back to the basics. Go back to the fundamentals. To put this another way, look at how a hacker breaks into a computer. A hacker gives a computer input in order to get it to give him output that the programmer didn't originally intend. That is the same philosophy that works in social engineering and in manipulation of humans. When you are socially engineering someone to get them to do something, or scamming someone, or conning someone, or even just interacting with them, what are you doing? Giving them input to solicit a particular output. And it's about giving the right input to get the right output. And when I say right, I mean what you want, right? The trick with that, and with computing, is to be able to give a computer input to get the output that you want. You cannot do that if you do not understand how the computer is going to interpret and use that which you are giving it. You have to go back to understanding how the computer actually works. How the memory works. How the CPU functions. What are the registers? What is the heap? What is the stack? You've got to go back to the basics. Same with networking. You can go and deploy an entire network using vendor syntax commands. But you'll know whether or not you actually knew what you were doing the day it goes wrong and you have to debug it. You know how many engineers I've seen that have deployed these huge networks, and they worked on day one, and then they broke, and they couldn't debug them, because they didn't understand the underlying protocols? Go back to the basics. Strip away the abstraction. That's the best advice I can give anybody. Yep, I can agree with that as well. And it's shown even recently with my previous employment too, although I won't talk about it because we're on a recording. Are we? Yeah, I can turn off the recording at this point. Yeah.