OK, I'd like to speak about some high-performance and scaling techniques that we've been doing in Golang. A small breakdown of the presentation: a little intro about myself and MinIO, then I'd like to speak a little bit in general about the Golang or Plan 9 assembly capabilities that Go offers, then discuss two projects that we've done with this, Blake2b acceleration and SHA256 acceleration, and finally some slides about some distributed syncing work that we've been doing. One slide about myself: I've mostly been doing software development in what's called the medical imaging space, so that relates to CT and MR images from scanners, both 2D and 3D, as well as using GPU techniques, and the last few years I've been involved in cloud computing and I'm now with MinIO. MinIO, as maybe some of you know, is an Amazon S3-compatible object storage server. It is written in Golang under the Apache 2.0 license. The company was founded by Anand Babu Periasamy, who maybe some of you know as one of the people behind GlusterFS, which is itself a distributed file system and is now actually part of Red Hat. 
If you look at MinIO itself, there's really one project, one binary, but you can run it in three different flavors. The simplest version is to simply run the MinIO server with a single directory, and all your objects will be stored underneath that directory structure. There are also two other versions that use a technique called erasure coding: all the objects are split up into data chunks and parity chunks and spread over multiple disks or multiple servers. The XL backend splits the data over multiple disks, going from a minimum of four disks up to a maximum of 16 disks. It also uses a technique called bitrot protection: when the data is read off the disk again, which can be many years after it was stored, a hash is computed to detect any bitrot changes, and if that happens we have the parity blocks to reconstruct the original data. As you can imagine, this bitrot protection is a very frequent operation: for any data written to disk, hashes have to be computed, and likewise when data is read off the disk, the hash needs to be computed before we can return any information to the client. So that's a pretty important operation for us, and that's why we really looked at how to get the maximum performance in terms of hashing speed while still having a solid, proven hashing technique. That's how our Blake2b project came about, but before I go into that, let me explain a little bit about the Golang or Plan 9 assembly capabilities. 
This is an integrated, or integral, part of the whole Go toolchain, and it's kind of like a pseudo-assembly language, in the sense that it's not the exact assembly you would write for, say, an Intel platform or an ARM platform. There are generalized instructions for things like a move, an add and a compare, and these instructions are then translated to the actual instructions that run on the underlying hardware platform. Most of the time it's fairly logical how this pseudo-assembly language translates into the underlying assembly running on the CPU itself; sometimes it isn't, so there's a little bit of trial and error. Some aspects of the underlying architecture also shine through: on ARM you have conditional instructions and you can use those, whereas on Intel that will not work. Also, data flows from left to right, so if you do a move from R1 to R2, it's actually R2 that becomes R1 (in native ARM assembly it's the other way around), so there are some things you have to be aware of. There are also some pseudo-registers, not actual registers but simulated ones, for the frame pointer (FP), stack pointer (SP) and program counter (PC). And not all the instructions that you would want to write are available; however, it is possible to skip the mnemonics and emit the actual opcodes that go into the assembly, so if an instruction is not available you can resort to that. 
If we talk about the advantages you get with assembly language: obviously it gives you the ability to get the maximum performance out of the underlying hardware, and if you do it this way you still benefit from the nice, fast compilation that Go offers. To get assembly into your code you could also use the cgo route, but that has some disadvantages. One is that you need to have cgo available, and it takes longer to compile, because it's more like a C-style compile that happens; also, when you call into that code there is some runtime overhead, because the stack needs to be saved and so on. If you write your own assembly, you don't have that overhead. And obviously, with assembly you can take advantage of the SIMD instructions on Intel or the NEON instructions that are available on ARM. A little word of caution: you are a bit on your own if you do this kind of stuff. The documentation is limited and a little sparse. There is a considerable amount of example code in the Go repository itself, for crypto and other lower-level code, so if you grep for *.s files in the Go source you will see a lot of examples there, and likewise we now have a few repositories with this kind of code in them. As an example, there is a file where you can see the translation between this pseudo-assembly language and the actual ARM instructions as they are generated. If you want to integrate assembly into Golang, you have to use the .s extension for the assembly file, together with an architecture identifier in the filename. 
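To make those conventions concrete, here is a minimal sketch of what such a file might contain (a hypothetical function, callable from Go as `func add(a, b int64) int64`, in an `_amd64.s` file). The left-to-right data flow and the FP pseudo-register mentioned above are both visible:

```asm
// func add(a, b int64) int64   (hypothetical example, AMD64)
// FP is the pseudo frame-pointer register used to address arguments;
// $0-24 means no local frame, 24 bytes of arguments plus return value.
TEXT ·add(SB), $0-24
	MOVQ a+0(FP), AX    // AX = a  (source on the left, destination on the right)
	MOVQ b+8(FP), BX    // BX = b
	ADDQ BX, AX         // AX = AX + BX
	MOVQ AX, ret+16(FP) // store the result for the caller
	RET
```

The matching Go file would declare only the prototype, `func add(a, b int64) int64`, with no body; the toolchain links the two together.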
So _arm64 means the assembly is for the ARM 64-bit platform, and likewise _amd64 is for the Intel or AMD 64-bit platform. Here is a snapshot of a repository; it's a little hard to read, but you can see several versions for the AMD64 platform, because in this repository we actually have dedicated AVX2, AVX and SSE versions, and then there is an ARM version. I'll talk a little more about AVX2 and AVX later in the presentation. One approach that has worked nicely for us is to start out with the algorithm at the Go level itself, because even if you do this kind of work, you are not likely to do it for all the different architectures out there, so it always makes sense to have the functionality in Go itself as a baseline. Then, as you start to translate that functionality into assembly, what we did is start with small bits and pieces: you comment out most of the Go code in your routine, and you translate small pieces into assembly, maybe initially just passing in the arguments and writing back some results, making sure that all works. Once you have that working, you gradually translate more and more of your Go functionality into assembly, and along the way you test whether the assembly functionality is equivalent to the Go implementation. 
If you look at the commit history of the blake2b-simd repository, you can actually see how we used this technique. Along the way we also developed a small utility called asm2plan9s, which generates the byte sequences for opcodes, or assembly instructions, that are not natively supported by Go itself. It uses the YASM assembler behind the scenes: you write your instruction as you would write it in regular assembly, then you run the tool on the file, and it prepends that instruction with the actual opcodes that go into the byte stream that is executed. That has worked nicely for us. As for what we have done with assembly: we've accelerated the Blake2b algorithm as well as the SHA256 algorithm, and we're planning one more piece of work, which has to do with the erasure coding, called Reed-Solomon. This uses what is called Galois field arithmetic, polynomial multiplications, and ARM has a specific PMULL instruction, so we want to use that instruction to accelerate the Reed-Solomon computation of the parity blocks on ARM. For Intel it is already accelerated, but we want to do the same for the ARM platform, and when there is a need in other pieces of the code, we will likely do more. The first technique that we accelerated is a hashing technique called Blake2b. There was a SHA-3 competition a few years ago, and BLAKE was one of the, I think, five final contenders. In the end it was not selected, but it is a nice hashing technique, really focused on speed, with a relatively simple algorithm to implement, while still offering top-of-the-bill security, and it's really optimized for 64-bit platforms. 
So we developed a repository using SIMD instructions to accelerate the Blake2b algorithm, and we did it in three flavors: there's an AVX2 implementation, which is the fastest, then an AVX, and then an SSE implementation, and depending on what your CPU supports, the highest level of SIMD instructions available will be used. Overall we were able to achieve close to a 4x performance improvement over the high-level Go functionality, and that helped us quite a bit. If you look at the table on the right (it's a little too small), we can do about 850 megabytes per second with the Blake2b algorithm on an AVX2 machine, and this compares to SHA256 at about 190 megabytes per second and SHA512 at about 300 megabytes per second. For an object storage server, you can imagine that with large blobs of data on disk, being able to return the first byte when someone asks for, say, a one-gigabyte object matters: a factor of four is the difference between returning something within a second or it taking four seconds, so that's a pretty significant saving. Again, we developed this for the bitrot detection mechanism that is part of MinIO. If you look a little closer at how it works: the top line is a very small portion of all the computations that happen as part of Blake2b hashing. There are basically four additions there, on 64-bit integers, and with the AVX capability of the Intel platform you can work with 128-bit wide registers, which allows you to do literally two additions in parallel. That is what you see here: with AVX you use what are called the XMM registers, again 128-bit wide registers, so essentially this means that whereas 
here it takes two lines, here you can do it in a single line, and likewise a second XMM instruction will take care of the other two, so you go from four instructions to two instructions. If you go one step further, AVX2 extended the pipeline to 256 bits wide, which means you can do four 64-bit additions with a single instruction; that's this one, and here you use the YMM registers instead of the XMM registers, which are 256 bits wide. In a way this explains the roughly four-times speedup between Go and AVX2; there's a bit more to it, but fundamentally this explains it quite a bit. The Blake2b algorithm itself has 12 rounds of working on the data. This is the preamble, where first the message is read in and shuffled around; then there are two macros that do the actual XORing, adding and rotating of all the bits; then there's a diagonalize macro, which rotates values between the different words you are working on; then there's a second half where you load more of the message, shuffle it again, and call the same macros again; and finally you undiagonalize, so you shift everything back. If you look at one of the macros, in this case the G1 macro, it consists of three columns: on the right side of the screen you see the high-level Golang instructions, in the middle the AVX instructions, and on the left the actual opcodes that are written into the assembly stream. So what this does: here is an addition, this is an XOR, and here you actually see a rotation, but the Blake2 algorithm has 
selected shift values that map nicely onto the Intel architecture, so while conceptually it is a rotation, it can be done as a shuffle, shuffling bytes around within the registers. Then there are more additions, more XORing, and another rotation, and this happens over and over again, for a total of 12 rounds, and that's how you get your hash out. OK, that was the Blake2b algorithm. We've also done some work on the SHA256 algorithm, most notably for the ARM platform, because the ARM platform has specific instructions to accelerate SHA256 calculations, and they make a tremendous amount of difference: it is literally a hundred times faster than not using those instructions. Intel also has instructions for SHA256 acceleration, but they seem to be only defined in software; there seem to be no hardware implementations out there, otherwise it would be nice to take advantage of those as well. In the table at the bottom you can see the speeds that we get, and interestingly, the ARM, running at just 1.2 GHz, is actually the quickest one out there: it does about 640 megabytes per second. We also ran some tests on an Intel Xeon at 2.4 GHz: AVX2 comes in at about 355, AVX runs around 300, and SSE is still a little lower than that. Then there are the Go versions: pure Go on the Intel Xeon platform gets you about 190, and the last one is the Go version on ARM. As for how you actually invoke assembly code from Go: you essentially define the prototype of the function that you're going to call into. For most of these hashing techniques you pass in the digest that you start with and the message itself; the digest is basically a slice of 32-bit integers, the message is a slice of bytes, and here we compute the new 32-bit 
integers into the digest, so that's this parameter; the actual message that you're going to hash is passed directly into what will be the assembly code, and once you're done, the result is written back into the digest, and that is what you return. This is an ARM example. The first line is how you define the function in the assembly code, and the first thing you do is read the parameters that were passed in. This function takes two slices as inputs, and as maybe some of you know, a slice is actually a structure with three elements in it: the first element is a pointer to where the actual data is, and the second is the length of the slice. So the first few instructions read those in, and because this is running on a 64-bit platform, each slice takes up a total of 24 bytes: three elements, each eight bytes long. The first instruction fetches the pointer to the digest, the second line fetches the pointer to the message, and the length of the message is read in the third statement, another eight bytes beyond the 24 bytes that point into the second slice being passed in. Further down we do some initialization: here we load the digest from R0, which corresponds to x0, and we also load from the message itself. There's also a constants table, so we get a reference to the constants table in R3, and R3 corresponds to x3, so here we load values from the constants table. This is the main loop where the hashing is done: R1, which is x1, is the pointer to the message; here the values from the message are read in, and there's some repositioning code to make sure things are in the right order, and then 
these are the specialized ARM instructions that give you the hundred-times speedup. Then there's some more shuffling of data around, but because these are specialized instructions, the actual assembly is pretty short. This is the end of the routine: again there are some ARM SHA-extension instructions here, then the result ends up in v0 and v1, and here you write out the result, v0 and v1, back to x0, which was the pointer to the digest, and then you return. And obviously, R2 holds the length of the message, you process 64 bytes at a time, and when there's more work to do you loop back. So that's pretty much how a function like this looks. If you look at some of the resources out there: as maybe some of you know, some of the origins of the Go architecture and work go back to the whole Plan 9 effort, and some of the assembly still originates from that, so some of the documentation is literally at the 9p.io website. Not everything at those URLs is applicable, but most of it is, and they are generally good resources to read through to get an understanding of how this all works and ties together. There's also a high-level asm document, and there are some pointers for ARM and some for Intel about how the NEON and AVX instructions work. OK, that was the performance optimization work we've done on hashing techniques. We also did some work on something we call a distributed syncing technique. When I talked earlier about the three flavors of MinIO, the last one listed is called the distributed version. So we have two 
versions that can work with erasure coding and then distribute the data, either across multiple disks for the XL version, or across multiple servers for the distributed version. The distributed version can run on anywhere from a minimum of four servers up to a maximum of 16 servers, and we needed a synchronization mechanism between the servers. We looked at existing protocols and techniques out there; we like a technique like Raft, but we found it a little bit overkill for what we were trying to do, so we designed a pretty minimalistic synchronization technique that works well for us. The design goals were: keep the design simple, which also means there is, for instance, no concept of a master node, because if you have something with a master node and the master node goes down, your whole system is down, so you need backup master nodes, and then you're already talking about two or three master nodes, and it quickly becomes more complicated than you initially think. So in our system there is no concept of a master node; all the nodes are equal. It's also resilient, in the sense that if multiple servers go down, the system just continues to function: up to a total of n/2 - 1 servers can be down and you can continue to function. We also wanted it to be a drop-in replacement for the existing primitives in the Go language, so the API, or the interface, is the same as sync.RWMutex and the sync.Locker interface. As part of our requirements we capped the maximum number of nodes at 16; you could push that number up a little if you wanted to, but 16 is good enough for us. How this works behind the scenes is that all servers have a connection to all of 
the other servers, and when one of the nodes asks for a lock, it contacts all the other nodes, and when a majority of the nodes grant the lock, the lock is granted on that particular machine. When you release it again, it sends out release messages. So it's relatively simple, but for us it works quite well. Here is an example, maybe a little difficult to read, but again the API is compatible with the regular RWMutex. In this example you create a mutex and lock it twice for reading; there are some goroutines that after one and two seconds unlock the read locks, and then here you try to acquire a write lock. Obviously the write lock can only be granted when both of the read locks that were granted earlier have been released, so you can see here that it acquires read lock one, read lock two, and then tries to acquire the write lock, but it blocks until both the first and the second read lock are released; only then will it acquire the write lock. Obviously, running this on a single machine is not really the point, but you could just as well create three of these named mutexes on three machines, take one read lock on one machine, another read lock on another machine, and the write lock on the third machine, and the whole sequence of events would be identical. In terms of performance, on a 16-node configuration we were able to do about seven and a half thousand locks per second, at about 10% CPU usage on a regular server, and typically a lock is granted within about one millisecond, so this quite nicely met our requirements. An object storage server does not have the requirement that you write, say, a hundred thousand objects to it per second; for something like a key-value store this wouldn't work, but in our case it quite nicely met our requirements, and it's nicely 
bundled with the MinIO distribution. Here's a pointer to the repository, and on our blog we have a blog post about this with some more details. That pretty much brings me to the end of my presentation. Thank you for your attention, and go ahead.

[Audience] I have a question. You are using Blake2b to detect bitrot, and then you have performance problems, but Blake2b is a secure hash, so why don't you use an error-detecting code, which is probably much, much cheaper, instead of a secure hash?

[Speaker] What do you mean by an error-detecting code?

[Audience] Well, something like Reed-Solomon, or probably Hamming, something cheaper.

[Speaker] Reed-Solomon doesn't give you bitrot detection.

[Audience] I mean, it detects bit flips. Maybe Reed-Solomon, I can't remember, I'd have to look it up, but Hamming codes, for example, detect errors, and there are codes much better suited to detecting bit flips.

[Speaker] No, the problem is: there are parity blocks computed in Reed-Solomon, but imagine you have two data blocks and two parity blocks, and you lose, say, two disks, so you're down to two blocks. If you just have those two blocks, you have no way to determine anymore whether there's bitrot or not. You're right that if you still have all the data and parity blocks available, you could work out whether a block that you read off the disk has been changed, though it would not be easy to do. But when you lose disks and you're down to the absolute minimum that you need in order to reconstruct, then you have no way to detect bitrot anymore.

[Audience] We can talk about this later; I don't agree yet. You mentioned that this code doesn't rely on cgo; is there any complication at all in compiling this kind of program?

[Speaker] No, it does not rely on cgo, but what is your question?

[Audience] Well, is there any complication compiling this program, or is it just a go build and that's all?

[Speaker] That's it: if you do a go get you will get all the repositories, and it just builds 
normally. You just need to have the Go toolchain installed, but no other special options or features of the toolchain.

[Audience] OK, so it will just adapt to the platform, to the CPU?

[Speaker] Absolutely right. Any other questions?

[Audience] Thank you. I was just wondering, how did you write the assembly? I mean, you showed us all these opcodes; did you use some other assembler and then extract them, or how did you do that?

[Speaker] For that we developed this asm2plan9s tool, which is also here, you see the repository. Basically, this is what you write in your assembly file, and then you run asm2plan9s on the file, and it will prepend the byte sequence.

[Audience] I missed that. OK, thank you.

[Speaker] Otherwise it's pretty cumbersome. Any other questions?

[Audience] My question was regarding your SHA256 implementation: have you thought about contributing it to the standard library?

[Speaker] Yeah, we were planning to do that, but we need to do a bit more work for that; that will be coming.

[Audience] Second question: how come your implementation is faster than the Go standard library for AMD64, since both are in assembly and both have an AVX2 implementation? Did you do something special?

[Speaker] Well, there are different ways to do things in assembly, right? It's not going to be a factor of 2x, but depending on how exactly you do it, you may still see minor differences. Just like in regular high-level Go, one way of doing things may be slightly better than another. So, thank you.