Thank you so much, and thank you everyone for joining us today. I head up customer success here at Redpanda, and we'll be talking about Graviton and ARM and how they compare to x86.

A quick overview of what we'll be covering. We'll talk a little about the Graviton story and where it originated, then about Redpanda: what it is and why it's interesting in the context of a comparison between the two architectures. Redpanda is written in C++, so we'll cover some of the implications of that, along with the thread-per-core model we use inside Redpanda and how that affects things. Then we'll get into the benchmarking of ARM versus x86, talk about conclusions, and answer any questions that come up at the end.

So with that, let's talk a little about Graviton. We've all used ARM processors if we have a mobile device, so we all have some familiarity with ARM chips, or with Raspberry Pis and similar devices. Amazon first launched Graviton in 2018, and it was interesting to see what they were doing then, though the performance was still in its early stages. Very quickly after the initial launch, Graviton2 came out a year later, with roughly a 50% performance improvement over the original Graviton.

One thing that's different with Graviton versus the Intel instances we typically get inside AWS is that every core is a physical core. When we see "vCPUs" in the instance information for the Graviton instances, those are actual physical cores, not the hyperthreaded cores we're used to on the Intel instances, where two hyperthreaded vCPUs make up one physical core.

All of this was pretty interesting to us as a company, since Redpanda is very focused on performance and also on the cost of that performance. But we didn't really get interested in this space until AWS launched their storage-optimized Graviton instances in November 2021. Before that, the storage attached to Graviton instances was either very small, or you were relying on EBS persistent disk, which comes with its own performance impact and cost. In November 2021 a few different instance types were introduced that added a large amount of local storage, and those are the instances we'll really be focusing on today.

Graviton3 was also just released last month, first showing up in the C7g instances. We're interested in the additional performance we might get out of that architecture, but it hasn't made its way into the storage-optimized instances quite yet. Typically Amazon rolls a new generation out in a similar order across instance families: compute-optimized first, then general purpose, then storage-optimized. So we expect to see Graviton3 show up in the storage-optimized instances later this year.
So what instances are available to us that are focused on storage optimization? It really comes down to the im4gn and is4gen instances. You get a large amount of storage on a per-instance basis; the difference between the two instance families is really the number of cores and the amount of memory you get alongside that storage. The other key thing is the networking: with the im4gn instances, from the 4xlarge up you get guaranteed networking, and you can go a fair bit higher, all the way up to 100 gigabits per second, whereas the is4gen instances are more focused on workloads that are less CPU-intensive but still need a lot of local storage. So it's really a ratio change from one family to the other.

To quickly touch on the x86 side and the instance types we'll be using there: we'll talk very briefly about the i3 instances, some of Amazon's early storage-optimized instances, based on Broadwell and introduced back in 2017. We'll be doing most of our comparisons against the i3en, which is Skylake or Cascade Lake, introduced in 2019. These are the default instances we run Redpanda on pretty regularly, and a lot of databases and storage-heavy applications also default to the i3en. I've also done a little comparison with the i4i instances, which are Ice Lake, introduced this year. These are interesting: the performance, at least from an I/O perspective, is pretty amazing, but there are some downsides to the i4i instances, and I'll talk about those a little further into the presentation.

The other thing to note, as I mentioned earlier: on the x86 side a vCPU is a hyperthreaded core, so when we compare instance types we're really comparing physical cores to hyperthreaded cores. As for what we have here: on the i4i instances the storage ratio is a little lower, probably half the size of what we'd see with the im4gn or is4gen instances, but you do get the AWS Nitro SSDs. The i3en instances have more storage per instance, but they use a last-generation CPU architecture, and the disks are a bit slower compared to the i4i.

Before we start really digging into the benchmark itself, if you're not familiar with Redpanda, one of your questions is probably: what is Redpanda, and why am I wearing a shirt with a lot of red pandas on it?
So what is Redpanda? It's a streaming data platform, and it's fully Kafka API compatible. If you're familiar with Kafka and have run any kind of Kafka workload, or anything against the Kafka API, Redpanda can be a drop-in replacement for Kafka for those workloads. Where we differ from Kafka is that we're really focused on modern hardware. We're fully written in C++ and make use of a number of optimizations, like using XFS as the filesystem for out-of-order writes and pre-allocation. We're really focused on low latency, where Kafka is typically focused on high throughput; we can do high throughput too, but we can also deliver low latency, and we're typically able to do it with a large reduction in overall hardware.

We're also a pretty simple platform to use: a single binary, so it's easy to get up and running, to deploy yourself, or to run locally on your own system, with no external dependencies at all. You don't have to run it with etcd or ZooKeeper or anything like that.

We also do an fsync after every batch of messages, which is why these instance types are so interesting to us: they handle a very large number of IOPS, so we can issue those writes very quickly. Your p50 latency on these operations is in the microseconds, versus persistent network storage, which can be in the milliseconds. So for us, these kinds of benchmarks and performance comparisons are really critical. We also use Raft behind the scenes: we're a Raft-based system, which lets us ride out partial failures in an environment and gives us more predictable performance than other replication mechanisms, since one slow follower in the system doesn't affect the overall replication latency.

The other key thing inside Redpanda that makes it possible for us to scale on modern machines is Seastar. Seastar is a framework and library, fully written in C++, that provides an asynchronous programming model; you can think of it as having been put out there before io_uring became really popular. We get the question quite often of how Seastar compares with io_uring; they are similar in many ways, but Seastar gives us things like futures and promises. The other thing it provides is a thread-per-core architecture: when Redpanda first starts up on a system, it looks at how many cores are available, spins up that many threads, and pins each thread to a given core. This eliminates global locks, because we're essentially treating every core as its own machine; it minimizes I/O blocking and reduces the overall context switching in the system.
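To make that thread-per-core model concrete, here is a minimal sketch in plain C++ (not Redpanda's or Seastar's actual code): one thread is launched per core and pinned with a CPU affinity mask, and each shard owns its state outright, so the hot path needs no locks.

```cpp
// Build: g++ -O2 -pthread thread_per_core.cc (Linux)
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

// Each "shard" owns its state outright; shards never share mutable
// data, so the hot path needs no locks.
struct Shard {
    unsigned core = 0;
    unsigned long events_handled = 0;

    void run() {
        // Pin the calling thread to this shard's core (Linux-specific).
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core, &mask);
        pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

        // Event-loop placeholder: a real reactor (e.g. Seastar) would
        // poll I/O completions and run futures here.
        for (int i = 0; i < 1000; ++i) {
            ++events_handled;
        }
        std::printf("core %u handled %lu events\n", core, events_handled);
    }
};

int main() {
    // Discover the core count at startup, as Redpanda does, and spin
    // up exactly one pinned thread per core.
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<Shard> shards(cores);
    std::vector<std::thread> threads;
    for (unsigned c = 0; c < cores; ++c) {
        shards[c].core = c;
        threads.emplace_back([&shards, c] { shards[c].run(); });
    }
    for (auto& t : threads) t.join();
}
```

In Seastar itself, each pinned thread runs a full reactor with its own memory allocation and I/O queues, and shards communicate by explicit message passing rather than shared memory, which is what keeps lock contention and context switching off the hot path.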
This works really well with ARM, because we're getting physical cores rather than hyperthreaded cores that might be context-switched, so this type of architecture lends itself quite well to ARM deployments.

Just to show that we are performance-oriented, here is some benchmarking we've done previously comparing Kafka to Redpanda, running on three i3en.6xlarge nodes at 500 megabytes per second. As you can see, Redpanda sits around two to three milliseconds of latency on average, and our tail latencies stay very consistent all the way through. Kafka, if you're not familiar with it, is Java-based, so it has a more challenging time fully optimizing I/O, and you can see that on the latency side: an increase in overall latency and an increase in the tail latencies of the system. With these optimizations and this reduced latency, we can really shrink Redpanda's overall footprint. We've had many cases where people have been able to go from a 50-node Kafka cluster to a 7-node Redpanda cluster just because of these optimizations, reducing the amount of infrastructure needed, which brings down the overall cost of running a system like this and reduces the infrastructure and power needed to do the same exact workload with better latencies.

So, enough about Redpanda; let's get into the meat of things. I wanted to give that context of why we care about this and why you'd look at Redpanda as a way of investigating how the Graviton instance types compare to x86. But first, before we could do any of that, we had to port our C++ code over to ARM. As we went through this exercise last year, we thought for sure we'd have to add some macros to the code, or do other architecture-specific things to make it work. But all we ended up having to do was change the architecture the code was compiled for, from amd64 to armv8, and it worked out of the box with no code changes needed on the C++ side. What also really helped is that we vendor all our third-party dependencies and build them ourselves. Seastar, as I mentioned earlier, is vendored and built as part of Redpanda's build process, so we were able to recompile all the libraries and dependencies for ARM and have it all just work seamlessly.
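For context, here is the kind of per-architecture code we expected we might need and, as it turned out, did not: hand-written intrinsics differ between x86 and ARM, so any code using them needs guards, while portable C++ recompiles cleanly with only the target changed. This is a hypothetical illustration, not something from the Redpanda codebase:

```cpp
#include <cstdint>

#if defined(__x86_64__)
#include <nmmintrin.h>  // SSE4.2 CRC32 intrinsics; build with -msse4.2
#elif defined(__aarch64__)
#include <arm_acle.h>   // ARMv8 CRC extension; build with -march=armv8-a+crc
#endif

// One CRC32C step over a 64-bit word: the same operation, but a
// different intrinsic on each architecture.
inline uint32_t crc32c_u64(uint32_t crc, uint64_t data) {
#if defined(__x86_64__)
    return static_cast<uint32_t>(_mm_crc32_u64(crc, data));
#elif defined(__aarch64__)
    return __crc32cd(crc, data);
#else
#error "a portable software CRC32C fallback would go here"
#endif
}
```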
Now, on to the benchmarking. What did we look at? We wanted data-intensive workloads that show the demands placed on both the disk and the network, so we tested a few different areas, four in total: disk bandwidth; ingestion throughput; bandwidth regulation, or regulator behavior, which I'll talk about in a little detail at the very end, as it was an interesting oddity we saw; and end-to-end latency, since it's always good to look at what end-to-end latency actually looks like in a system. The two instance types we really focused on were the i3en and, as I mentioned, the is4gen. So without further ado, let's get into the results.

First, disk bandwidth, to see what we could possibly get from a gigabytes-per-second standpoint. We just used fio, with 16-kilobyte block sizes. That block size is critical to note: the reason Redpanda goes with 16-kilobyte blocks is that it's typically the amount of data we persist per I/O operation; we chunk our data into 16-kilobyte writes. So testing this way gives a pretty good picture of what the disks will do when Redpanda is running on the system. We used direct I/O with libaio as the I/O engine, and we made sure to keep the I/O depth full as we went, since that's also done inside Redpanda for further optimization. You can see here that the i3en.12xlarge is about equivalent to the is4gen.8xlarge, and the comparison down the line is pretty similar. As we start to get into some of the smaller instances, the i3en instances do pull ahead a little, but you also have more cores in those machines compared to the is4gen for the amount of storage you have. Once we move up into the larger instances, they're very equivalent in the gigabytes per second that can be pushed through on those instance types.

For the next test, we installed Redpanda on the different nodes and used a tool from librdkafka, which is the de facto Kafka client library, or one of the de facto libraries, out there today. It's a C library that's used behind the scenes by a number of different drivers, whether you're using a Python driver or something else, and it includes a tool called rdkafka_performance that can drive producer throughput against a cluster. So we were just producing data to these systems, using client machines that were c6i.8xlarge instances and a one-kilobyte message size. This chart looks very similar to the previous one; there's not a huge amount of difference, maybe a couple hundred megabytes per second, if that, between these systems.
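To give a sense of what that load generator is doing, here is a minimal producer sketch using librdkafka's C++ bindings, along the lines of rdkafka_performance with the one-kilobyte messages from the test. The broker address and topic name are placeholders:

```cpp
#include <librdkafka/rdkafkacpp.h>
#include <iostream>
#include <string>
#include <vector>

// Build: g++ -O2 producer.cc -lrdkafka++ -lrdkafka
int main() {
    std::string errstr;
    RdKafka::Conf* conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
    // Redpanda speaks the Kafka protocol, so the client just points at it.
    conf->set("bootstrap.servers", "redpanda-node:9092", errstr);  // placeholder
    conf->set("acks", "all", errstr);

    RdKafka::Producer* producer = RdKafka::Producer::create(conf, errstr);
    if (!producer) { std::cerr << errstr << "\n"; return 1; }

    std::vector<char> payload(1024, 'x');  // 1 KiB message, as in the benchmark
    for (long i = 0; i < 1000000; ++i) {
        RdKafka::ErrorCode err = producer->produce(
            "bench-topic",                 // placeholder topic
            RdKafka::Topic::PARTITION_UA,  // let the partitioner pick
            RdKafka::Producer::RK_MSG_COPY,
            payload.data(), payload.size(),
            nullptr, 0,                    // no key
            0,                             // timestamp 0 = use current time
            nullptr);                      // no per-message opaque
        if (err == RdKafka::ERR__QUEUE_FULL) {
            producer->poll(100);  // drain delivery reports, then retry
            --i;
            continue;
        }
        producer->poll(0);  // serve delivery callbacks without blocking
    }
    producer->flush(10000);  // wait up to 10 s for in-flight messages
    delete producer;
    delete conf;
}
```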
Where things get a lot more interesting for us is when we start to look at the cost factor. The way we did this was to look at gigabytes per second per dollar per hour: how much throughput can I get on a per-instance basis, per the cost of that instance itself? When we look at it from that perspective, we see that the is4gen instances have quite the advantage here: they get us about 20% better value compared to the i3en instances. It's pretty fantastic to see a 20% improvement from a cost perspective while still getting the same level of performance.

The other thing we saw through our testing came from the smaller instances. We're always interested in how small an instance we can run Redpanda on, as we have customers that run us on IoT or edge devices where they only have one core available. So we did some testing on the smallest instances, the i3.large and the is4gen.large, and one of the things we noticed was a kind of smoothing of throughput. (The x-axis on this chart should actually be time, not gigabytes per second per dollar.) What you can see is the amount of data that could be pushed through: with the i3.large the throughput spikes at times and then drops down to zero, whereas with the is4gen.large there was better regulation, a kind of regulator smoothing, happening in the environment that allowed a more consistent throughput. So you get more consistent performance, versus the spiky behavior we saw with some of the other smaller instances, specifically the i3.large.

There are also other factors to consider. In this particular benchmark we're mostly focused on storage and network, but we can also look at the different instance types in terms of the value we're getting from the number of vCPUs, which is a little challenging because on the is4gen a vCPU is a physical core, and that's not the case on the i3en instances. We can also look at what we're getting from a value perspective in gigabytes of storage per dollar spent, and the same for memory and network. In this comparison, larger numbers are better, because it means we're getting more for our money. What we see across the board is that the is4gen instances give us 17% more SSD capacity and 18% more guaranteed network bandwidth for the money, while the i3en instances give us a bit more memory and a bit more vCPU. But the things that are really critical for our system are network and disk, and for streaming systems that's typically the case, so this is actually a great outcome for a system like Redpanda.
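The value metric itself is just throughput divided by hourly cost. A toy illustration of the math; the throughput and price figures below are made-up placeholders, not our measured results:

```cpp
#include <cstdio>

struct Instance {
    const char* name;
    double gb_per_sec;    // measured ingest throughput
    double usd_per_hour;  // on-demand hourly price
};

// Higher is better: how much throughput each dollar-hour buys.
double value(const Instance& i) { return i.gb_per_sec / i.usd_per_hour; }

int main() {
    // Placeholder numbers, for illustration only.
    Instance x86 = {"x86 storage-optimized", 2.0, 5.40};
    Instance arm = {"ARM storage-optimized", 2.0, 4.50};
    std::printf("%s: %.3f GB/s per $/hr\n", x86.name, value(x86));
    std::printf("%s: %.3f GB/s per $/hr\n", arm.name, value(arm));
    // Equal throughput at a lower hourly price is what produces the
    // roughly 20% advantage described above.
}
```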
Now, on to end-to-end latency, using the benchmarking framework I mentioned earlier, the OpenMessaging Benchmark. This is a framework for comparing different stream-processing engines: if you're comparing, say, RabbitMQ to Kafka or to Pulsar, the OpenMessaging Benchmark is what you'd use. It's a Linux Foundation project, and I'd highly recommend taking a look at it. We have our own fork at Redpanda as well, which just adds the Redpanda-specific drivers, deployment scripts, and so on; you can take a look at it in our GitHub repo.

We wanted to benchmark what end-to-end latency would look like for a few different instance types, staying right around a thousand-dollar-per-node monthly cost basis. These are the costs for running in us-east-2, on demand with no savings plans or anything like that, quoted as a one-node monthly price per instance. I wanted to look at the i4i instances because I find them interesting: Redis and ScyllaDB have been referenced talking about their performance and some of the benchmarks they've done on them, so they're really focused on the high-I/O space. Then we brought in the im4gn.4xlarge and the is4gen.2xlarge, plus the i3en.3xlarge, which is the closest comparison to the is4gen.2xlarge. You can see they're all pretty comparable in price. The is4gen.2xlarge is definitely the cheapest option, and the im4gn.4xlarge is a little more expensive than the i3en.3xlarge, but one key thing to note is its guaranteed 25 gigabits per second of networking, which is pretty fantastic for AWS. Whenever you see "up to 25 gigabits per second" in AWS, my experience has been that it typically means you get 25 gigabits per second for around 8 to 10 minutes and then get throttled to anywhere between 8 and 12 gigabits per second for the remainder of the hour. So you always have to do your testing with iperf and similar tools to fully understand what that number really means. It's always great when you have a guaranteed amount of network throughput, because it takes the guessing game of "is it the network, am I running out of credits on my network connection?" out of the investigation entirely.

So what does this look like? This is the output from the OpenMessaging Benchmark. We ran a workload of 715 megabytes per second of throughput with a one-kilobyte message size, with two client machines, m5n.4xlarge instances, driving the load. This is the average end-to-end latency of the system: the i4i instances gave us the best overall performance, and the second contender was the im4gn instance, which came in pretty close. Each of these tests was run for an hour.
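Conceptually, end-to-end latency in a harness like this is measured by stamping each message with the producer's wall-clock time and subtracting it from the receive time on the consumer. A rough sketch of the idea (a real harness also has to keep the client machines' clocks tightly synchronized):

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>

// Producer side: embed the send time (microseconds since epoch) at
// the front of the message payload.
void stamp_send_time(char* payload) {
    auto now = std::chrono::system_clock::now().time_since_epoch();
    int64_t us = std::chrono::duration_cast<std::chrono::microseconds>(now).count();
    std::memcpy(payload, &us, sizeof(us));
}

// Consumer side: end-to-end latency is receive time minus the embedded
// send time. This is only meaningful when producer and consumer clocks
// are synchronized (e.g. with chrony/NTP), since they run on different
// machines.
int64_t e2e_latency_us(const char* payload) {
    int64_t sent_us;
    std::memcpy(&sent_us, payload, sizeof(sent_us));
    auto now = std::chrono::system_clock::now().time_since_epoch();
    int64_t recv_us = std::chrono::duration_cast<std::chrono::microseconds>(now).count();
    return recv_us - sent_us;
}
```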
Again, the im4gn.4xlarge came in very close to what we saw with the i4i instances; the i3en.3xlarge is in third place, and the is4gen.2xlarge still comes out with a pretty respectable latency overall, though it definitely suffers a little from having a reduced number of cores compared to the other systems.

One thing to really note here, going back to the instance types and what we just saw in the overall performance: the i4i gives us the best performance but has only half the storage of the rest of the instances. The storage itself is pretty amazing. In some of the testing I did on the i4i.4xlarge, I was able to see 120,000 IOPS and two gigabytes per second of throughput on a single drive, which is amazing, but the challenge is that the network couldn't keep up with the actual drive performance. So for a system that's very I/O intensive but doesn't have to do a lot of network I/O, the i4i is very interesting in that regard, but for Redpanda itself, and for other systems that lean on both disk and network, the im4gn really does well here and is a great contender.

It's also really important to look at the upper-percentile latencies, the tail latencies of the system. We can see here that the i4i and the im4gn are pretty much neck and neck in that regard, with a little higher latency on the i3en.3xlarge and the highest latencies on the is4gen.2xlarge. That should be expected: the is4gen.2xlarge has half the cores of the im4gn.4xlarge, as you can see if we go back to the instance comparison.

So what does this really tell us? It tells us that the im4gn.4xlarge provides a great balance between performance, cost, and overall storage capability. The i4i instance is interesting, and it's something we'll probably investigate a bit more ourselves here at Redpanda, because for cases where we want the highest I/O throughput and then age data out to something like S3 or another object store, the i4i instances might make sense. But again, it always goes back to network I/O: if we're doing operations back to S3, we might cut into our network and then not be able to take full advantage of what's available from a disk perspective. If you're running database operations where compactions and similar things happen quite often, the i4i might make sense, but in general you might not be able to make full use of the I/O available from the disks themselves if you can't drive it from a networking standpoint. And with the im4gn.4xlarge you get seven terabytes of disk space for $1,065 per month, compared to the i4i.4xlarge at about a thousand dollars a month for just half the storage.
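The network-versus-disk tradeoff is easy to see with back-of-the-envelope numbers. Two gigabytes per second of drive throughput is 16 gigabits per second, and with replication factor 3 a leader that receives one byte of produce traffic sends two more copies out to followers, so a 25-gigabit NIC saturates well before the drive does. A rough sketch of that arithmetic, simplifying the traffic pattern to a single leader:

```cpp
#include <cstdio>

int main() {
    // Drive figures observed on the i4i.4xlarge, as described above.
    const double disk_gbytes_s = 2.0;
    const double disk_gbits_s = disk_gbytes_s * 8;  // 16 Gbit/s

    // With replication factor 3, the leader's NIC carries roughly one
    // unit of traffic in plus two units out per unit of produce.
    const double nic_gbits_s = 25.0;  // "up to" 25 Gbit/s
    const double max_produce_gbits_s = nic_gbits_s / 3.0;  // ~8.3 Gbit/s

    std::printf("disk can absorb %.0f Gbit/s, but the NIC caps produce "
                "traffic near %.1f Gbit/s\n",
                disk_gbits_s, max_produce_gbits_s);
}
```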
So it's always interesting to see the difference in what you get from a price-performance standpoint and the overall outcome when you look at the end-to-end latency of the system itself.

So what are our conclusions, and how do we summarize all this at the end of the day? As we saw in some of the earlier tests, the ARM instances have that 20% advantage in price-performance, and we see smoother regulation on the smaller instances. I think it's really clear that for the smaller instances, running on the ARM processors is absolutely a great way to go: you get more predictable performance and a pretty good ratio of price to performance. If I were running any kind of smaller workload, I would really look to run it on ARM today. The im4gn.4xlarge provides a great price-performance balance for end-to-end latency. And the i4i instances are still relatively new and worth a little more investigation, but there are definitely some concerns that there isn't enough network to really drive the full amount of I/O available on those systems.

So with that, I'll take a quick look at the questions. One question is what kind of storage backs the AWS instance types we used: they're all local NVMe SSD backed. We didn't want to look at any HDD-backed systems, because we were really focused on what we could drive from an I/O perspective, and you're going to get the best I/O and the best latencies from local NVMe SSD storage.

Do we use GCC for compiling? We have multiple ways in which we can compile Redpanda, and LLVM/Clang is one of them. There's more information in our GitHub repo on exactly how to compile it; Redpanda is source-available, so you can go download it and compile it locally, and we have all the build scripts in the repo itself.

As for the features of the ARM processors that contributed to the ingestion throughput results: we didn't do a huge amount of comparison of the CPUs themselves. What we really wanted to compare was whether the amount of CPU available could drive the full amount of I/O, because for us it's critical to have enough CPU processing to drive the I/O of the system. The key question was: with the ARM architecture, and these ARM instances provided by Amazon, what's the best throughput and network I/O we could get? That was really the point of the comparison, to see how they stack up against the standard Intel instances.
And as I was showing in the conclusion, we can absolutely drive the I/O and network with the number of cores available there. Typically it's being done with fewer cores, and that's what provides the roughly 20% value savings over what we see with Intel: consistent performance that can drive these instances to full I/O utilization.

There's a question about what the bandwidth regulator is. This seems to be something that was added as part of the Graviton2 instances; there wasn't a huge amount of information presented by AWS about what's happening in this regard. It was more an observation we made comparing against the i3.large instances. As anyone who's done performance testing in AWS realizes, a lot of the time it's a bit of a guessing game understanding what exactly is happening behind the scenes: am I running out of network credits for the particular instance I'm on, or how is an SSD being split up across multiple instances? If we go back to the earlier part of the presentation and look at some of these instance types, especially the i3en instances, the standard drive is 7.5 terabytes. So on the really small instances, that drive is being split up across multiple instances, and at that point there has to be some type of regulation in place so that you can't make full use of all the I/O capability of the underlying device, and it gets split up evenly across the multiple tenants sharing that drive. AWS does a number of different tricks behind the scenes to help with that, and what we saw in general was that on the ARM instances, especially the smaller ones where the underlying drive is shared across instances, this regulation was much smoother than on some of the older instance types. It's hard to say exactly what that's attributable to, but it's something to be aware of as you run on these types of instances inside AWS.

Great, I think that's all the questions I see here at the end. If there are any further questions about Redpanda and what we've done here, feel free to come join our Slack channel and ask questions there; we have a pretty active Slack community, and you can always reach out to us. We're happy to share more details around Redpanda and some of the testing we've done on the different instance types. A lot of this data actually came from a blog post you'll find on our website, where this information is published in a blog format you can check out after the fact; I think it's also being posted in the chat. And with that, this concludes the webinar for me. Thank you so much for your time and your questions, and I appreciate your attention.

Thank you so much, Broco, for your time today, and thank you everyone for joining us. As a reminder, this recording will be on the Linux Foundation YouTube page later today. We hope you join us for future webinars.
Have a wonderful day.