 Yeah, some code and benchmarks and codes and benchmark and Hopefully that's gonna be useful for some of you So I'm a software engineers engineer for elastic in the lock stash team. So lock stash is written Jirubi entirely And I have two colleagues here guy and perry We're working with me My team So yeah, we're gonna we're gonna talk about Fast IO or how to achieve efficient Persistence using a Jirubi and Java So this is this is work. I've been I've been doing to For the for the actual implementation of a persisting queue I'm gonna and I'm gonna show you that a little bit later on So if this is this is too boring for you guys you can jump to the code right now There's a Jirubi conf 2015 repo with with all the slides and the code and the other two repose and map and MAP cues are the actual Mapping implementation and the cues implementation Okay, so first of all we're gonna talk about Using data or manipulating data in Ruby Yeah Yeah, so Soon enough you're gonna be asking yourself why there's no byte array in Ruby or you know, I wish there was a byte array In Ruby so in Ruby when you manipulate data everything is is true using you know using strings actually to hold the data and There's two important things that you have to consider when I when doing that when the you know Manipulating data is the character set and encoding with your strings and if you're calling out to Java Java classes then you have to Pay attention to the actual type conversion that occurs between Ruby a Jirubi and Java Now there's a very nice Wiki for that that explains some of the pitfalls and some of the the techniques that you can use to Get more speed when we're doing that So why I urge you to actually look at that if you ever want to play with you know crossing the world between Ruby and Java Okay, so first of all, let's talk about encoding I'm not gonna go through all the details for encoding. I'm gonna talk about a little bit the Ruby side and the Java side So these are some of the stuff that that you may have seen for encoding of course the the first line The encoding for the Ruby files So the the strings that you're gonna define in your Ruby files are gonna be encoded in that character set So UTF-8 in that example There's classes and constants that you can use a UTF-8 ASCII 8-bit and This is important to understand what ASCII 8-bit is and we're gonna see the equivalent in the in Java Which is ISO 88 59 underscore 1 This is actually a transparent 8-bit character set. So it's equivalent to saying no no encoding, right? and So main most of the these encoding Methods are in the string class. So you have like force encoding encode Encoding which which is gonna give you what the encoding that the string is in right now Is it a valid encoding and and the actual byte size of a string? So it's inter it's it's important to note here that The only thing that does transcoding in here is the encode method So force encoding doesn't do any transcoding. So it's only a tag for the string to tell it in what encoding it is So I'm gonna show you some code that uses that So the equivalent in the in Java So we have like the default character set in Java And then there's some some classes and constants again for for the encoding So you have utf 8 and then like I said ISO 88 59 1 which is the 8-bit transparent encoding You can set the default encoding in your Java application using the the file that encoding You can get it With as a property or you know if you're you're from Jirubi you can use the end Java file dot encoding to see what the default encoding is in Java Okay, so let let's show an example here if I define a string with Ericss Antigu or a cute accent e Obviously, this is a unique code UTF-8 is going to be a two-byte Encoding so we see that the encoding here is utf-8 if we ask the size. It's it's a one character String if you we check the byte size, then it's two bytes If we force the encoding to ask you a bit which is going to be a transparent 8-bit encoding We get back that that string and if we ask for the size it's going to be two and bite sizes too So this is to show the difference between you know a string length and a bite size for that string that same string Okay, now let's talk about object persistence so And this is why I'm obsessing with string and it's very important when we talk about Persistence and IO it's important to to you know get the the the big picture about encoding and make sure that that you actually Understand that because you may end up having you know wrong encoding and problems with that So The problem is that all Ruby IO use a string so you you don't have a choice you have to go through strings one of the problem that we have is that if you are a Jruby object if you want to do persistence Cannot be serial it cannot implement the serializable serializable interface in Java So you cannot benefit native native a serialization or really faster realization using I don't know Frameworks like cry or stuff like that. You actually serialize your object within Java This is because Jruby objects actually hold references to the runtime and and we cannot serialize that I know that there's been some work To actually be able to have serializable objects in Jruby There's some open issues about that, but it's not there today So basically to persist an object you need to serialize it somehow To a string because you want to write it, right? so You can use you could use Marshall dump, but that's the same thing. It's gonna serialize to a string also Jason and so on so so you're stuck with strings So so yeah, we're gonna we're gonna play with strings and We're gonna do some benchmarks with different strategies all the code and example are gonna use string object. So I'm not gonna talk about serialization. This is gonna be your problem to choose whatever method that you want to serialize You can talk to guy the other of junior Jackson here or maybe a Satoshi for a message pack go see these guys But here for the sake of these these examples We're just gonna deal with with strings objects string buffers and then persist that But of course in the real world You're probably gonna have some serialization cost to include into these these benchmarks are Okay, so what's the motivation here? this is so I work on log stash and This is basically log stash the log stash pipeline. It's just three horizontal cylinders and two vertical cylinders Connected with our roads. Basically, that's it. That's a few years of software engineering right there So we use so these these are the stages the input filter and output stages in in lock stash and they are Connected through a size queues and basically we use size queues there to actually Propagate back pressure when the outputs are slowing down So these size queues gonna fill up then then the the filters won't be able to Push an event in there. So the the other size queues gonna fill up and so on and then it's gonna propagate the back pressure back to the input the input plugins or the input stage So the idea here is that these size queues are in memory queues So if there's a crash a system crash and application crash or whatever We're gonna you're gonna use all these in flight events and in lock stash These are smallish queues, but nonetheless this can cause a problem, right? So one of the One of the way to to solve that is to see can we can we persist those events in these queues? So there's many solutions to that problem, but one of which is saying, okay, we're gonna do Just a persistent queue implementation in right there and and that can be just a drop-in replacement to the In memory queue and then be done with it It's going to be persist and if it crashes then when the application restarts I'm going to read the persisted queue and then and then continue on with the events so that leads us to trying to Find out the best way to persist because lock stash processes, you know Thousand hundreds of thousands of evin per second. So we need to be really really fast in terms of persistence so RIO performance or you know storing as many objects in the last time Possible and oh This is not Satoshi here in the picture Okay, so different strategies that we're going to explore We're gonna the base the base Benchmark that we're going to use is you know plain ruby falayo Obviously after that we're going to see Can we do better with m mapping so memory mapping? I'm going to talk about that We're going to have a java class implementation And we're going to test different strategies to talk to the java class. So implicit casting explicit ruby side java side casting play with the character set a little bit We're going to build a j ruby extension in java And then have a pure ruby implementation also and and check the the performance results Okay, all benchmarks Basically, these benchmarks are for right speed only If we have time I'll come we'll we'll see the actual queuing implementation Which you know reads and writes but these basic benchmarks are for right speed We're going to use a one four and sixteen k buffer buffer size again. These are plain string buffers And we're going to be writing End times two gigabit two gigabyte files So the end times depends on the on the test. So I was just I just wanted to make sure that it takes long enough to get you know results That are in the few second range So this is this has been run on my MacBook Here it has a Local SSD. So of course, there's a lot more IOPS than than a spinning disk Latest java and latest j ruby 17 I haven't run the test on 9k yet, but I'll do that Okay, so standard ruby file IO So if you want to go in the repo and check the the implementation I have a method there that's called bench And it passes me a buffer in a right count and and the only thing that I do So this this bench Method is actually benchmarked. What's happening in in that block? And I'm not counting the creation of the file So it's just going to go on and Write the the buffer that's giving to me. So it's going to be you know one k four k 16 k and so on So So this is the first result that we get this is going to be our our base result So we see that we have you know between You know 680 and about 900 megabyte per seconds in terms of throughput With with file IO with standard file IO Okay, so now an alternative is to use memory mapping so Conventional file IO use read and write system calls and that involves you know copy operation between the file system pages Incarnal space and the memory area in user space. So there's always copy going on But memory mapping IO Uses like virtual memory mapping from the user space directly to the file system pages and with a memory map file The entire file is going to be accessed using a byte buffer class And a byte buffer is that is you know, that's it. It's a byte buffer So you you actually manipulate your file the underlying file just as a byte array And you put bytes and get bytes so you don't have anything Like like read lines or enough files or anything like that. It's just a big byte array that you manipulate Okay, so one some of the advantages of using memory mapping So you you see the file as plain memory like I said as a byte buffer There's no need to issue read or writes If you access or you get bytes from that memory space, there's going to be page faults In the os that will bring the the the the file data from this to these these these memory areas So if you put bytes or you modify the memory map space These pages are going to be marked as dirty and are going to be flushed to this eventually So the os is going to be performing the caching and managing memory According to system loads and available memory One of the important things that the data is always page aligned. So there's never Buffer copying that's required. So this is this is probably one of the biggest benefit And also very very large file. I think it's up to four gigabyte can be mapped Without consuming actual large amount of memory, right because the data is pulled in as as as needed Okay, a few notes If the user process crash The the memory map file is actually intact So because those bytes have been managed by the os In the pull the plug situation Well, just like a normal file, you don't know exactly unless you've been doing You know a flush or f sync, right the equivalent of that In in with with m mapping is to use a force So force is the equivalent of flushing and then doing an f sync And these tests we're not doing that So and this is this is a strategy. So just like the file i o f I've not done any flushing or f sync in the file i o test I'm I'm not doing any any force or flushing with memory map And the performance of memory mapping is going to be relative to your file system type the the free memory that you have in your in your system For doing the file system cache and the read write block size Normally M mapping should be much faster than streaming i o This is what's to be expected Okay, so first We're going to look at a simple java implementation for memory mapping now. I don't know if I can pull that in here so this is Very simple implementation. So in the constructor here. We see that we create the file We we we get a channel And then we call map which establish the memory mapping The m mapping to that channel and then The methods here are are really Just a wrapper against the the byte buffer put and get So it's very simple implementation Okay, so we get we have different put here that we're going to use for the different benchmarks. We're going to look at that Okay, here we go. Okay. So the first one So using the java class, we're going to use implicit casting with the default character set that means that we're going to pass in Our ruby string To a java method that accepts a java string, right? So we can see it We we use out that put bytes buffer buffer is a ruby string And below this is the actual method java method which accepts a string A java string so and it it gets the bytes. So so data that get bytes And and we use the buffer Which is the memory map buffer that that's been defined And we do a put for those bytes as simple as that Okay, so in this case Using a mapping we're actually slower than falayo And and by a good margin. So this was a really a What the hell is happening here? Okay, so why is it slower? It's supposed to be much faster. What's happening? Um, so the first question was okay. Is there is there encoding transcoding going on? Um, so let's see if we can you know specify explicit encoding here Um, since our string here are defines that simple, you know a b c d f And then we do a force encoding On as ASCII 8 bits. So our string are already, you know 8 bit transparent and And but we see that the default character set in java is utf 8. So okay, maybe there's transcoding going on So instead of using the get bytes the default get bytes Which is going to be using the default encoding in java utf 8 Let's use the get byte and specify our character set, which is going to be iso 88 um 88 59 1 right So this is what we're going to do here instead of doing a put byte buffer We're going to put byte with the character set and this is going to be passed onto java And there's going to be explicit character set here to avoid any transcoding And this is what we get. So it's a little bit faster So this is not the transcoding problem. So what's happening? So of course here we can look at, you know type conversion. Is this is this what's happening? Um, we see that, you know, we use a buffer a ruby buffer and we pass it To a java method that uses that accepts a string a java string So let's try and do explicit ruby side casting Instead of, you know relying on the j ruby Implicit type conversion So you can see on the upper part where we do the the out the put bytes We're going to do buffer to java bytes So this is going to pass in a byte array a java byte array to that method And this is going to be using below the java implementation of put bytes which accepts a byte array Okay, so instead of so we do an explicit ruby side type conversion So what kind of performance do we get? Um, it's a little bit faster, especially with the bigger blocks So at the 1k block, it's somewhat similar And then 4k and 16k it's getting it's improving Um So this is getting interesting right, uh, we can see Uh, a very big increase in the 16k blocks Um, and um, we're going to see if we can improve on that So We're going to do we're going to try Some explicit java side casting so instead of doing a two java bytes So we're going to you know use a new method that's called put ruby string with the buffer with the ruby ruby string buffer And use and create a method that accepts a ruby string object Um, and then on that ruby string which is data we're going to do a get byte list So get byte list on the the ruby string class actually gets us a byte list Class from which we can get the bytes. So there's two ways to getting the bytes you can get Safe bytes by doing a copy or you can use unsafe bytes, which is going to give you a pointer to you know To the actual underlying byte buffer For that string. So this is what we're going to do here so Okay, so this is getting really interesting We can see that we get up to five gig per second With the one k block and and and you know seven gig eight gig per second for the the four k and sixteen k This is very very good So so we see that the you know The cost of the implicit conversion in j ruby when crossing the world between j ruby and java is very very expensive And and there's basically two ways to avoid that from the ruby side Or from the java side. So when you do it from the java side, of course You have to know a little bit more about the the j ruby api in terms of you know And and especially ruby string. I don't know if any of you have checked the ruby string class implementation It's probably one of the biggest class Yeah, it's it's just amazing So I won't go into the details, but uh, okay, so now Another implementation. So instead of doing, you know, uh talking with a plain java implementation Let's try to create a j ruby extension extension in java and And so our benchmark is pretty much the same. So we're going to use, you know, put bytes and buffer our ruby string and then below we can see the the A j ruby method defined as in java So there's there's some boiler boilerplate code in there. It's basically to check the arguments But at the arrow we can see that we do a buffer put With the actual so we know that the ruby object there is a ruby string So we can do it approximately the same thing is Use the byte list and use unsafe bytes To avoid the copying of the bytes and And that's it. So Performance for that is somewhat similar to To our explicit java that we had but for, you know, a little bit faster for the 1k block So we shave a little bit more time here And if we compare that to instead of using unsafe bytes, so for me when you do persistence Uh, I think it's it's pretty safe to use unsafe bytes because this is usually the end of the story for the string Right, you're not going to mutate the string after that. You're just saving it and you're persisting it But if for whatever reason you you actually want to copy those bytes to get a safe string to persistence Then you can do that instead of using unsafe bytes. You can use get bytes which does a copy And the difference in performance is this so we see that it's pretty significant in terms of performance cost Okay, so the last implementation is a Um Is simply j ruby calling into java directly, so it's a it's a pure ruby implementation of that M mapping class that we've created You know in java or as a j ruby extension and Basically, so the same put buffer buffer put bytes buffer and then below we can see the implementation So so there's the you know the construction So in ruby we're calling the the the the actual java class to create the the m mapping and then eventually we have to put bytes with data and then We we do the same, you know data data to java bytes and so on so performance that we get with that is You know pretty similar to the to the explicit ruby performance. So not that good okay, so um Like I said the uh the motivation for that was to actually implement a persisting queue or persisting size queue. So Uh, i'm gonna take a few minutes. So I don't know how we we are with time Yeah, okay, so i'm gonna go through those slides so This is a schema of the persistent queues Implementation so there's two implementation. There's the you know a standard queue Just the same same api as the tread queue and then there's a size queue Which are you know blocking and tread safe implementation? so they rely on The page endlers and the page queue implementation So page queue is simply a non blocking non tread safe base queuing implementation that uses the page endler To do the m mapping creating the pages and and storing the metadata and and so on So those page ender can have different strategies and there's two strategies So there's one which called the the page cache which caches the the the last uh use pages Because typically you have two active pages when you're doing queuing So there's going to be the tail page and the head page The head page is where you push And and these are Append only pages that are created So and then there's the tail one where you actually put the items right And there's the single page strategy Which is useful because in the size queue implementation Typically the number of of items You know in the queue is small So you can actually use the same page memory map and then just do a ring buffer in in it So that avoids creating more and more pages which is costly using a mapping So this is what I just said So typically when when persisting data, it's just append only pages and these pages You can define, you know the size that you want I'm doing, you know, and we're going to see some more benchmarks, but I'm using two gigabyte pages And there's metadata, which is an imab file itself that that keeps pointer to what is the tail, you know the tail Page index and the tail page offset So where do we read in that page and the same for the head for doing the push on the queue? So just a word of caution If you play with that, you know, this is work in progress and you know proof of concept code. So obviously You should be careful Again, there's no serialization involved These tests with the queuing have been done with one k string objects only The the the map page size is two gigabyte We use a two item page cache and and The for the test we push two million items per producer. So sometimes we have multiple producers. So that's going to be two million per producer Okay, first the persistent size queue It's a limit limited queue size For this implementation, we use a dual queue implementation We push to both the persistent queue and the in-memory queue the in-memory queue is actually just an array Um, so we serialize to push the persistence Persistent queue and we push the original object in memory queue and then when we pop We actually need to only pop from the in-memory. So it avoids deserialization cost And we just update the metadata on the persistent queue So the persistent queue is only there if there's a crash and you need to reread The the the queuing to actually uh, not lose data in that case Okay, so what do we get um with page cache We get so we have one consumer one producer one consumer to producer two consumer one producer and two consumer to producer treads To try different strategies. So we pop and push in those queues. So we get Uh, approximately 100, you know ish megabyte per second in terms of throughput with a size queue Or you know 100,000 Uh, 100, you know between 100,000 and 150,000 uh transaction per seconds So with a single page we get a little bit faster Especially in the the one consumer one producer Okay For the the persistent queue implementation. So this is not the size queue. This is a standard queue that's going to grow indefinitely If if they're the consumer are not catching up Uh, and the push and the pop operation are done are persistent We need to serialize on push and we need to serialize on pop And this is essentially just a thin thread safety wrapper around the page queue implementation Okay, so for read and write we can see that we get a little bit faster Okay, I'm almost done And so for read and write operations So we have consumer and and producer at the same time If we do a write then read then we get a little bit more performance and if we do only write Then we can get up to 500 megabyte per second on only writes So just a few notes Do we really need you know dual queue implementation for the size queues it really faster? I don't know I need to test that The the caching strategy is it optimal? Can we find better page size and cache size? How does that perform on spinning this that would be interesting? I I haven't tested it Is there a faster alternative to the the current page and metadata algorithm that i'm using I don't know You got to try that And the code has to be reviewed in terms of resiliency, you know doing the force do we need to force? Maybe at specific points Um and the last thing you know the elephant in the room, of course, it's the serialization There's a huge cost associated with that so Again, you can talk to guy or uh to set her sheet about that And thank you that's about it