You do need a lot of dependencies, and it's not just Ruby either. One specifically for all the Mac people with NVIDIA chipsets here: you actually need the NVIDIA development driver and toolkit, as well as Ruby 1.9 and Java 1.6. So it's worth grabbing those now so that getting started is a little simpler.

So before we really dive into it, I want to challenge everybody. When it comes to concurrency, to things like multi-threading and scaling in Ruby, I swim vehemently against the current. There are a lot of people making the case for evented architectures and shared-nothing architectures. I'm not dismissing that, but personally I think we are essentially amputating our computers by refusing to deal with threads at all. There are complications that can come up; you do have to deal with sharing and things like that. However, we also have a great tool that helps us deal with thread scheduling and sharing: it's called the operating system. The operating system does a lot of the work for us, though there are still some things you have to learn about concurrency. And I know there are going to be plenty of you in this room who would say the opposite, and I think that's good.

Okay, so this is a synthetic example, but I think it's good to have at least some sort of use case. So let's find the area of a tree ring. Everyone here has seen a circle? Raise your hands. Okay, good, everybody's with me. Now, say we have a really old tree, and we're developing some sort of tree ring simulator, and we want to find the area of every tree ring simultaneously. Let's just go through the math of what we're going to do: the area of the light blue circle inside is pi r squared, that's the entire area. But if you want to find the area of just a single ring, say this dark blue ring, you take pi r squared with a radius of five and then subtract the area of the previous circle, right? And the same for each further ring. We could do this with a single function; it's essentially the same operation, just with a different radius every time, so there are a lot of ways to do this. And we're interested in finding the area of every ring of a very old tree as fast as possible.

So our first attempt is very simple. We have our basic formula right there: a simple function that, given the radius, returns the area of just that single ring. It's not recursive, it's just floating point arithmetic, and it returns a raw float value that we deal with. Then, for a given number of rings, it doesn't really matter how many, we put that in a simple for loop, print to the console, and we're done. Anyone can deal with that; we don't need any special libraries, it runs all by itself. But we have a lot of additional system resources sitting there, and maybe we have this whole database of trees, and no one wants to just serialize everything. So at some point we say: attempt number two, multi-threading. This one is essentially the same thing. We have this def ring_area at the top that takes the radius, and that function remains unchanged, but instead of the simple for loop we have a few more lines. We create a thread, that's the Thread.new right there, and then, for say four threads, you chunk the amount of work, roughly like the sketch below.
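Here's a minimal sketch of what those two attempts look like; this isn't the exact source from the project, and names like ring_area and the even four-way chunking are just illustrative:

```ruby
RINGS = 16_000_000

# Area of a single ring: the full circle minus the circle one radius smaller.
def ring_area(radius)
  Math::PI * radius ** 2 - Math::PI * (radius - 1) ** 2
end

# Attempt 1: single-threaded, a simple loop.
def serial_areas(rings = RINGS)
  (1..rings).map { |r| ring_area(r) }
end

# Attempt 2: multi-threaded. Chunk the work evenly across four threads and let
# the operating system decide which CPU core runs each thread.
# (Assumes rings divides evenly by thread_count, as 16 million by 4 does.)
def threaded_areas(rings = RINGS, thread_count = 4)
  chunk = rings / thread_count
  (0...thread_count).map { |t|
    Thread.new { ((t * chunk + 1)..((t + 1) * chunk)).map { |r| ring_area(r) } }
  }.flat_map(&:value)
end
```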
So if you have four threads, each thread calculates 25% of the number of rings, and we let the operating system figure out which CPU to run each of those threads on. So there, hopefully, you get some big performance speedup. Then of course, if we want to go even lower level, the third and fourth attempts use pure C, which is about as fast as you can get without going insane and writing assembler or machine code manually. This is all in the project, and I'm not really going to go through it in detail. Suffice to say it's the same thing, just in C, and it's a little bit ugly. It's a single C file, and it provides both a single-threaded implementation as well as a multi-threaded implementation using pthreads, which are the POSIX standard threads; so for all the Mac users and Linux users here, that comes with your computer, right? It's a little more complicated, though, and you can see that the entirety of the code doesn't even fit on a slide, so there are clearly some more things to consider.

But at the end of the day, when we're talking about Ruby, anything you pull down with a gem install foobar is usually going to fall into one of these categories. One: pure Ruby, single-threaded; great, hopefully no major dependencies on this one. Two: pure Ruby, multi-threaded, so thread-safe; not a lot of gems, unfortunately, are thread-safe, and actually most of the time when you're looking at a given gem, it won't even say whether it's thread-safe. Please don't do that; do specify whether a gem is thread-safe or not. Three and four: Ruby gems with native components, database drivers for example; for database access you want things to be super, super fast, so a lot of the time MySQL or SQLite, most of the best drivers, will have some sort of single-threaded or multi-threaded native implementation, and that will generally give you a bit of a speedup. And then five, which isn't really code but the general paradigm of divide and conquer: okay, we've optimized our library or function as much as possible and it's still not fast enough, so we scale out horizontally, use clouds, use supercomputers, do something to just chew through the input.

So let's go to some code now. I want to run a few of these examples. The first two are going to be in this first file, using a Ruby 1.9.2 interpreter. So, show of hands: who thinks the first implementation, the single-threaded one, is going to be the fastest? Okay. Who thinks the second implementation, the multi-threaded one, is going to be faster, or that they'll run in about the same amount of time? Okay, interesting. So let's forget speculation and actually run this. Let me make sure I'm using the right Ruby: 1.9.2, good. We'll run this, and it's running a serial calculation on the CPU with about 16 million tree rings, a pretty old tree. It took about seven seconds, 7.6 seconds. Now it's running a multi-threaded calculation using four CPU threads; my system has four CPU cores, and it took 7.7 seconds. It took slightly longer to run the multi-threaded implementation on Ruby 1.9 than the single-threaded one.
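For reference, the timing harness for a demo like this can be as simple as the standard library's Benchmark module, building on the sketch above; the numbers in the comments are the ones from this run:

```ruby
require 'benchmark'

puts Benchmark.measure { serial_areas }    # about 7.6s on Ruby 1.9.2 in the demo
puts Benchmark.measure { threaded_areas }  # about 7.7s: no faster, slightly slower
```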
Let's skip over, I want to show you, we're going to run the exact same script in JRuby. Single-threaded using the CPU: five seconds. Parallel calculation using four CPU threads: five seconds. Now, this does vary slightly depending on what else is going on in your system, but the multi-threaded implementation on JRuby took significantly less time than the Ruby 1.9 version, despite the fact that both are backed by operating system threads.

Okay, so let's dig a little into why that is. First of all, think about your CPU and the way we all write applications on a regular basis. I'm guessing at least nine out of ten people here, if not 98 percent or more of you, have a system with multiple CPU cores on a single die. That's just standard these days, especially for development machines; unless you only spent a couple hundred dollars on a laptop, you probably have a multi-core CPU. Ruby 1.9, even though its threads are native threads, can't actually run them in parallel because of the global interpreter lock, and will only use one of those cores at a time. I don't want to get into the arguments about why that's a good thing or a bad thing, but suffice it to say that for a given Ruby process, you're only going to be able to use one CPU core. Whereas JRuby is capable of using all the cores simultaneously, because it doesn't have that same limitation. Of course, there are some potential implications and issues in the long run. But if we go back to our code here and look at the Ruby 1.9 example, this time I'm going to open Activity Monitor, and this is very small, but just keep an eye on the top item. I have 100% CPU usage. That sounds good, but it's actually bad, because it should be 400%: I have four cores, and this is the multi-threaded version. Now we skip over to the JRuby implementation, just to prove to ourselves it's doing what we think it's doing. Our CPU, when we're running the parallel one, let's see, 200%, 256%, 358%. So you can see, I wish I had a better graphic, but we're getting a big CPU spike. Really, out of the box, we're getting a big performance improvement there.

Okay, so there are some common arguments people make against supporting threads running on multiple cores. One is locking performance: it's kind of nice to have these shared-nothing, single-threaded architectures, because you don't have to worry about locks as much of the time. Another is that you can't have an insane number of threads per host. For example, if you try to spin up 16 million threads, your operating system will laugh at you and say, no, I'm not going to do that, that's just way too many threads; the process will probably topple over and start swapping. And testing multi-threaded applications is extremely problematic: everything you know about TDD and BDD starts to break down when you're testing applications that execute in a non-deterministic manner. Unfortunately, there aren't any frameworks in Ruby that I know of that deal with that problem today. So here's the key question: rather than executing our instruction 16 million times, on one thread or a handful of threads, can we execute every single one at the same time, just with a different data point? Instead of running our function 16 million times, can we run it once? On a CPU, the answer is no. You can't. That's just not what it was designed for.
The CPU's paradigm, as it's called, is multiple instruction, multiple data, or MIMD, so this kind of parallelism isn't really plausible in the way we commonly think of our computers, unless you use a GPU. A GPU looks very similar to a CPU in that a physical die, usually on the video card of your computer, has multiple cores; what's called a core on the CPU is, in NVIDIA's terms, a streaming multiprocessor. However, unlike a CPU, where a given core can only run one thread at a time, a given streaming multiprocessor can run entire blocks of threads at the same time, which is very powerful in this particular case. And not just multiple threads, not hundreds of threads, but potentially thousands of threads, concurrently, and this is using hardware that most of you actually have in your computers right now and probably aren't even aware of. So using a GPU, for a calculation like this, the answer can be yes. Now, 16 million at a time is a bit of a stretch, but if you wanted to do this a thousand at a time in parallel, there are pieces of hardware that can do that out of the box. I'll show you some statistics as well, and how you can figure out how many threads your GPU could possibly do.

I'm going to skip the history stuff to save time; you can grab the slides if you're interested. These are just some examples, in case you've never cracked open a machine before, though I'm sure you guys have. This particular one is interesting: it looks like a video card. It's an NVIDIA Tesla; I'm not sure exactly which model it is, it kind of looks like a C1060. Most of these specialized cards are extremely high-performance GPUs, and you access them with the same drivers that you use on your normal machine. However, some of them don't have video output, so it's like a video card without the video part. That's the NVIDIA Tesla; you can load multiple of them into a server at a time, so you can have many of these streaming multiprocessors per system. Here's a chip from ATI, from their Stream architecture, which is similar. And here's a standard, common off-the-shelf chipset, like the chip in just about every Mac here: if you've bought a Mac with an NVIDIA chip in it any time in the last few years, and you probably have, you actually have OpenCL 1.0 capability out of the box, because the driver support comes with it. You don't have the SDK and the tools, so you need to download those, but you have the hardware and driver support for it.

So, the GPU pros: the SIMD architecture, the ability to execute a single instruction over multiple data points simultaneously; potentially thousands of threads; built-in synchronization, which sometimes is what you want and sometimes isn't, depending on exactly the type of calculation you're trying to do; and much of the time, more floating point operations per second than the CPU, despite a lower clock rate. Okay. So I'm going to jump to some Java code here very quickly. Don't sweat the Java code, but I do want to show you the information that comes out of the driver when you query the system. And I realize this is too small. Okay, so I have two compute devices on my machine, one of which is the GPU, up here on top. The other is my CPU, on the bottom. Let's look at the CPU to start with.
It's an Intel Core i7 CPU, OpenCL 1.0. The number of compute units is simply the number of cores on my machine, that's four; I'll skip the dimensions. The clock frequency is just under three gigahertz, and it's a 64-bit architecture. For the max allocation, and I'm not exactly sure of the semantics on these, I can allocate somewhere around 1.5 gigabytes at a time. And then there's another interesting one, the default work group size: the number of threads that each core can execute at a time, which is one. And that's per compute unit. Whereas the GPU up here, and this is a standard MacBook Pro, a GeForce GT, did the mic go off? Oh, there we go. Okay, a GeForce GT 330M, NVIDIA, OpenCL 1.0 again. It's a GPU type rather than a CPU type, the number of compute units is six, and the number of threads per thread block that can execute concurrently is 512. Now, granted, the clock frequency, the rate at which the raw instructions execute on the chip, is significantly slower, it's only 1.1 gigahertz, and it's a 32-bit chip as well. So even though my CPU is 64-bit, the GPU is 32-bit, and I believe most if not all of the popular personal-supercomputer GPU cards you can grab today are still 32-bit. Sometimes that's a problem, sometimes it's not, but you just have to be cognizant of that fact. The alloc size, meaning how much memory you can consume at a given point in time, is significantly lower too. On the video card in my machine right now, I can do a 128 megabyte allocation in theory; for some reason it doesn't seem to let me go all the way up to that, but I can do 64, and that's a side issue or something, I'm not sure.

So those are the two compute devices I have in my machine. And then, since I didn't want to lug a giant desktop machine onto an airplane with me, I didn't bring a Tesla C1060 in a machine, but this is the output captured from one. The interesting parts are highlighted in bold, and it's small, but the number of compute units is 30, and the number of threads per work group is 512; that's thousands of threads. The clock rate is a little higher, 1.3 gigahertz, still a 32-bit architecture, and the alloc size was a gig, gig-ish, at a time. So the difference in throughput between what you have in your laptop and what you can get with one of these specialized cards, even for a few hundred dollars, would be huge, and I wish I could have one here with me today.

Okay, so I want to get to some code. How do we actually start approaching this? How do we use this interesting chip that we all have in our machines? What you probably want to do to get started is, well, you're going to need some of the native SDKs, so if you have an NVIDIA machine, you're going to need the NVIDIA toolkit. But you're probably going to want to write to an abstraction layer on top of that, rather than writing to the NVIDIA SDK directly or to the ATI SDKs directly. And the way you do that is with OpenCL. OpenCL stands for Open Computing Language. It's an abstraction on top of some sort of compute unit, an OpenCL device, device being the term, and a device can be either a GPU or a CPU, because as long as the interface can be implemented on top of it, you can treat both of those as compute devices. That's why my simple Java app that queried my system was actually doing an OpenCL query for compute devices, and it found two: the CPU and the GPU. You can have multiple GPUs per system, and you can have multiple CPUs per system as well.
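The talk does that device query from Java via JOCL. For comparison, here's roughly what the same query looks like from Ruby; this assumes the opencl_ruby_ffi gem, which is my choice for illustration, not the library used in the talk:

```ruby
require 'opencl_ruby_ffi'

OpenCL.platforms.each do |platform|
  platform.devices.each do |device|
    puts device.name                 # e.g. "GeForce GT 330M"
    puts device.type                 # GPU or CPU
    puts device.max_compute_units    # cores / streaming multiprocessors
    puts device.max_clock_frequency  # in MHz
    puts device.max_work_group_size  # threads per work group (512 on the GPU above)
    puts device.max_mem_alloc_size   # largest single allocation, in bytes
  end
end
```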
Before you jump directly into the code, there are three terms you should be aware of in OpenCL. The first one is kernel. In OpenCL, a kernel is the code that runs on the compute unit; it's just a function that runs on, say, the GPU. It has nothing to do with your operating system kernel. It's a very poorly chosen word, and we should have chosen something else; when I say kernel, I'm really just referring to functions that are compiled for and run on the GPU directly. A device is something that computes, such as a GPU chip. Not all systems are going to support OpenCL on the CPU, so if you do an OpenCL device query on your machine, you may see a GPU, might see a CPU, might see nothing, might see everything; it's hard to say exactly. And then OpenCL, not currently, but in the future, is specced out to support networks, clusters of devices. So while today we're fairly limited in terms of OpenCL distributing work to other nodes on the network, in the future we're apparently going to be able to have devices that are not necessarily local, and that should allow us to more easily distribute some of these large computational jobs. And then there are device-specific terms, and it sometimes gets a little confusing when you're reading the literature, because ATI and NVIDIA use slightly different terms, and then OpenCL uses slightly different terms again, so there are some confusing points once in a while.

All right, so let's dive into some code. We've got these nine cases, and we've already been through four of them. One and two, which are Ruby 1.9 single-threaded and multi-threaded, and oddly enough, single-threaded is faster for some reason. We haven't done an OpenCL implementation there yet. We did JRuby single-threaded, which is four, and we also did five. Can we do JRuby with OpenCL? The answer right now is no-ish, because under JRuby you can't use a lot of the native extensions; it'll complain about stuff, and I'm not exactly sure why. With JRuby you could load a Java library to connect to OpenCL directly, but if you're trying to load a Ruby library to do it, it's probably not going to work. I tried it: I can't get it to work with a Ruby library, but I can get it to work with a Java library.

So let's address number three; we've talked a little about our C implementation, but we haven't actually run it yet. Now, this is what you don't want to do. This is a Java implementation. Actually, this isn't even the code that executes the algorithm; this is just the setup work for getting a kernel to execute in a Java app, using the JOCL library, J-O-C-L. If you're familiar with Java, it might strike you as odd. One, there are pointers everywhere, which is kind of strange. We're malloc-ing things. We're setting kernel args. We're enqueuing an NDRange kernel, whatever that means. We're enqueuing read buffers. We're releasing memory. We're creating command queues. There's a lot to it. So this is the JOCL binding in Java for accessing the OpenCL implementation on your operating system, and obviously there's a learning curve to it. It isn't really a Ruby-ish library where you don't have to get into all the details. Fortunately, we do have a few options. I'll talk about the negatives as well, but I want to talk about the Ruby way first. So if you have the code, you can bust that out in whatever editor you have. That's too small.
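Before we get to the Ruby way, just to make those JOCL steps concrete, here's the same setup sequence (context, command queue, program build, buffer, enqueue, read back) sketched in Ruby rather than Java. Again this assumes the opencl_ruby_ffi gem and follows its documented style, so treat the details as approximate:

```ruby
require 'opencl_ruby_ffi'
require 'narray'

device  = OpenCL.platforms.first.devices.first
context = OpenCL.create_context(device)
queue   = context.create_command_queue(device)

# The kernel: one thread per ring, each computing one ring's area.
source = <<-EOF
  __kernel void ring_area(__global float *out) {
    int i = get_global_id(0);
    float r = (float)(i + 1);
    out[i] = 3.14159265f * (r * r - (r - 1.0f) * (r - 1.0f));
  }
EOF

program = context.create_program_with_source(source)
program.build

n   = 1024
out = NArray.sfloat(n)                              # host-side result array
buf = context.create_buffer(n * out.element_size)   # device-side buffer

# Enqueue the kernel over n work items, then copy the buffer back to host memory.
event = program.ring_area(queue, [n], buf)
queue.enqueue_read_buffer(buf, out, :event_wait_list => [event])
queue.finish
puts out[0..4].to_a.inspect
```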
So the first one we'll look at is the tree rings script in the bin directory. I'm going to uncomment a method which is going to run a GPU implementation, and here's the actual code for that. This doesn't have to be in gems or anything, by the way; I just did it this way for the presentation, and I meant to take the time to pull a gem out of it. Okay, so the GPU implementation is surprisingly simple relative to the Java version, because we have this nice little require 'barracuda'. Barracuda is a Ruby binding library over the NVIDIA CUDA drivers and OpenCL. We create a new program that's going to run on the GPU, and we define a C function. I mean, if you've done native extensions before, and I know we have a talk later this morning about running native C code, it looks just like a C function. There are some extensions to it. For example, we have this __kernel declaration here, which tells the compiler that this is actually going to run on an OpenCL device; it's not just a function for the CPU directly. There are a few other special flags, and when we're writing kernels, there are also some special API calls we can make. This particular one here, get_global_id, is the OpenCL function to get your unique thread ID for the job. So if you were running this 512 times, it would execute simultaneously across all those threads, and it would return a different value, a unique value, for each of those threads. So we have the same implementation here, just in C instead.

And then we have an output buffer, which is actually going to need to be transferred back to CPU memory. The GPU and the CPU operate in different memory spaces; they are two completely physically different processors connected together by a bus. When we think of computing normally, running high-performance algorithms like this, most of the time, implicitly in Ruby, we allocate some amount of system RAM and we run all our code on the CPU. The GPU has its own memory space: it has its own chip-wide global space for data to be accessed from and delivered to, and there are also some more specific areas of memory, since we have these powerful thread concepts; each block of threads has its own shared memory space that it can write to as well. So you have a few more options, but it's complicated, right? Because now we have separate memory areas: it's not just system RAM, now we have GPU memory too. But all in all, when this kernel runs, the function output is put into this output buffer, and the Barracuda library specifically will automatically take care of copying that data from the GPU back to system memory and giving you a nice array in Ruby. So it's really super simple; the entirety of the GPU implementation is this. And that's it.

So, five minutes left. We have the C function, and then, since I'm a little short on time, I'm going to comment this part out. I also have this native C implementation as well; unfortunately, if you want to try it, you're not going to be able to run it without the NVIDIA tools installed, because NVIDIA has its own specific compiler. All right, let's run the Ruby 1.9 version one more time. Again, this is a synthetic example, right? And I stacked the deck, so I know what's going to happen on my specific machine.
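Put together, the whole Barracuda version described above amounts to something like the following. This follows the gem's documented style, a Program.new with a kernel string, a Buffer for the output, and a :times option for how many parallel executions to run, but I'm reconstructing the details from memory, so treat the exact API as approximate:

```ruby
require 'barracuda'
include Barracuda

# The kernel is plain C with OpenCL extensions, compiled for the GPU.
prog = Program.new <<-'EOF'
  __kernel void ring_area(__global float *out, int total) {
    int i = get_global_id(0);            /* unique thread id, one per ring */
    if (i < total) {
      float r = (float)(i + 1);
      out[i] = 3.14159265f * (r * r - (r - 1.0f) * (r - 1.0f));
    }
  }
EOF

n   = 16_000_000
out = Buffer.new(n).to_type(:float)  # output buffer, copied back to host memory for us
prog.ring_area(out, n, :times => n)  # run the kernel once per ring, in parallel
puts out[0]                          # results come back as an ordinary Ruby array
```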
But the point I'll make is that despite the fact that this is just a common off-the-shelf laptop, the performance speedup for an algorithm like this can be fairly dramatic, and it comes from a chip that you don't even use on a regular basis. There are a lot of algorithms this applies to. Now, the C implementation, and this is also an interesting example. I'm going to compile this and run it right after the other. And enter. Oops. Just kidding. Compile errors. Yeah, comment that out at the top. Or delete it, actually; it's never declared. Okay. I'm going to comment out the direct GPU version, but the point I want to make here is that even compared with the GPU, the native C implementation is too fast to measure. Well, it's either too fast to measure, or there's some sort of bug and it's not actually doing anything. The first time I ran this, I thought, this doesn't do anything, there's some sort of issue. And so I put in some print lines and put in sleeps, but no, it's just way faster. So sometimes we shouldn't be immediately dismissive of C extensions. It's nice to have things in Ruby, but the raw speed, especially in Ruby 1.9, it's not the language. The language is great. I love the language; I think we all love the language, that's why we're here. But the interpreter, out of the box, is a bit on the sluggish side, and I don't think that's a big debate point. So all in all, I just want to pose the question: perhaps there are some additional tools in front of us that we should be using.

In fact, let's take a few questions. The question is: why is the Ruby GPU version slower than the C implementation? Right, so there are some operations, which I may not have time to get to in these slides, that are not free. For example, you have to copy large amounts of memory back; I think it was probably at least 64 megabytes of result data being copied back from the GPU to host memory, and that has to go across a copper wire, which is not a free operation. If your output is four bytes, that's going to look like a blip, but if you're copying gigabytes and gigabytes of information back and forth across that wire, it's going to start slowing you down.

The next question is: if you're going to really scale this out and build a massive hardware farm, is there a particular part that gives you the most bang for the buck? I'm familiar mostly with NVIDIA, because I have a C1060 card to play with; I'm not very familiar with the ATI Stream side. In general, I've been happy with the NVIDIA cards; they are very, very fast. It's kind of cool, too, because if you have the same class of hardware in other systems where you can run OpenCL, you can prototype everything locally and then just push it to the server farm and execute that much faster there. So I would check out the NVIDIA Tesla series specifically as a starting point. Time for a couple more. The question is: have I played with Amazon's GPU instances? A few months ago, or late last year, Amazon introduced images where they actually have GPU hardware that you can access via the image. I haven't heard a lot of hype about developers actually using it.
The direct answer is no, I haven't played with it, mainly because the pricing isn't free, and when you're talking about leaving these systems up for a certain amount of time, I'm not sure I want to spend, well, I don't know exactly how much it is, but it's not micro-instance pricing, for sure. The next question is: since we probably don't care about tree rings, what would you actually use this for? Personally, this came out of a research paper submitted elsewhere, and it was adapted to Ruby because I'm personally very interested in GPU computing. In terms of web apps, and since this is probably a crowd with a lot of Rails people, you'll appreciate this: it's probably not something you'd be directly interested in. Really, any time you have some sort of mathematical calculation that you're running many times in parallel with just different data points, that's when it's extremely valuable. The CPU is still your friend, but it just wasn't designed to literally run the same instruction simultaneously; even if your code is multi-threaded with synchronized blocks or whatever, it's not literally simultaneous on the physical hardware. So the appropriate use case for something like this is usually some sort of small kernel, some sort of small function, usually fairly simple math, that you can run en masse and that returns a simple set of data. In practice, one of the other examples we did, in Java, was complete state enumeration: given a network of nodes and edges, where you have, say, 30 nodes, compute the probability of any given state of all the nodes in the system. That turns out to be computationally difficult, because there are two to the thirtieth states you have to evaluate. It's a great example of what these GPUs are for, because you can execute thousands of those probability calculations simultaneously using a GPU card. Thank you. Thank you.