And that's harnessing other languages in Ruby. I'm not supposed to say we need to make Ruby better, but even though it seems like a crazy idea, there's a method to this madness. I work for a company called Intellectual Software. I've been programming for a long time, but that's just because I'm old, and I've been doing Ruby since pretty much 2010. It was mostly Java before that, and I'm a big fan of C. Our company is basically split between two places. This is a map centred on San Diego, and the other red dot is Cape Town, which is where I am. They wouldn't have mentioned it, so I'll brag about how far I've come: basically, the closer you get to the edge of the map, the closer you are to the other side of the world, so I'm pretty close to winning, I think. It's not actually the exact other side of the world. I did check: there's an island near Antarctica called Kerguelen Island, also known as Desolation Island, so if you want to go there, put it on the bucket list. The background to this talk is really the reason why I would even think of using another language. It's not because I think Ruby is dead. We keep saying Ruby is not dead, and I remember people saying Java is dead; it's still very much alive as well. So it all started in a startup, like a lot of things do with this crowd. There was a challenge, and it was accepted. I sometimes think I would like to just work for a bank and not face that kind of challenge. But anyway, the challenge, or the value proposition in our case, was to bring meaningful insights and analytics to market research. The companies we were dealing with are very enterprisey, and they have systems that date back to before me. Literally, one of the file formats we have to deal with is older than I am.
It's one of those binary formats where you read one record and then you know where the next one starts. It's basically a dead file format. And the only problem with this challenge was that, like a typical bootstrapped startup, you've got to deliver something really, really quickly. It doesn't have to be the full product, but you've got to continuously justify your existence through repeated demonstration of adding value. One of the particular technical problems with this startup is that the data is highly dimensional, and I mention that just to say that SQL is not your friend here; imagine any number of columns in your queries. There are lots of caveats, things like OLAP cubes and roll-ups and so on, and I'm going to skirt over that a little bit. The data is very loosely structured. It comes from all over the place, all over the world, in all sorts of different formats. There are no standards, so it's kind of a best-effort thing. And finally, there needs to be real-time interrogation of the data. I say that because real-time interrogation is where caching strategies and operating off the disk, lots of the standard approaches, kind of fall away. But, probably like many people at this conference, our initial solution, given the time constraint, was Ruby on Rails. This was 2010. And to be honest, the analytics stuff was not a huge chunk of what we needed to do. We had other stuff we needed to do to be an enterprise product: we needed authentication, we needed attachments. So we had things like Paperclip, Devise, CanCan, all that stuff, and that just saved us a huge amount of time. I'm not going to bash Rails, because I don't know how we would have delivered on time without it. And besides, I can always pull this quote out: performance is a nice problem to have. It means you're growing. It means you've validated your product. And that was basically my attitude as well.
Of course, the 2010 date is kind of relevant as well, because that's when Twitter were saying the same thing, and for them it did eventually become a problem they really had to deal with. So three years later I was looking around quite desperately in a similar way. Not that I'm going to compare myself to Twitter; that was a big deal. In between 2010 and 2013 we did a lot of incremental improvements, and we were doing most of the analytics in Ruby. We went from not doing that much, to doing a little bit more, to actually doing quite a considerable amount, probably an embarrassing amount, of real-time analytics in Ruby. We started off with just plain Ruby, and then we used Bignum to kind of speed things up. That sounds weird, but we needed to do basically two types of operations: a lot of set theory stuff like intersections and unions, so a million boolean operations, and then aggregations inside those. And doing the set stuff with big integers used as bitfields is quite nice. It was a hack, but in one week we had something that was quite fast. That approach is basically called vectorization, not to be confused with the other things the word vectorization can be used to describe; in data science it means a very specific thing. Who of you were at Chris Seaton's Deoptimizing Ruby talk? He gave a very good overview of why Ruby can be quite slow for certain types of things, like the expression on the left. There's a lot Ruby needs to do just to multiply left times right: the type of an integer potentially needs to go from Fixnum to Bignum, it needs to check for monkey patching, it needs to do bookkeeping. And basically that's competing against one CPU opcode in C, which is MUL: IMUL, FMUL.
So even having one extra opcode there means you're half the speed. The traditional approach in dynamic languages is to vectorize this, which is to say: I'm going to create an object that represents a lot of things, like an array or a matrix, and when I multiply it by another one of those, I can go straight down to C and do that quickly. That's also quite cool because it looks better as well. It's more symbolic, it's how you would think, it's how you'd write the stuff down on paper. So it's kind of a win-win. Phase two was basically that we needed to get rid of our horrible Bignum hack, because it made me sad to AND two Bignums together; the linearity of the OR-ing was falling apart, and I have no idea why. Makes no sense. So we forked and contributed to an existing bitset gem, which does the right things with proper bitfields. And that got our set operations going really quickly. I don't really think it could be done that much more quickly; I don't know about things like pipelining and stuff like that, so maybe there's one tiny order of magnitude left. And this basically left us in a position where aggregation was dominating over set calculation. This didn't take us long to do, and I kind of deferred having to really look at the performance of the aggregations to concentrate on other things, until aggregation did become a problem. All this time we were always looking for the silver bullet, something that really allows you to do everything in a clean and uniform way. GSL was not quite the answer, but it did allow us to do vectorized aggregation and statistics pretty well. It's built on a C library, the GNU Scientific Library, and that's built on top of other stuff like BLAS and LAPACK, which in turn date back to 1979 and 1992. And they're not even that crufty, just because no one wants to touch them; they're perfect at doing just those tiny small things.
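The integer-as-bitfield trick I described can be sketched in a few lines of plain Ruby. This is illustrative, not our production code, and the names are made up: the idea is simply that each member gets one bit, so a whole set becomes a single big integer and intersection or union becomes one bitwise operation that runs down in C.

```ruby
# Each respondent id maps to one bit; a segment is one big integer.
def to_bitmask(ids)
  ids.reduce(0) { |mask, id| mask | (1 << id) }
end

# Counting members of a set = counting set bits ("popcount").
def popcount(mask)
  mask.to_s(2).count("1")
end

bought_oreos      = to_bitmask([0, 2, 3, 7])
bought_toothpaste = to_bitmask([2, 3, 5])

both   = bought_oreos & bought_toothpaste  # intersection, one AND
either = bought_oreos | bought_toothpaste  # union, one OR

popcount(both)   # => 2  (ids 2 and 3 bought both)
popcount(either) # => 5
```

The appeal is that a million-member intersection is a single `&` on two Bignums instead of a million Ruby-level comparisons; the catch, as I said, is that the performance characteristics of huge Bignums can surprise you.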
So it's fairly solid, but again it only allows you to optimize a particular pathway, and it buys you time. And then a certain US supermarket chain, that won't be named, came to us for some work, and it was the biggest dataset we'd ever seen. It basically included every single receipt, every shopping trip, that had ever gone through the tills in the last four years, for all stores in the United States. So you can imagine it's a huge amount of data. And one of the classic questions was: if you buy Oreos, what type of toothpaste do you like? And we're talking about real-time stuff, so you can't even do that in the background. It was kind of horrible, because it was two or three orders of magnitude more than we were used to, but we decided that we needed this client anyway, and we were going to hack it for now. NArray is a gem written by Masahiro Tanaka; it's kind of like a Python NumPy clone, and we used the parallel gem as well. And this is going to make people squirm, but basically we were just trying to see if it was possible. It was possible: we did manage to do some of the types of analyses that they wanted to do. So, roundabout between 2012 and 2013. This chart is a logarithmic scale of our approximate throughput. I've worked this out with a lot of assumptions, but it's approximately right, so it's useful. And that's kind of how you want your performance line to go when you're trying to improve it. That looks kind of good, but there's a hidden sacrifice happening between 2012 and 2013, and it's basically that while we kept performance climbing, our abstraction power started to flatline and dip. We weren't making our programmers any more productive, which is counter to the Ruby way of doing things.
And the worst part was that, as we were optimizing all these little code paths, we were also increasing the complexity, to the point that very few people could work on it. We started having people who weren't allowed to cross the road without help, just in case they were hit by a bus, because there weren't any other people who knew about that code. So we knew we had to do something. Again, there's an obvious question there: isn't there a database or some kind of service that can do this for you? It may be possible that there is. We had a lot of people look at this, and I think Apache Spark is potentially something that could deal with it, but certainly at the time we didn't think so, and that's where our story goes. So the answer is no, in this case. We had chosen Ruby to get the product out of the door as quickly as possible. What would happen if we hadn't chosen Ruby, or if we were to start again in another language? This is a thought experiment, particularly looking at the fact that data is now the most important thing that our company does. So, a quick tour of the options. R. R is like the granddaddy of open-source statistics. It's got a big heritage, dating back pretty much to 1976 through S, though some of the core concepts came in about 1997. That's a really long time ago, and the ecosystem there is really big. One thing R is really, really good at is munging data, and I have to explain the term "munging". I'm not sure where it came from; I actually tried to find out last night. It's definitely a thing, but I don't know its origin. Basically, when you're dealing with data in the open world, as opposed to, say, inside banks or financial institutions, you're dealing with data that comes in any format, with any number of conventions, or no standards at all.
And you often need to pull it apart, reshape it from long to wide, change the dimensions you have in there, to get it to merge with something else that's almost, but not quite, the same. That's munging. It's basically just having a whole bunch of tools to pull your data apart and reassemble it to answer questions, and R is the best at that. You also have to look at the JVM, because otherwise you haven't done your due diligence these days. Scala, because I couldn't go back to Java. I came from Java, and I tried to write one class the other day, I wrote a loop, and I thought: never again. So it would have to be at least Scala. Scala tends to focus more on how much data you can get through it, rather than what you can do with the data. There's Apache Spark, which is built on Scala, and the people who tend to prefer Scala, generally their problem is the size, not the complexity, of the analysis. On the plus side, there are many shared ideas, and you can certainly mix styles in Scala. The collections stuff is not too far away from Ruby, so you can write Ruby-ish Scala. You can also write functional Scala, or even Java-ish Scala. Pretty sure you can write anything in Scala; that's one of my favourite things about Scala, it comes with a free kitchen sink. And of course Clojure. I know a lot of you have probably experienced the same thing as me, which is people who won't stop going on about it. But then, when I was doing Java, people couldn't stop going on about Ruby, so I thought, well, maybe that's a good sign. It does have some interesting libraries for data munging. One of them is Incanter. It's based loosely on R, but it's one company's attempt to make something useful for themselves; it's certainly not a huge project. And the rest of it is dominated by Datomic and a roll-your-own approach, which does seem quite pervasive in Clojure. It comes with a free hammock and a poster of Rich Hickey.
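To make that long-to-wide reshaping concrete, here's a minimal sketch in plain Ruby. The data and field names are made up for illustration; this is the kind of operation that R's reshaping tools (and later pandas) make a one-liner.

```ruby
# A "long" table: one row per (city, measure) pair.
long = [
  { city: "Cape Town", measure: :rainfall,   value: 515 },
  { city: "Cape Town", measure: :rainy_days, value: 103 },
  { city: "San Diego", measure: :rainfall,   value: 263 },
  { city: "San Diego", measure: :rainy_days, value: 43  }
]

# Reshape to "wide": one row per city, each measure becomes a column.
wide = long.group_by { |row| row[:city] }.map do |city, rows|
  rows.each_with_object({ city: city }) do |row, out|
    out[row[:measure]] = row[:value]
  end
end

wide
# => [{ city: "Cape Town", rainfall: 515, rainy_days: 103 },
#     { city: "San Diego", rainfall: 263, rainy_days: 43 }]
```

Once the data is wide, it can be merged against another wide table on `city`, which is exactly the sort of munging pipeline I'm talking about.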
And then of course someone said I should look at Haskell, because it comes with a free beard. I kind of wanted to, because I don't think I could grow one by myself. I need language help, but you only get the beard if it compiles. So the answer kind of surprised me, and this picture is relevant because I didn't expect Python. I would expect quite a few Ruby people to think like I did, which is that the languages are similar enough that if you know Ruby, why on earth would you look at Python? And it's more of a cultural thing than anything else. Ruby has a slight tendency to do things more than one way, and idiomatic Python is slightly more explicit. So why Python? The first point is really the main one: the size of the scientific community. That was kind of a surprise. It's big. And kind of like Facebook and Twitter, there's a network effect there. Once the libraries start to accrete, and libraries get built on those libraries, and those libraries in turn enable new things to be done, suddenly there's lock-in, and there's value in the ecosystem, not just in any library by itself. And there's incredible depth there now. It's basically been going on for about 10 years, and where it is now is very impressive. And then, given that there is this cultural difference, the similarity is a good thing: it's not that different. Bundler is better than pip install, but it's kind of the same thing. We were looking for vectorization, and certainly we weren't going to move everyone over to Python, but for the people who would need to do any Python, it wouldn't be that difficult. So I'm going to take a quick look at what Python has to offer. NumPy is kind of the bedrock of the scientific Python stack. It's an array computing library that is pretty much all about vectorization. It goes back to 1995, and it's on its third rewrite. The first two iterations hit architectural limitations: we need to do this again.
So you can be sure the architecture is pretty solid. They are talking about a fourth rewrite, but it is a solid library. And on top of that, there's one of my favourite libraries, pandas, which used to stand for "panel data analysis" or something like that, to do with analyzing survey data. It's now just got pandas everywhere, like actual bears with bamboo and stuff, so it's kind of just pandas now. It's built on NumPy, it completely relies on it, but it ports a lot of what's good in R into Python. And what's good in R is the data frame and the series, which, if any of you have ever looked at R, you'll recognise; you'll see an example of that later. It is very, very, very fast. It takes vectorization further and gives you higher-level abstraction tools. And wherever NumPy doesn't help out, the stuff is Cythonized, Cython being like a Python DSL for generating C code, or it's actually written in C where even Cython is not fast enough. It's been around for long enough that a lot of stuff has been optimized. And it is the munger extraordinaire of data: it's a hugely cynical data analysis library that can pretty much look at anything. And you get a couple of bonus extras, which we were not immediately interested in, but it's nice to know they're there. SciPy does linear algebra, fast Fourier transforms, clustering analysis. There are IPython notebooks, which I'm going to skip over for now because they're relevant to the end of the talk. SymPy, which, if you want to actually solve algebraic problems, or cheat during your high-school math, is a good library to know. The Natural Language Toolkit, for things like analyzing Twitter feeds for whether people hate you or like you. And machine learning in scikit-learn. So, I just want to reinforce the strength-of-community thing, because it's particularly the scientific aspect more than anything else that I'm talking about here, not Ruby versus Python. I've got some Git commit charts here.
This isn't a super-fair comparison, and I'm hugely grateful to NArray and GSL, because our business relied on them, but it's important to bear this in mind. That's total commits. What's even more astonishing is the contributor count for pandas: 310 people, through a combination of, well, mostly pull requests, and that's a huge, huge number. If you look at issues opened and closed, you can see pandas again: 4,681 closed issues. I know there are a lot of outstanding issues too, and I would normally be quite scared of 1,000 open issues, but that's just because there's a huge number of people using it. That's quite close to Rails, and that's just for one library. And it's worth diving into those issues: just break them down by their GitHub tags. There are 450 closed issues for time-series stuff. Now, if you think dealing with time series is easy, I would just pause; and if someone says "why can't we just do this ourselves?", I show them this chart, because this looks like pain that has been fixed by someone else. The 90, the figure on the right, that's just dealing with CSVs. CSVs, in the data world, are sort of the simplest, most ubiquitous form of exchanging data. 90 closed issues for CSV. Pandas has just about the fastest CSV parser I know of. It's certainly orders of magnitude faster than Ruby's, and it's incredibly cynical: it will handle stuff written on a Mac on System 7 in some terrible format. It just does everything, quoting, currency, date formats, missing values, all that stuff, it does it, and it does it quickly. So that's awesome, but obviously I'm not going to rewrite, well, we aren't going to rewrite the application in Python. A, because we have a lot of Ruby programmers, and B, because we like Ruby. We just want some of this goodness. So the problem then became: can we get the flexibility of something like pandas, with the speed of something like NumPy,
with a Ruby API that feels local and natural, and, as a bonus, scales horizontally? Because that's increasingly becoming something we definitely need to do: farm work out to cheap Amazon instances when we need to. And for inspiration, we didn't have to look that far. The most obvious thing, which I think everyone will have had exposure to, is Active Record scopes. You're effectively writing SQL in Ruby. It's deferred, it's composable, but it ends up running SQL. And it's a fairly simple model to get your head around. Of course, our case is a little bit different. We're talking about one general-purpose language handing off to another general-purpose language. There be dragons, potentially. Getting Ruby to run Python is kind of the same thing as getting Python to run Ruby; it just depends on who's the boss of the API, if that makes sense. Think about Active Record: Active Record speaks SQL, so SQL is the boss of the API. We decided that Ruby should be the boss of the API, and that the Python side should understand Ruby. So that means transforming Ruby code into data, sending that data over the wire, and transforming that data into Python. It's a simple pipeline. And I've got "Python or other" on the slide, because the beauty of sending Ruby over the wire is that we don't need to lock ourselves into Python. This isn't a Python love fest. There's a practical use that we have for it, and it's completely feasible for us to target any language, including Haskell, so I can get my beard. So, I don't know if you've heard this term, but everyone knows someone who's a bit of a Lisp fanatic, and they generally talk about it with a spaced-out religious zeal on their face. There's an XKCD about that. There's also an XKCD about lessons from Lisp, and as usual, Randall Munroe nails it: Lisp just keeps coming back to haunt us.
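The Active Record model I'm pointing at, deferred, composable, and immutable, can be sketched in a few lines of plain Ruby. This is a toy with made-up names, not Active Record itself and not our API, but it shows the shape: chaining builds up data, and nothing runs until you ask for the final query.

```ruby
# A minimal deferred, composable, immutable query object.
class Relation
  def initialize(table, conditions = [])
    @table = table
    @conditions = conditions.freeze  # immutability: scopes never change
  end

  # Chaining returns a NEW relation; the receiver is untouched.
  def where(condition)
    Relation.new(@table, @conditions + [condition])
  end

  # Deferred: the SQL only exists when somebody asks for it.
  def to_sql
    sql = "SELECT * FROM #{@table}"
    sql += " WHERE #{@conditions.join(' AND ')}" unless @conditions.empty?
    sql
  end
end

adults = Relation.new("people").where("age >= 18")
locals = adults.where("city = 'Cape Town'")

adults.to_sql # => "SELECT * FROM people WHERE age >= 18"
locals.to_sql # => "SELECT * FROM people WHERE age >= 18 AND city = 'Cape Town'"
```

Note that building `locals` did not change `adults`; that immutability is exactly what makes the chaining model safe, and it's the same property we needed when shipping expressions to another language.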
So, code as data is very much a founding principle of Lisp, and I hope to demonstrate that. There are two key lessons from Lisp that we used in building this product. The first is the use of S-expressions, which I'll go into next. The second is immutability. Immutability is absolutely key. I'm not going to dwell on it that much, but you know Active Record is doing it, so it is important: as you're chaining stuff on, you can't actually affect any of the previous, what should I call them, scopes. So, S-expressions. They're super simple. Lisp is all about parentheses, and S-expressions start and end with parentheses. The first element is a function, and the other elements are optional, and they're data. And there are a couple of special functions in Lisp. There's a bit of a debate about whether you need three magic functions, or seven, or eleven, but you don't need that many. One pair of them is quote and unquote: effectively, if I quote a function call, I'm saying it's data, and unquote turns it back into code. But I'm not going to dwell on that; it's not that important. I think an example is probably best. A lot of people will have seen reverse Polish notation, which is quite a similar concept.
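In Ruby you can model S-expressions as nested arrays, where the first element is the operator and the rest is data. Here's a toy sketch of that idea (not our production representation), including the tiny evaluator that "unquotes" the data back into running code:

```ruby
# An S-expression as data: first element is the operator, rest are
# arguments, and arguments can themselves be S-expressions.
# (* 2 (+ 1 3)) in Lisp becomes [:*, 2, [:+, 1, 3]] in Ruby.

def evaluate(sexp)
  return sexp unless sexp.is_a?(Array)  # a bare number is already a value
  op, *args = sexp
  values = args.map { |arg| evaluate(arg) }  # recurse: read inner-to-outer
  values.reduce(op)                          # :+, :-, :* work as method names
end

expr = [:*, 2, [:+, 1, 3]]
evaluate(expr) # => 8
```

Because the expression is just a tree of arrays, you don't need a parser or any operator-precedence rules to consume it, and a transformation pass (say, rewriting it into Python) is an ordinary recursive walk.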
Function first, arguments second. I've basically got the S-expression on the left and the Ruby on the right. And the other thing about S-expressions, and this is basically the most important thing about them, is that they can be nested. It takes a little bit of time to read if you're not familiar with it, because you have to read from inner to outer, not from left to right. But the bonus is you don't have to worry about operator precedence. It's incredibly simple to create, it's incredibly simple to consume, and it's just data, a tree, so we can use very, very simple algorithms to transform this stuff. And a final example: if there were such a thing as ActiveLisp, this is what your Active Record scopes would kind of break down to. You can see it turns the call chain on its head: you end up with the thing you do last, first, so you have to read from the inside out. And this is basically what we send over the wire. We have Ruby objects that work pretty much like Active Record scopes, and we end up sending the thing on the right over the wire. There are limitations, obviously. We don't run into that many of them, luckily, but an example of a limitation is this; it's kind of an example from our API. I'm basically trying to calculate z-scores for some column in a table. A z-score is basically a measure of the number of standard deviations from the mean, and you can see in the second-last line that it's quite readable: you basically take what you're looking at, you subtract the mean, and you divide by the standard deviation. And that pmap, pmap is parallel map, does exactly what you think. It runs on multiple Python backends, and we can throw huge chunks of data at it; the work gets farmed out to Amazon instances, and the results get returned.
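For reference, here is what that z-score calculation computes, written as plain, eager, local Ruby. This is not our deferred API, just the textbook arithmetic (z = (x - mean) / standard deviation, using the population standard deviation), so you can see what the expression shipped to the Python side actually does:

```ruby
# z-score: how many standard deviations each value sits from the mean.
def mean(xs)
  xs.sum.to_f / xs.size
end

def stddev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / xs.size)  # population std dev
end

def z_scores(xs)
  m, sd = mean(xs), stddev(xs)
  xs.map { |x| (x - m) / sd }
end

z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# => [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

In the deferred version, that `map` body is captured once as data, which is exactly what makes the limitation on the next slide bite.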
If I were to change this to have a little bit of Ruby-side logic in that block, say a divisor lookup, so that instead of dividing by the standard deviation I'm looking something up in a hash, it isn't going to work, because that map block is not actually executed on the Ruby side multiple times. It's executed once, to be turned into data to go to the Python side. So there's a bit of a cognitive hurdle you have to be aware of, and it just means you have to have a powerful enough API that you don't really need to do this very often. For us, and we've been using this in production for about six or seven months, it isn't a problem. But we do get some benefits as well. Compilers do optimizations on abstract syntax trees, which are pretty much S-expressions, and so can we. We can do things like automatically sharding work that we can tell can be parallelized, even though the Ruby client is not aware of that. We can also do things like this: you may ask to load a big CSV or some other data source, but you only end up using a few columns. We can propagate that column selection all the way back to the source, so that we only read that data. And that's quite cool. And the other thing I mentioned before is that we can target multiple backends. So that's the theory, and I ought to show you something now, because it's all been abstract until this point. Okay, so what you're looking at here is, well, I don't know how many people are familiar with IPython? Okay, not that many. It is fantastic. It blew my mind. I might do a lightning talk on it, because first of all, it's not limited to Python. Donald Knuth has been going on about literate programming for years, and this is a real attempt at it. It's a mix of code, REPL and Markdown in one document that can be exported and run elsewhere. And if you think about it from an academic perspective, shipping a paper off with your data and your code and the actual paper itself, all in one thing, that's mind-blowing.
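That column-pruning optimization is just a tree walk over the S-expression. Here's a hedged sketch of the idea, with a made-up representation in which `[:column, name]` marks a column reference; it is not our actual node format, but the algorithm is the same: collect every column the query touches, then push that list back to the load step so only those columns are read from disk.

```ruby
# Walk an S-expression tree and collect the columns it references,
# so the loader can be told to read only those columns.
def referenced_columns(sexp)
  return [] unless sexp.is_a?(Array)
  if sexp.first == :column
    [sexp[1]]
  else
    sexp.flat_map { |node| referenced_columns(node) }.uniq
  end
end

# mean(rainfall - baseline): only two of the source's columns appear,
# so the CSV load can be pruned to just those two.
query = [:mean, [:-, [:column, :rainfall], [:column, :baseline]]]
referenced_columns(query) # => [:rainfall, :baseline]
```

Because the client only ever handed us data, not live code, this kind of rewrite is safe: the Ruby side doesn't need to know or care that the plan it described was pruned or sharded underneath it.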
Anyway, we use an IRuby kernel, which was written by Min RK and Daniel Mendler, and we've also contributed to it. I'm just going to fire this up. This is us creating a basic data structure, and it breaks down into an S-expression, which is effectively what we send over the wire. You can toggle that, and you can see what it gets rewritten into as it's turned into Python. Let's do something a bit more advanced. This is us doing some more dimensional stuff. I was going to regionalize this for the US, but I ran out of time. So this is two dimensions, country and city, a hierarchical data structure, with two measures, rainfall and rainy days. And we need to do things like extract a dimension out of that, kind of OLAP-y, and potentially get the mean of these things. And this is all happening over the wire on the Python side. You can see this breaks down into a Ruby S-expression, and this is what we send over the wire, and it then gets rewritten and turned into this Python. And that works, it works very well. Another advantage we have is that with all of these operations we actually get runtime information back. So we know the mean took 6.4 milliseconds, the group-by 5.5. There are things in there that are not in the code above, like set_index; that's what happens when we rewrite stuff. We keep the Ruby API actually cleaner than the Python we're generating, because we can. And then finally, error handling. This stuff could be a nightmare, but it's actually really, really nice to deal with. This is a similar example, except I've tried to group by state instead of country, so that's wrong. Obviously the Ruby expression remains the same. But if I want to see what happened on the other side, you can see I had a failure on set_index. And for some reason I don't have tooltips, but there is a stack trace that you can't see here, an invisible stack trace.
And that allows us to see, on both the Ruby side and the Python side, where stuff went wrong. And this is basically our IDE for working with the data we work with, with this type of code, just in Ruby, and actually still hosted in Rails. Yeah, it's worked pretty well, and that's it. And I didn't have to use my one spare slide; that was just in case. I don't know how we're doing for time. I suspect not that well. But are there any questions? No, the project is not open source yet. We're still working on it, and we're kind of pushing it through to our clients; when we're comfortable with it, we may look at that. But the other thing that you should look at is the IRuby notebook, because that is cool. I really think it's a hidden gem. And don't let the IPython name put you off: they're actually renaming the project to Project Jupyter, so that "Python" gets taken out of the name and other people don't shiver. Anyone else?