Thank you. I think having this on the chain here is probably good enough. Is that okay? All right, I think this is the first time I've given a keynote address while sitting down. It feels like we're in a living room having a conversation. I think that's good.

Hello everyone. I just got into Bangalore last night. So far I've enjoyed it, though I've mostly spent my time in traffic or in bed. As introduced, I'm going to be talking about solving the two language problem in data science. But that's not quite right: I'm going to be talking about solving the four language problem in data science. And what four language problem is this? You may not have any idea what I'm talking about. The two language problem, maybe; the four language problem, you certainly don't know. I'm actually thinking of a very specific problem that I had at one point.

Back in 2009, I was still in grad school and I was a data scientist, although I didn't realize it, because it was 2009 and the term "data scientist" hadn't really taken off yet. I did a Google Trends search earlier today and verified this: back then hardly anyone was talking about data scientists. But I had this stack that I was using for a project. In this one project, I was using MATLAB for numerical linear algebra, which seems reasonable. But I was also using R for statistical analysis. So I would run some experiments with matrices and then collect some data about them. But doing statistical analysis in MATLAB was just very painful, so I'd export the data as CSV, load it into R, and do the statistical analysis and the visualization in R. Okay, so far not terrible, with two systems. But then some of the stuff had to be very fast, and I couldn't get it to be fast enough in either of these two languages. So I ended up using C for that. So there's C code in there too.
And because you're loading data and saving data, using these CSV files and all this stuff, juggling that around in any of these languages is terrible. You don't want to write that code in C or in R or in MATLAB. So I had Ruby files to sort of handle all of that.

At some point, you step back and look at it and you're like, what have I created? This is a monstrosity. I don't know if anyone here is familiar with Rube Goldberg. He was an illustrator who was pretty famous in the U.S. They even made a stamp honoring him at some point in the 90s. He drew these contraptions where one thing hits another, which hits another, which eventually accomplishes something useful. And that was exactly what this was like. All these pieces come together. It was actually generating R and C and MATLAB code from inside of Ruby.

So from my perspective, this very much was my motivation for Julia. I was like, we have to be able to do something better. I was hanging out with Viral one day and we were complaining about this. And he said, you know what? There's this guy, Jeff, that you absolutely have to talk to. We've been talking about this for years, and we think we can do better and we should do better. So the three of us started talking over email about that and decided to try to do something. And that's how Julia came about.

I have a confession to make, which is that I've actually done worse than this. I've used up to six or seven programming languages in one project. And there's always some justification for every single one. But really, it's a terrible contortion to have to go through.

So what is the two language problem in the first place? This is a term that people may not be familiar with. There's this other thing that may be more familiar, called Ousterhout's dichotomy. It's named after John Ousterhout, who is actually the creator of Tcl. So, you know, we can blame him for that.
But he observed that there was this sort of division between programming languages that seemed to be happening. I think he observed this at some point in the 80s. He noticed that there are systems languages and there are scripting languages, and the world seems to be divided this way. Systems languages are static; scripting languages are dynamic. Systems languages are compiled; scripting languages are interpreted. Systems languages let the user define their own types; scripting languages have standard types. They just have arrays and strings and dicts, and that's it. You don't define new types. Maybe you can define classes, but I think even back then the idea of having classes in scripting languages hadn't really become much of a thing. And systems language programs tend to be standalone: you create an executable, you run it, it does its thing, takes input, produces output. Scripting languages are often used as glue. They connect lots of other pieces together. At the time this was a pretty clear divide, with a lot of examples of it, and it seemed like there were just these two different worlds.

So why is this a problem? This is an observation. It's not really a problem. The problem comes in because this dichotomy means we typically end up with a two-tier compromise. Writing in these systems languages is usually a little bit difficult. There's a lot of extra typing. They're kind of verbose. So for convenience we use a high-level dynamic language that falls very much into that scripting language category. It has all of those properties: it's dynamic, it's interpreted, et cetera. We use that for convenience, but then we end up doing all of the hard stuff in the systems language. So all of the actual work gets done in C or C++ or Fortran, and all you really do is use a high-level language like MATLAB or R or Python to tie it all together.
And this is actually pretty good. It works well. I've written many systems where I write the part that has to be fast in C and make it a Ruby extension or something like that, and then I play around with it and build all the high-level models on top. It's a pretty good design. It's actually really practical.

So what's the problem? It has some issues. The first one is that it's the hard parts that would really benefit from having an easier language. If all you have to do is the easy stuff in the easy language, that's fine. But when you're the person who is actually implementing the numerical algorithm or the data analysis or whatever, and you have to do it in C because it has to be fast, then you're not benefiting from this convenience. You're not benefiting from the ease of use that the other language is supposed to give you. So it's really only the end users who get the benefit, which is unfortunate.

It also tends to force vectorization if you want performance. Anyone who has used MATLAB or R is familiar with this: don't use a for loop. God forbid you use a for loop; it's a terrible idea because it will be very slow. It can sometimes be very nice and convenient to write these very terse one-line expressions that do a lot of work. But sometimes it's awkward. Sometimes you really wish you could just use a for loop. Other times it's wasteful, because you end up allocating a lot of memory for temporary arrays.

The next one I didn't realize at first. When we started this project, I had no idea what a big deal this was going to be. This separation into two levels creates a real social barrier. It builds a big wall between users and developers. The developers live in the low-level language and the users live in the high-level language.
If you're a user and you're trying to figure out what's going on (maybe there's a bug, maybe you're just confused about how something works), you can step through, let's say, some MATLAB code or some R code or Python code. But as soon as you hit the built-in wall in MATLAB, you're stuck, because you don't get to see the code; it's proprietary. In R and Python, you could go look at the C code. I program in C all the time, so in theory I could do that. But once you hit that wall, once you hit the C line, you're just like, it's too much. If you take away that barrier, you tear down the difference between user and developer, and then your users automatically kind of become developers over time.

So, this is that dichotomy again. And what Julia lets us do is tear it down. That was supposed to be a bit... That was very upsetting. But yeah, Julia is dynamic and it's compiled, which is one item from one column and one item from the other. It lets users define their own types, but it also provides all the usual standard types, like arrays and a nice string type. You can produce standalone programs or use it as glue.

Another way to look at this is that it puts Julia in a unique position in terms of speed versus productivity. This graph is sort of an aggregate over some benchmarks, where the score of a language on a benchmark is its time: how long it takes relative to C, assuming that C is about as fast as it gets. You can see that C is on here. It's in this list of languages and it's at exactly one. So C is as good as itself in terms of performance. The other axis is normalized lines of code. I don't remember exactly how the normalization is done, but it's done so that this number kind of makes some sense.
Essentially it is scaled between the smallest number of lines of code for a benchmark and the largest number of lines of code for that benchmark. This is a little weird, because JavaScript looks really verbose here, and that's actually because in JavaScript we had to implement our own matrix multiplication. So it's a little bit of a cheat; JavaScript isn't really that verbose. But you can see that Fortran is up there, very far to the right. C is very far to the right. Java is slightly less far to the right because you tend to write a lot of code. And then in terms of being very concise you have other things like, let's see, Lua, which is pretty far left. Oh no, that's R. R is actually that purple thing over there. So essentially we have here the systems languages, and these are the scripting languages. And what you want is to be in the corner between both of those. And that's exactly where Julia is. So I think that's really satisfying. That's kind of the whole bit: you can have both.

Okay, so let's look at that Rube Goldberg data science stack again. Today I use Julia for numerical linear algebra instead of MATLAB. I use Julia for statistical analysis instead of R. I use Julia for the fast stuff too. And I use Julia to tie it all together. So now we've eliminated this four language problem.

A lot of this could also be done with Python. Python has been growing very popular in data analysis. But one thing you can't really do is get rid of the C for the fast stuff. So you can't get it down to a one language problem with Python. There are systems like Cython, but I think that's a little bit of cheating because Cython is not really Python. It's like Python and C had a weird love child. And I don't really like it. Some people really like it. I find it very confusing every time I go there.
But, you know, to each their own. I like having one language that is truly just one language that I can do all of these things with.

People are going to talk about statistical analysis, visualization, machine learning and numerical linear algebra later on today. So I'm going to focus on this last item: how Julia lets you tie all these things together, how Julia acts as a sort of Ruby or Perl or Python replacement.

So what does it mean to tie things together? Well, one of the natural things you have to do is read, parse, and write files. You can't work without that. To get to the computation you need to have done all of this first, and then you have to write it out again. Running external commands: this is huge. Sometimes you just want to call a Unix command. All you want to do is call sort or something like that. You need to do networking. You need to pull stuff off of the network and you also need to be able to serve things up. And you also want to be able to call any program you like.

By way of example, if you're a sane person you would not generally consider writing a web server in R. I believe people have actually written web servers in R. That seems crazy. In Python it's totally reasonable. So that again is another difference: Python is a general purpose language that sort of has the numerical stuff bolted onto it. It's very different from these other numerically oriented languages.

Okay, so now let's see some code. Okay, so we've started Julia. The first thing we might want to do: there's a download command. I have this in my history. This guy, Jared Lander, is actually a friend of mine. He's in New York. He does a lot of R work. He even helped me run the Julia meetup. He's very good, but he still does all his stuff in R. So we can download this housing data.
It's too slow. I have a copy. So that's pretty quick. And you can see I just switched into shell mode. People may or may not be familiar with this, but if you hit semicolon at the beginning of the line, you get a shell mode and you can just run shell commands. So we can look at the top of this file. What we can see is that it's a CSV file with a lot of data about housing prices and various other things in New York.

So let's load it. There's this readcsv command. We get the CSV back as an Any array. It's just got a bunch of junk in it, all strings. Let's look at some of these. Those are all integers, but the whole thing is stored as one big array of type Any because it includes that first header row. I'd really like these to just be straight-up integers. This is kind of awful.

Well, it turns out there's a whole package for dealing with this kind of data: DataFrames. You'll note that loading that was pretty slow if you were on version 0.3; on 0.4 it's so nice. This is because we can precompile and cache packages that we've already loaded before, which is a really huge advantage. So instead of doing readcsv, we just change this to readtable. It's a little nicer. We can look at a column and see that it's actually a data array. You can do much more with that, and it's a lot easier. I'm not going to spend time actually analyzing these data sets. I just wanted to show how you can download and load them.

So, I kind of thought this was fun. Here's an example that shows the kind of thing you can do. You can do this in Python, you can do it in Ruby, you can do it in Perl. I actually think it's nicer and easier to do this type of thing in Julia than in any of these other languages. This example is a little fancier than what I wanted to show you first. What it shows is that you can open a command. So in Julia, let me start with the other thing.
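The CSV loading step described above (a header row forcing every cell into one untyped array, versus typed columns) can be sketched in a language-neutral way. The talk does this in Julia with readcsv and readtable; the following is only a rough Python analogue using the standard library, and the sample data, column names, and the helper `read_typed` are hypothetical stand-ins, not the housing file or any DataFrames API.

```python
import csv
import io

# Hypothetical sample standing in for the housing CSV; the columns are invented.
SAMPLE = "Neighborhood,Units,YearBuilt\nFinancial,42,1920\nTribeca,78,1985\n"


def read_typed(text):
    """Parse CSV text into named columns, converting all-integer columns to int."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    columns = {}
    for i, name in enumerate(header):
        values = [row[i] for row in body]
        try:
            # A column becomes integers only if every value parses as one.
            columns[name] = [int(v) for v in values]
        except ValueError:
            # Otherwise it stays a column of strings.
            columns[name] = values
    return columns


# Naive read: the header and the data share one untyped grid of strings.
raw = list(csv.reader(io.StringIO(SAMPLE)))

# Typed read: the header becomes column names, the data gets real types.
table = read_typed(SAMPLE)
print(raw[1])
print(table["Units"])
```

Separating the header from the data is what allows each column to carry its own type, which is the point of moving from a raw array to a table.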
So if you want to run an external command, you put it in backticks. It's pretty straightforward. In other programming languages, the backticks will actually just execute the command. I had a friend I was working with who writes a lot of Ruby code, one of the most productive programmers I know. It drove me crazy because he put everything in backticks. What that actually means is: take the output of this command and save it into a variable. He would just use this to run stuff, and he did it everywhere. I would sort of twitch a little every time I saw this code, but it didn't matter. It worked. It worked until it didn't work. There was some subtle bug because he was splicing in a variable holding something with a space in it, and it bothered me. I would get that little twitch every time I saw something that I knew was technically incorrect. But I was also like, why is he wrong? He's not wrong. A programming language should actually make what he wants to do right, instead of giving him a hard time and saying, no, no, no, don't do it that way, do it this other way that's much more difficult.

At the time I was developing Julia on the side. I was one of these crazy people who has their own programming language that nobody had ever heard of. And the way he used the backticks really inspired me to try to do this better.

So let me show you some examples. Let's take the first three lines of that file. So we have the command, and we actually just want to run that. Okay, so we get it. That's cool. Now let's say we wanted to cut out a couple of columns, just using Unix tools. We can do a pipeline. So there you go. So, one comma two, maybe three, four. Okay, so now we're using Unix tools to cut out a little tiny slice of this file.
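The head-then-cut pipeline just described can be sketched without involving a shell. This is a Python analogue of the idea rather than the Julia code from the demo; it assumes the Unix tools `head` and `cut` are on the PATH, and the temporary file is a hypothetical stand-in for the housing CSV.

```python
import os
import subprocess
import tempfile

# Hypothetical stand-in for the housing CSV used in the demo.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11,12\n")
    path = f.name

# First stage: the first three lines of the file.
head = subprocess.Popen(["head", "-3", path], stdout=subprocess.PIPE)

# Second stage: keep columns 1 and 2, reading from the first stage's output.
cut = subprocess.Popen(
    ["cut", "-d,", "-f1,2"],
    stdin=head.stdout,
    stdout=subprocess.PIPE,
    text=True,
)
head.stdout.close()  # so head sees a closed pipe if cut exits early

output, _ = cut.communicate()
head.wait()
os.unlink(path)
print(output)
```

Each stage is an argument list handed directly to the operating system, and the pipe between them is set up by the program itself rather than by a shell parsing a `|` character.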
So what's going on here is that it's constructing a pipeline between these different things, the same way that you can use the pipe character in Unix. And then run runs the resulting thing. You can see that if we don't run it, we just get an object that represents that pipeline, and then we can run it. Now, the reason I didn't want the backticks to run the thing is that a lot of the time what you want to do is something else. You might want to do something like: instead of printing the data, I want to capture it in a string. So the idea is that you have this object that represents the thing you want to do, and then you actually read from it.

Of course, you don't typically have a fixed file that you want to read. You have some file name that you want to splice in there. So let's say the file is housing.csv. I know that works. It does what we want. But, okay, let's say we move that to "temp file". Now, we can see in the terminal that if we cat this with the appropriate quoting, we do get the file. But the old name obviously isn't going to work; there's no file by that name anymore. So let's try changing this. This works. Why does this work? This is a really good question. This would totally break if we were doing it in Ruby or Perl. And the reason is that if you take a look at what happened to this object when we interpolated the file name, it quotes it for you.

What is going on? Is this magic? What's actually happening is this: the easiest way to think about this backtick syntax is that it's a very strange way of writing arrays of strings. It's actually syntax for array literals. If you take this command object, it has a field called exec, which is the array of things that is actually going to be passed to the exec system call.
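Why the file name with a space survives can be shown by comparing an argument array with a string that gets split on whitespace. This is a minimal Python sketch of the principle, not Julia's implementation; it assumes a Unix `cat` is available, and the file name mirrors the "temp file" from the demo.

```python
import os
import subprocess
import tempfile

# A file whose name contains a space, like the "temp file" in the demo.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "temp file.csv")
with open(path, "w") as f:
    f.write("x,y\n1,2\n")

# String-based approach: splitting on whitespace breaks the name in two.
naive_argv = ("cat " + path).split()
naive = subprocess.run(naive_argv, capture_output=True, text=True)

# Array-based approach: the name is one element, so it stays one argument.
argv = ["cat", path]
good = subprocess.run(argv, capture_output=True, text=True)

print(naive_argv)
print(naive.returncode)
print(good.stdout)

os.remove(path)
os.rmdir(tmpdir)
```

Because the command is an array from the start, there is never a single string that has to be split back apart, so nothing needs quoting or escaping.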
And what that means is that we're actually never calling a shell. We don't do the whole thing where you pass the command off to a shell which does all of the work, and actually you don't want to, because then you have to work around all the things that the shell treats as special. So instead the mantra here is: don't use a shell, be a shell. The idea is that Julia implements all of the things that your shell implements, with similar syntax, and that way it knows about things. For the file, it actually just takes the file value and puts it in there as a single argument.

I was talking about how this backtick syntax is really sort of a string array syntax. And that makes sense, right? The shell splits a command on spaces, and what you get is an array of arguments. That's exactly what the shell is all about: figuring out what the array of things to call exec on is. That's pretty much all the shell does. It calls exec and does a bunch of other system calls in a particular order, and it has handy syntax for that. So we do the same thing.

So let's say I want to define multiple files and have all of them in here. Well, if you just put files in, you get exactly what you want. It knows, because it's in the programming language, that this is an array. The reason the shell splits on spaces is that it had no array data structure. Shells do now, but back when shells were first invented they just had strings, and that was it. So you're like, okay, how do we simulate an array? Split on spaces. At first that solves the problem. But then you have arguments that have spaces in them, and it becomes a rabbit hole that you disappear down. Well, we're not in that situation. We're in a real programming language that has arrays and has strings, and strings can have spaces in them.
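Splicing a whole array of file names into a command, as described above, amounts to flattening values into one argument list. The following Python sketch illustrates that idea only; `build_argv` is a hypothetical helper written for this example, not a Julia or Python library function, and the file names are invented.

```python
def build_argv(program, *parts):
    """Flatten strings and lists into one argument list.

    A string contributes exactly one argument, spaces included;
    a list or tuple contributes one argument per element.
    """
    argv = [program]
    for part in parts:
        if isinstance(part, (list, tuple)):
            argv.extend(str(p) for p in part)
        else:
            argv.append(str(part))
    return argv


# Hypothetical file names, one of which contains a space.
files = ["housing.csv", "temp file.csv", "notes.txt"]

argv = build_argv("wc", "-l", files)
print(argv)
```

The resulting list can be passed straight to something like `subprocess.run(argv)`; at no point is it joined into a single string and split again.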
So why would we collapse something into a single string only to then have the problem of figuring out how to turn it back into an array? Instead we just take the array and pass it straight through, and that's exactly what happens here.

There are some nice ideas that the shell gives you. For example, you can write things like foo, bar, baz in the curly brace syntax and it will expand into three different versions. We do a very similar thing. If you take the single file and add an extension, it works as you would expect: it just appends .txt to that one argument. But with files it actually does that to each of them. So that's kind of nice. You can also expand with a little list of extensions here, so you can save yourself a bunch of typing. You get all the file combinations and they're correctly quoted.

This correct quoting thing seems minimal, but it's actually the difference between calling external programs being kind of fragile and unreliable, and it being a completely reliable and robust thing that you don't worry about. A good example of this: let's say my data is the string backslash r, backslash t, backslash n, backslash b, backslash a, backslash 01, backslash x01. So this is an awful, full-of-escape-characters bit of data. If you wanted to pass this as an argument to echo, for example, or some other command, reliably, in the shell, I don't even know how to do it. It's probably doable, but I don't know how. I've tried it before. Here, let's try it. Okay, there's some horrible string in there, as you can see. The key is that underneath it all, it actually just has that string. At no point has it tried to interpret anything. What that means is that as soon as we call readall on this (readall runs the command, collects all of the output and gives you a string), you get the exact same thing back. So is this equal to data?
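The round trip being checked here (an argument full of control characters goes to a child process and comes back unchanged) can be sketched as follows. This is a Python analogue, not the readall call from the demo; to avoid depending on how a particular `echo` behaves, the child is assumed to be a small Python one-liner that writes its first argument back out, and the data string is similar to, but not exactly, the one in the talk.

```python
import subprocess
import sys

# Data full of control characters: CR, tab, newline, backspace, bell, 0x01.
data = "\r\t\n\b\a\x01"

# Hypothetical child program: write the first argument back out, byte for byte.
child_code = "import sys; sys.stdout.buffer.write(sys.argv[1].encode())"

# The data travels as one element of the argument array; nothing interprets it.
result = subprocess.run(
    [sys.executable, "-c", child_code, data],
    capture_output=True,
)

# Compare bytes, because text mode would translate the carriage return.
round_trip_ok = result.stdout == data.encode()
print(round_trip_ok)
```

No quoting or escaping appears anywhere, since the argument never passes through a shell that might assign meaning to its characters.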
Yeah, it's exactly data. And in fact, any crazy binary data you want to pass through this way will go through, with only one exception. The one exception is this: let's put a null in there. This actually used to just fail, but now we're more careful about how we pass things to C. What happens here is that it detects that the string has an embedded null. If you tried to pass that to the spawned process, it couldn't work: when you write a C program, you get the arguments as null-terminated strings, so the process couldn't possibly get any data after the null. So now we actually catch that and tell you about it. You can't do that, but at least you get a proper error rather than just unexpected data.

So the example I wanted to show was this. What this does is start the sort command in write mode, which says: I would like to write data to the sort command. This do syntax is a way of passing a complex function as an argument, here a function that takes io. Then, in Julia code, I'm going to print a random permutation of the numbers 1 through 10, and that's going to go to the sort command. Well, first let's print to the cat command, so we can see that it just comes out in exactly that random order. But if we sort them, they come out in a particular order. It's not quite the order we wanted, because sort sorts things as strings, so 10 comes right after 1. But we can also tell sort to do numerical sorting, and then it's in the right order. We can reverse it, and that also works. So anyway, the bottom line is just that Julia is good at this sort of thing.

I feel like I should show writing a very simple web server. So Mux is a web server framework. It's pretty basic. Here is something I ran which sets up a test app using Mux that does some very simple things. It sets up the root to just respond with Hello World. It sets up an about page, which responds with either About Me or Boo. And then it sets up a bunch of routes to
users, where it just says hello and then your name, and it gives a 404 otherwise. And then we can just see what this thing is: test is a Mux app, and there's some anonymous function, which is what it actually does. I think you serve it like this. Okay, so it's listening on localhost.

I can't make the URL bar much bigger, but you can see that it's thinking, and it says Hello World. You can also see that it ran some code and printed a warning; it's silent otherwise. We can go to the about page. Then we go to a user (I almost spelled it wrong) and you can see user 123. We don't do any validation here. This can be a string; it doesn't have to be a number. There's no database behind it. It's easy to do this.

I would say at this point, if you want to write a full-blown web app, we do not have the infrastructure for that. You're better off using something like Django and maybe calling Julia to do the compute part. But I think that might not be the case for too much longer. Putting it together should be possible, and I think it probably doesn't ever need to be as complicated as what we know.

Let's try actually making a request. That's the other thing, obviously: we have a web server, and we can do a client as well. So we can do a GET request to localhost on port 8000. We get back this object. It's kind of opaque; it doesn't tell us much. But if we look at the data, it's got some data which is just bytes. That's not very helpful, but we can always wrap some bytes in a UTF-8 string object and then they'll print as a string, and we can see that it's Hello World, which is what we're serving from the web app.

So, something more interesting: I'm going to try making a number of requests to the app. I was hoping to show some sort of scaling, but it's not going to work very well locally. If you
time this, and then do, for example, ten of them, we can see that that is about 10 times slower than doing one. It's not going to be exactly that much slower, but roughly.

But you can, for example, use async I/O. The way I/O works in Julia is this: what we want is the usability of writing blocking I/O without the problems. So essentially we give you a blocking API, but every time you make a blocking call, it actually just tells the scheduler: make this task stop and not do anything until it has something to do. And then we allow you to have many tasks. So, for example, if you're writing a web server and you want to serve lots of people without one connection blocking or failing to serve all the others, all you do is create a task for each of those connections and let each of them do its own independent thing.

There are pretty easy macros for working with this sort of thing. The sync macro makes sure that a bunch of different tasks that are all executing are waited for and joined back at the end. And the async macro says: do this thing asynchronously in its own task. So what this is doing now is spawning a new task for each request. They don't block each other, and they don't have to happen in sequence as they did before. And then, before the for loop exits, it waits for all of them to finish.

So that is a bit faster. It's not a big difference because this is on localhost. When I was doing this against a real remote machine, I got a much better speedup, where it was like 10 times faster in parallel than in sequence.

If anybody is familiar with them, these tasks are also known as coroutines. This is the same model that other languages have. Go has a pretty similar model, though it actually takes it one step further and has threads behind this. We're actually going to move to that model in the future. We'll work on
experimental threading as a feature in the language. So let's take some questions.

One quick question on the previous example. On the previous screen you had a colon and then the variable name. Is that part of the language, or is it part of that particular package?

You mean in... no, not that. Oh yes, colon user, colon params. So parts of it are and parts of it aren't. This colon user here is just part of the string; that's just normal text. This here is an expression interpolated into a string, so between these parentheses this is actual Julia code. And colon params is a symbol. In Ruby, symbols are like this too. So params is a symbol; you can see the type of it is Symbol. It's sort of conventional to use symbols as keys. It's very much like Ruby, if you're familiar with that.

So if I have multiple concurrent tasks, are they running concurrently?

If you call them tasks, not threads specifically, you can express that they're concurrent, but they won't be parallel. They won't run at the same time. Only one will be running at any particular time.

But how does that compare to a thread on Linux, a C thread?

C threads can actually run at the same time, which means you have to worry about deadlocks, race conditions and various other issues. But that is what we're working on: having thread support. The way Go sort of deals with the fact that you might have race conditions is this: you can actually write to the same mutable memory, but they encourage you to communicate between threads via channels. There is a channel abstraction. That's what you do now for concurrent tasks too: you can communicate between tasks with channels. You can actually also communicate across different machines with channels, which
is something Go does not do.

So once you move to the threading model, would concurrent tasks be lighter than threads, and would that be faster?

The model that has, from everybody's experience, been shown to be the best one for getting the best performance out of threading is that you have only about as many kernel threads as you have physical cores. There's a little bit of swapping, but if you have 10,000 kernel threads and you only have 22 cores, there's just a lot of overhead and a lot of time spent swapping, and the kernel gets very slow. So what you'd like to do is express your work in terms of small tasks, units of concurrency, and then let the scheduler decide what thread to execute them on. It has a pool of threads that is approximately on the order of the number of physical cores you have, and that's how you get the best performance. That's exactly what Go does. It's generally the best way.

So you're saying that the tasks are not parallel yet, because all the Julia code is executing in one thread?

Yeah, we're definitely not that scalable yet. It's a work in progress.

How many tasks can you have? Say you've got a million tasks?

If you try making a million tasks, it works okay. Tasks don't cost that much to create. Should I write some code that does this? Yeah, I think so. Sweet, that's a good idea. So let's try 10,000. Okay, so 10,000 tasks only took one second. So let's go for a million. Let's do it. Not terrible. I mean, it did it; it worked. This is actually a pretty terrible laptop of mine. It takes like 10% of my battery to recompile Julia and run all the tests. It's a MacBook Air or something. Yeah, it's pretty slow.

Any other questions?
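The task experiments from the demo and the last answer (one task per request instead of requests in sequence, then spawning many thousands of tasks) can be sketched with coroutines in Python's asyncio, which follows a comparable single-threaded cooperative model. This is only an analogue of Julia's sync and async macros; `fake_request` is a hypothetical stand-in for a network request, simulated with a short sleep.

```python
import asyncio
import time


async def fake_request(i, delay=0.01):
    """Hypothetical stand-in for a blocking network request."""
    await asyncio.sleep(delay)  # yields to the scheduler while "waiting"
    return i


async def sequential(n):
    # One request at a time: each one waits for the previous to finish.
    return [await fake_request(i) for i in range(n)]


async def concurrent(n):
    # One task per request; gather waits for all of them before returning.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def timed(coro_fn, n):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


seq_result, seq_time = timed(sequential, 20)
con_result, con_time = timed(concurrent, 20)
many_result, many_time = timed(concurrent, 10_000)

print(f"sequential: {seq_time:.3f}s, concurrent: {con_time:.3f}s")
print(f"10,000 tasks: {many_time:.3f}s")
```

Only one task runs at any instant, so the speedup comes entirely from overlapping the waiting, which is why it shows up for I/O-bound work such as requests and not for pure computation.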