Video equipment rental costs paid for by PeepCode Screencasts.

Well, this is scientific computing with Ruby and the GSA, which is what I told them I'd talk about, and then I decided to name the GSA something that fits better in the Google space. Tegu: you're not going to find anything except a lizard, or me. So that's the reason I changed it. A tegu is just a lizard out of South America, and a friend of mine actually did that logo for me, so we have a logo too.

Anyway, this is me. I'm David Richards. I wrote Tegu. I've been writing software for about a dozen years or so. I decided I was unhappy, so I went back to school again. This time I'm studying something called systems science, which is systems: we do math, computers, a lot of machine learning, and try to understand complex systems. It's a PhD program, and I get to hang out with cool people and smart people, so I like it, and they teach me a lot.

A friend of mine said I should use this: "In God we trust; all others must bring data." And I've got a lot of reason to figure out my data, make it useful, and integrate what I'm doing with what's going on in the outside world. Of course, I want to use Ruby, and I found a lot of great bindings, a lot of great tools. I was having fun with some projects, and I decided, well, what did you call it last night? It's the F word. I'll write a framework.

So what I'm building is a large workbench for complex systems. It's generic in general, and it should adjust to what you're doing. If you need data, if you need to think about data, if you are working in a production environment, or you have a one-off solution you're looking for, this might be a place to go to look for a way to work on things.

So for instance, if you were doing the Netflix competition and you want that million dollars, how would you approach it? Maybe you've thought about it, maybe you haven't. You sit down, and it's a complex problem.
We've been working on it for a while, or people have been working on it for a while. I'm registered and I've got the data, and I haven't done a thing with it yet. But it's a complex problem. The idea with Netflix is they said, we'll pay a million dollars to anybody that can improve our recommendation system by 10%. They're recommending to people what movies to rent, so this is basically a value-add for the customers: you go to Netflix instead of Borders because they understand me. So to them it was worth the million dollars. They had rewritten their engine and got a 10% improvement over their last engine, and they decided to go ahead and open that up to the community.

So a lot of people have signed up to try to work on that problem. It's very complex. The winning team right now is two computer scientists and a statistician out of Bell Labs, and at last count they had a hundred and seven models that they were bringing together. There are four or five different approaches they're taking in their modeling, but then they're combining them in interesting ways so they can get better performance.

But how would you do that in Ruby, say? How would you combine the models? How would you keep track of what you're working on? How would you do it without writing one-off scripts all the time?
And so Tegu was invented to, hopefully, work with large databases, in the terabytes and above; to be able to do complex analysis; and to be integrated into a real infrastructure. In other words, you don't have to do all your transformations before you get to work. You can just get to work and then do your transformations in Ruby. And then hopefully, in human time, before you retire, you can get your complex analysis complete with the resources on hand. That's kind of the general framework of what we'd like to work on.

Some more concrete examples of things my colleagues and I have been working on: there's a lady in Portland who's doing genomics. She's studying diabetes, trying to figure out the genetic causes for diabetes and parts of diabetes. She had some very large data sets, and she had come to our program. She did some postdoc work just to study a specific method that would help her simplify her math models. We took a class just on that; it's kind of a neat little thing. She finished the whole class and said, this is great, but our data set is way too large for the libraries available. And so I said, well, I've been doing this other thing, and she gave me some of her requirements, and that's been worked in. Somehow she's going to be able to work with her data set.

So the problem space is this: we need to be flexible. We need to deal with cost, or at least be aware of cost, given the resources we have. We need to integrate with other things, and then scale, hopefully, to the size of your problem.

And there are some great solutions out there. Does anybody know about the R language? Maybe everybody. R is a language for statistics. It's great. I love it. I write a lot of things, some things.
I've got some code in R, and I prefer R for it. It does a great job. Some of the best statisticians in the world are writing for R, so if your problem is statistical in nature, most likely the best minds have already solved it, it's going to be there, you can include the libraries, and it's great.

Matlab, Mathematica, Maple, Octave: some incredible solutions. They're very flexible. Most of them have languages integrated. They work well, and they scale. Sometimes cost is a bit of a problem, in that some of these are commercial. A lot of your engineering labs will use Matlab as the default. Mathematica is a lot of fun. It was written by Wolfram. I was going to say Wolfram; I knew that wasn't right. Yeah, he's the guy that wrote, I think the book's called Automata. It's a great big fat book, and he talks about how cellular automata work. So, neat guy, and he did Mathematica. Yes, yep, that one as well. So he's behind Mathematica, and it's an excellent, excellent resource.

So there's that. I like Weka. Maybe Weka fits the problem space better, at least the problem spaces I'm thinking about. With JRuby you can just bring Weka right in and you're good. As long as they have what you want, that's a great solution.

Mikhail Baryshnikov: a lot of people say he's the best dancer, and he says, I really reject that kind of comparison that says, oh, he's the best, this is the second best. There's no such thing. And I'm thinking, there is no such thing.
This is a very complex space to work in. So: what do I need? And then we'll see if you guys need that, or would want that, too.

My basic idea is that we work in an ecosystem of data and ideas. A lot of things are coming from a lot of directions. I wanted a framework that gives me a workflow where I can bring in basically any algorithm and work on it without too much refactoring, or any refactoring, and any library. If I can bind it to Ruby, or find a way to get it to talk to what I'm doing, I'd like to use it if it's better. I don't want to reinvent any wheels. And hopefully we're at that point; at least in my noodling on the cluster at home, I'm able to get to a lot of pretty neat things with Tegu.

So that's that. And then to bring it down into some sort of environment: what I'm looking at right now is that Rinda will definitely be there, and Hadoop is the other one, by November, is the idea. Rinda, you guys heard about it here. It's Ruby-centric. It's the Ruby version of Linda, which came out of Yale, which is your parallel processing: a great, simple, direct approach to doing things.

Hadoop is a MapReduce environment. Google came out in 2005 and said, this is how we're managing all of our major data problems. We define a map function that will do something like count the lines in a file. Then we partition the problem out into a 2,000-node cluster and run everything in parallel. And then the reduce function says, this is how I combine the work that happened. And the thing about the MapReduce framework is that it's straightforward and simple.
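The map, partition, and reduce steps just described can be sketched in plain Ruby. This is only an illustration of the idea, not Hadoop and not Tegu code: the method names (`map_count_lines`, `partition`, `reduce_counts`, `line_count`) are made up for this example, threads stand in for cluster nodes, and strings stand in for files.

```ruby
# Map step: count the lines in one chunk of text (a stand-in for one file).
def map_count_lines(chunk)
  chunk.lines.count
end

# Partition step: split the chunks into roughly equal groups, one per worker.
def partition(chunks, workers)
  return [] if chunks.empty?
  chunks.each_slice((chunks.size / workers.to_f).ceil).to_a
end

# Reduce step: combine the partial counts that come back from the workers.
def reduce_counts(partials)
  partials.sum
end

# Threads play the role of cluster nodes here; this is an assumption made
# for the sketch, since a real cluster would ship the work to other machines.
def line_count(chunks, workers: 4)
  threads = partition(chunks, workers).map do |group|
    Thread.new { group.sum { |chunk| map_count_lines(chunk) } }
  end
  reduce_counts(threads.map(&:value))
end

total = line_count(["a\nb\n", "c\n", "d\ne\nf\n"], workers: 2)
puts total
```

The point of the shape is that the map and reduce functions know nothing about how the work was split up, which is what lets the same two functions run on one laptop or on many nodes.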
You don't really have to have a background in distributed computing or in functional programming to understand how to do something basic, at least. But the problem with a MapReduce framework is that a lot of the older libraries aren't thinking in those terms. You don't have a map and a reduce; it's not broken up that way, so you're going to have to do it. And some of the libraries are really slick. You don't have to think much. You just send in your data, make sure the parameters are set, and bring it back out. So it would be a bit of an effort to do some of that work in that way. It kind of depends on your data set and your problem space what you're going to use. I think I've architected things in a way that we can go in other directions, too, and that's kind of exciting. That comes back to the beginning: hopefully, with the resources you have, we can fit something in there and do something.

I should say, maybe it's obvious to everybody, you would assume this is an MIT license. This is open source. This isn't anything commercial. So when I say "we," it's as in, I'm willing to help. I'm willing to give you time and work with you guys. Not as in, I'm trying to sell you my services.

Anyway, the other exciting thing: I've been following the Hadoop list, and there are some ideas that came out of there, and for about six months I've been thinking about how best to bring this onto EC2 and the Amazon Web Services and other things. The idea is that I'd like to have that solid by about November. By about November my life gets really busy, so things have to be finished.

But a lot of what you deal with once you're set up is just your workflow. I think this offers a lot of flexibility, and it's a generic way to deal with your problem. You start with a job, or what I'm calling a job. It's a class that can handle a job.
That's what it is, and it's an algorithm. And you'll write a directive. We'll see examples in a minute here. You'll write a directive and basically say, here's, well, we'll go into that in a second; that's basically the last slides. Here's the directive. You'll pass it off to the workload manager. The workload manager says, I don't know what you're talking about. And you can just work in the console, or you can write up scripts, or whatever. He'll come back and say, I don't know what you're talking about, give me more. And the idea is, let me go to the next slide for a second. This is maybe one of the better ideas here. I'll come back to that last slide. I apologize; I'm a little bit fuzzy in my thinking here.

But the thinking on the ontology is this: there is no one best algorithm for anything, and you don't know them all. If you're doing analysis, you're doing analysis with what you know, what you're comfortable with. So again, if you're doing some large problem, say the Netflix competition, you're going to go to what you know, which is what everybody did. And without collaboration on what the better algorithms are, or how this could work, or how this compares to a different algorithm, there is no learning taking place except for: did my code work, and can I come up with a better idea?

But that's not the idea behind Tegu. The idea behind Tegu is, if you like it and you're using it, and you have an algorithm that's new, or that came out of the academic literature, or something that you think could be useful that isn't in the libraries we have so far, submit it. It'll go to Tegu Hub. That's me. I have the domain and the server ready, and there's work to be done. But then I'll keep a central repository.
I'll test your code. I'll run it against some standard data sets and publish the benchmarks where the wiki is going to be, so you can have a good idea of what a lot of different algorithms can do. And you don't necessarily have to start in the academic literature to understand them. You can read up in normal layperson's terms: this is what it does. Then if you need more information, you can either get citations, or Google it and go from there, or just run it and see how well it's doing on your data.

But you don't do this manually, either. The workload manager goes out and says: am I up to date on all my job signatures? Do I know all the ways I could solve this? Yes, I am. Okay, good. Now let's look at what he gave me. Maybe I asked for something and gave it the specific name of the algorithm. That's easy. He basically says, I've got the name. Can I run this data against that name? No. Do I have a transformation algorithm that'll go from this data to that algorithm? Yes. It'll just basically tree out and figure out: can I solve the problem, yes or no?

Maybe I don't know which specific piece of code I want to run. Maybe all I know is that the method is something like an artificial neural network, or simulated annealing, or some sort of other thing. Or maybe the function is all you know. You just want to say, I need this data classified, or I need to predict this, or I need to search this. There are just a few general functions that you can do, and then it'll suggest methods.

Also data types. And it's weird, because we're bringing this out of the data mining world, where they're not talking in terms of a programming language.
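One way to picture that lookup, by specific name, by method, or by general function, is a small catalogue of job signatures that gets narrowed by whatever you happen to know. Everything in this sketch is hypothetical: `JobSignature`, `Ontology`, the field names, and the catalogue entries are invented for illustration and are not taken from Tegu.

```ruby
# A hypothetical job signature: what an algorithm is called, which family of
# methods it belongs to, what general function it serves, and what kind of
# data it expects.
JobSignature = Struct.new(:name, :method_family, :function, :data_type,
                          keyword_init: true)

class Ontology
  def initialize(signatures)
    @signatures = signatures
  end

  # Narrow the catalogue by any mix of fields the caller knows about.
  # An empty criteria hash returns every signature.
  def candidates(criteria = {})
    @signatures.select do |signature|
      criteria.all? { |field, wanted| signature[field] == wanted }
    end
  end
end

# Illustrative entries only.
CATALOGUE = Ontology.new([
  JobSignature.new(name: "stochastic_hill_climber", method_family: :hill_climbing,
                   function: :search, data_type: :tabular),
  JobSignature.new(name: "hill_climber", method_family: :hill_climbing,
                   function: :search, data_type: :tabular),
  JobSignature.new(name: "neural_network", method_family: :artificial_neural_network,
                   function: :classify, data_type: :tabular)
])

# All I know is the general function:
puts CATALOGUE.candidates(function: :search).map(&:name).inspect
# I know the exact name:
puts CATALOGUE.candidates(name: "neural_network").map(&:function).inspect
```

A fuller version would also chain in transformation jobs when the data type does not match, which is the "tree out" step from the talk; this sketch stops at the simple filter.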
They're talking in terms of a general idea. So it's a little weird in Ruby to talk about data types, because we think about strings and floats and integers. But what they're talking about is: it's tabular data, or it's a graph, or it's a state machine, something like that. Something that you wouldn't expect an algorithm to be able to handle just as it is; it's coming in some specialized format. But we can define whatever we want on these.

You come down to the algorithm point, and then you basically say, all right, go, and here are any constraints you might need to run this algorithm. A lot of them need some extra parameters. It could take other models you developed in Tegu. It could take hard numbers. You could write an equation right there in Ruby if you needed it. You can kind of work with it and play with it. It's kind of malleable.

And then finally, you'll know at least the benchmarks, and the workload manager knows the benchmarks. Let me see if I have a slide next. I don't. What it does is it says: what am I optimizing for? Your default is that you're going to optimize for popular, at least that's what I'm going to do. Whoever's using these algorithms most, use those. But you can optimize for server time, calendar time, trusted sources. I'm trying to think which other ones I've set up. So you can optimize for various things. If you've got, say, ten dollars you want to spend on Amazon this afternoon, you can constrain it to that. What is that, ten cents an hour? You set up how many server hours you want to spend, and it'll do whatever it can within that constraint.

There's also an idea in there where it says: you've got your execution time to give me an answer. I'm impatient.
I'll wait two minutes, or I'll wait 20 minutes, or I'll wait two years, but I've got only so much time I'm willing to wait to get an answer back. But then I have a post-execution time too, and that's exciting, because the workload manager is actually a temporal difference algorithm. He's going out to all the different states, and he's basically saying: the best way that I know how to give you your work is this way, but if I had these plugins set up on all my nodes, I could have given you a better answer, a more accurate answer, or whatever. So in the post-execution time it comes out and says, well, then I'm going to go do this stuff. If you give me a budget of 20 minutes, I'll go spend that 20 minutes, or two hours, doing the work that should have been done, that would have made this a better run. So it's kind of ambitious and big, but it's fun.

Well, I said I was going to come back. Let me come back here and make sure I'm explaining how things work, how your lives would work if you're using Tegu. So you come in, and here's the workload. It'll generate a model for you, which will have a result set and data. And then you can either just take the data, that's the answer, you've got the answers, go, or you take the model. Actually, I did that wrong.
You get a model and you get a result set. I did this on very little sleep. A model and a result set are the two things that come out of it. You can either run with the model and reuse it in production or on other data, say you start with your training data and you move into a new, untested set of data, or you just take the actual numbers, and that's all you need. So anyway, that's kind of the idea. The plugins I've explained.

So the idea of the whole thing is that we're trying to learn here, as individuals and as a community. Tegu, the workload manager, is going to learn. We're trying to make a change in the system that produces a more or less permanent change in its capacity for adapting to its environment. Herbert Simon wrote that. He's one of the big boys in the systems science world. He's a smart guy. But if we think about that: the workload manager knows better once you run a job. It knows better how to run the job next time. Once you've worked with some data, you know better how to work with it the next time.

And a lot of your learning is actually going to be tied back to the lab book. We're not just dumping a bunch of things in logs. I'm keeping track of: this is the run, this is how long it took, this is what you did, this is what you didn't do that fit your set, this is the information on the wiki at this moment. I store it right now transactionally; I'm just keeping track of the data. It's going dimensional. Basically, it's a data warehouse where you're going to be able to say, show me my errors, and just work on those, or show me the things I didn't do, and you'll be able to parse it and look through it in a console-like environment. This winter, and I'm going to get to that in a minute, this winter
I'm going to work on a module called Human Elements. Hopefully the algorithm core will be solid and used by different people, and people that are technically comfortable will be using that to get solutions. The Human Elements core will be a GUI front end written in Flex. There are about five or six important tools that will basically take this to the academic world and to the business world without needing to write Ruby, in the case that we have what they need. And part of Human Elements will be the lab book. You'll have a good graphical approach to it. What's exciting about that, I think, is that basically I've got too much data in my space. I have too much to know, and I need to be able to focus in and laser in on just those things that affect my results, that affect what I would publish or recommend to people. So anyway, that's what the lab book's all about.

So remember, we're working with a large workbench for complex systems, and I'm trying to make this as easy as possible to actually get to some good answers.

This is for me to breathe. Every time I practiced this I got pretty excited. I think I'm tired at this point, but I get really excited about this; this is what I'm doing. And I got too excited. So this won't be a real demonstration. I'll just give you some slides of some code and some ideas, and then I'll get you some resources if you want to play around.

So the first thing: the idea was we'll think about the traveling salesman problem. That's just the basic idea: if you were a salesman and had ten cities to visit, how do you travel to each city without repeating yourself? And what's the shortest route?
Okay, and it's kind of a common problem. You're trying to solve this if you're running trucks, or a lot of other things that you might need to do that with. And so for this little demo, we're just going to compare a couple of algorithms and how this might work in the console.

So you might just write a quick Distance class, say, just a very simple thing that'll add up an array of numbers. There's nothing special about that, except that it abstracted out your addition. And then your directive might look something like this. You'd go to the console and type: directive do, budget do. You don't need all these things, but here are some of the features that are working. Directive do: here's a budget I'm going to give you. I'm going to name it, just so I can remember it, adjust it, template it, and work with it. So here's a name, and the server time is two minutes. So I'm going to work at a fast pace. I'm going to run something, and two minutes later I need an answer to know if I'm going in a good direction. If I'm going in a good direction, I might give that more time, depending on my data or whatever.

And then I'll set up a job. I'll say, well, the job is, say, the actual name is stochastic hill climber. And you might be able to see now what I was talking about, where I was saying you could just say in here:
Well, the function is to classify, or the function is a search, or whatever, and it'll go out and figure out all kinds of things. It won't necessarily be able to run them, but it'll tell you, these are the things you can work with. And then, because this is evaluating the block, I can actually just pass the class to the stochastic hill climber, saying the value for this algorithm is Distance dot run. And then the data: I just set it up with a location, and I gave it a type of CSV.

There's a data reference in Tegu that keeps track of the data. It has a reader, a writer, and a partitioner that are defaults, or you can override them. So notice you didn't have to sit here and write out, and it would be simple in this case, just read a CSV file, parse through it, and throw it into your algorithm. But see, that's where you would usually spend your time, trying to fit your ideas to this library. Instead, what we're saying is: this library knows what it wants, and we know our data. Do we already have the pieces in place? Can we reuse them? And I think we can a lot of the time, at least with the data I've been seeing, with the same readers, the same writers, the same ideas. It's coming out of databases. It's coming out of flat files.
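To make the console session described above a little more concrete, here is a rough sketch of how a block-based directive with budget, job, and data sections could be built in Ruby. It is not Tegu's actual API, which the talk only shows on slides: `Section`, `Budget`, `Job`, `DataRef`, `Directive`, the field names, and the two-minute default are all assumptions made for illustration. Only the `Distance` class (add up an array of numbers) follows the talk directly.

```ruby
# From the talk: a very simple class that adds up an array of numbers.
class Distance
  def run(legs)
    legs.sum
  end
end

# Hypothetical base for one section of a directive. Each declared field
# becomes a method that stores a value when given one and returns it otherwise.
class Section
  def self.fields(*names)
    names.each do |field|
      define_method(field) do |*args|
        @values[field] = args.first unless args.empty?
        @values[field]
      end
    end
  end

  def initialize(&block)
    @values = {}
    instance_eval(&block) if block
  end

  def to_h
    @values.dup
  end
end

class Budget < Section
  fields :name, :server_time # server_time in seconds (an assumption)
end

class Job < Section
  fields :name, :value
end

class DataRef < Section
  fields :location, :type
end

class Directive
  attr_reader :sections

  def initialize(&block)
    # Assumed default budget of two minutes; anything in the block overrides it.
    @sections = { budget: { server_time: 120 } }
    instance_eval(&block) if block
  end

  { budget: Budget, job: Job, data: DataRef }.each do |key, section_class|
    define_method(key) do |&block|
      @sections[key] = (@sections[key] || {}).merge(section_class.new(&block).to_h)
    end
  end
end

def directive(&block)
  Directive.new(&block)
end

quick = directive do
  budget do
    name "quick look"
    server_time 120
  end
  job do
    name "stochastic_hill_climber"
    value Distance.new.method(:run) # how a candidate route gets scored
  end
  data do
    location "cities.csv" # hypothetical file name
    type :csv
  end
end

puts quick.sections[:job][:value].call([3, 4, 5])
```

Because each section is evaluated as a block, a later directive can reuse everything and override just one part, which matches the "same directive, different job" workflow in the talk.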
It's coming off the internet in standard kinds of ways, and I can reuse my code.

And then the job, I think I went through all that. So I might run that, I might like the answer, and I might say, well, there was this other algorithm. The first one was a stochastic hill climber. I just want to try the general hill climber, just to compare the two. And so you just do the same thing: directive, like last time. It'll take all the defaults, except that I override the job. I say, well, here's the name, and here's the value, a way to compare. And what it's going to do is take all your data elements and add them up, and it'll say, it'll take me a hundred miles to go see these ten cities. The next time it would run and say, well, okay, I could do it in 72 miles. And then it's comparing, using this other algorithm. It's non-greedy. It's not looking at every possible combination. These algorithms are using simulated annealing to basically say, I'm going to jump around and let some variability help me find something that isn't just a local maximum or a local minimum. Local maxima, local minima. You can tell I'm nervous still.

So that's what I can do. You would just say that, and then you might come back and say, well, I kind of liked that last directive, but let me adjust the last directive and work with it, but give it 30 minutes instead of two minutes. So the idea is that you go to the console and just play, some back and forth.

This is the point where I was going to come into my console and start showing you guys, look how it does this, and here it does that. But like I said, I got nervous. I broke something. Anyway, you can see it later if you'd like to.

So what's next?
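For readers who want to see the comparison from the demo in code, here is a small self-contained sketch of a plain hill climber next to a stochastic variant on a toy traveling salesman instance. It is not the code from the talk. The city coordinates, method names, fixed random seed, and step count are assumptions, and the stochastic version here simply tries random swaps and keeps the ones that shorten the tour; it never accepts a worse tour, so it is not simulated annealing.

```ruby
# Hypothetical cities on a plane; distances are straight-line.
CITIES = { a: [0, 0], b: [0, 3], c: [4, 3], d: [4, 0], e: [2, 5] }.freeze

DIST = CITIES.keys.to_h do |from|
  [from, CITIES.keys.to_h do |to|
    [to, Math.hypot(CITIES[from][0] - CITIES[to][0], CITIES[from][1] - CITIES[to][1])]
  end]
end

# Total length of a closed tour: visit every city once, then return home.
def tour_length(tour, dist)
  tour.each_index.sum { |i| dist[tour[i]][tour[(i + 1) % tour.size]] }
end

# Every tour reachable by swapping two cities.
def neighbours(tour)
  (0...tour.size).to_a.combination(2).map do |i, j|
    swapped = tour.dup
    swapped[i], swapped[j] = swapped[j], swapped[i]
    swapped
  end
end

# Plain hill climber: always move to the best neighbour, and stop as soon as
# no neighbour is shorter (which may be only a local minimum).
def hill_climb(tour, dist)
  loop do
    best = neighbours(tour).min_by { |candidate| tour_length(candidate, dist) }
    return tour if best.nil? || tour_length(best, dist) >= tour_length(tour, dist)
    tour = best
  end
end

# Stochastic variant: try a random swap each step and keep it only if the
# tour gets shorter. The seed is fixed so runs are repeatable.
def stochastic_hill_climb(tour, dist, rng: Random.new(1), steps: 1_000)
  steps.times do
    i = rng.rand(tour.size)
    j = rng.rand(tour.size)
    candidate = tour.dup
    candidate[i], candidate[j] = candidate[j], candidate[i]
    tour = candidate if tour_length(candidate, dist) < tour_length(tour, dist)
  end
  tour
end

start = CITIES.keys
puts "start:      #{tour_length(start, DIST).round(2)}"
puts "plain:      #{tour_length(hill_climb(start, DIST), DIST).round(2)}"
puts "stochastic: #{tour_length(stochastic_hill_climb(start, DIST), DIST).round(2)}"
```

Both climbers can only stay level or improve on the starting tour, so the printed lengths are a fair first comparison; which one does better depends on the instance and, for the stochastic one, on the seed.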
I've got to put my repository up. The repository is on this box still. I'll put it up there, just so you can talk and share.

The libraries that I have worked with, at least at a noodling level, that I'm integrating right now, and these aren't things I've bound yet: there's RSRuby. Excellent. You could basically take your R code, bring it right into Ruby, and you're good. You create an R instance. You can pass it the code the way it is, or you can interact with it in different ways, and it's pretty neat. You can do some pretty powerful things very fast. It exceeds my understanding of statistics, all the things I can do in that thing very, very quickly.

Octave is a GNU version of Matlab, and it's got everything under the sun too, and there are bindings to Octave. The GSL is the GNU Scientific Library, and that is a very, very powerful, large library. FANN is the Fast Artificial Neural Network library, a C-based library that has Ruby bindings. Very nice.

Swarm doesn't have bindings yet. That's one of my other projects coming up. Swarm is an agent-based simulation program written by the Santa Fe Institute, and it's a C-based program. I think it was C-based, pretty sure. I've been looking at it and playing with it outside of Tegu.

But the basic idea is, and I haven't demonstrated this well: if you come up with a model by any means, that model will encapsulate a way to repeat itself. That's basically all it is. It's going to be a job and all of the parameters you needed to get those results. It keeps those, and then you can swap out a parameter in your model. So you can take a model and basically say, I ran this swarm here, but let's change the environment a little bit.
Let's rerun it, or let's change from training data to some new data and see what it does. Or I can take the output of the swarm, say some sort of a state space, and run some statistics on the model. I take my swarm model and run some statistics on it to bring it down to, say, a median or mean value, or maybe do a Student's t-test and just compare what I was expecting with how good my model is. And then just keep combining models and ideas. So anyway, that's Swarm.

FFTW is the Fastest Fourier Transform in the West. So if you're doing continuous data, there are a lot of really neat things. This one's written in OCaml, and I was committed to bringing it in, but I can't remember if there are bindings for that or not.

So basically, once I've got the architecture good, I'm going to be spending a good amount of time binding and working with new libraries. And I've actually been learning OCaml just for this purpose, just because it's a functional language. It's a lot of fun, in that I can write things a lot like I see them in mathematics, and then I'll bring it down, bind it in Ruby, and bring it into this environment. So it runs cleanly. It's quick. It's brief. It's concise in its representation, and it's very useful in environments far, far removed from OCaml.

Hadoop and Rinda are experiments. They're good. We're having fun with it. I'll probably end up putting a lot of the deployment stuff into JRuby if you're going to go with Hadoop, and I've got to do some tests on that.

The way I'm doing a lot of the stuff to make this make sense in an environment: there's a great gem. Maybe you guys have heard of it.
It's called AutomateIt. It works like a server automation tool, like Puppet, or, I'm trying to think of other ones you might have heard of. But basically, you'll write a recipe of how to install, say, something from source, and you can keep track of all the extra things, and you can keep track of how to do it in different environments. And you're not dependent: you can either use, say, a Debian package or whatever, or you can roll your own. So it gives you a lot of ability to set up the environment.

With Hadoop, that's a big deal. There's a lot of configuration of things, but a standard configuration can just be delivered. Basically, for me to be able to say, from my development environment, I don't have an image set up in EC2 that can do this, but I want to run this thing, go install this library on 10 nodes, 20 nodes, and have it written down and reusable. You can sandbox it. So that's what I've been messing with on Hadoop. I need to get it to that level where basically it says, just do it, and then you'll have something. And then if you guys need to tweak it, you can take that and just override the configuration to say, I want to run my nodes like this.

Lots and lots and lots of algorithms we need. Yes, Jake? Oh, yeah, you weren't paying attention. Hadoop is the MapReduce I was talking about. Yeah, thanks. He's a friend. I'm actually rooming with him, giving him a hard time. We're here for the conference. Yeah, Hadoop is a MapReduce environment. It's written in Java. It's an Apache project, and it is well adopted, accepted by a lot of people. Yahoo has a 2,000-node cluster running. Actually, that picture of those computers earlier?
That was Yahoo running on Hadoop. So it's definitely not a toy. It'll handle what you need. It's probably overkill for most of what I need, at least today. But yeah, that's what Hadoop is, and that's why the commitment and all this time spent getting that piece right.

So: need lots of algorithms. I've got a bunch. I've got a bunch of libraries, but it's about bringing them all into jobs. Is that good enough? Do you guys have questions? Okay.

Bringing in lots of algorithms, and then the Human Elements that I mentioned earlier. I'll be doing that, or anybody else that likes that kind of thing, feel free. I did commit; I think I'll definitely go with Flex on that. I don't like JavaScript. I don't like GUI work in general, but I know I need it. So if anybody really likes writing that kind of thing: I'll probably write it with Flex with the Cairngorm micro-framework, if you guys have heard of that stuff, which isn't a big deal if you don't know Cairngorm. It's called Cairngorm. It's just basically an MVC on the front end. It's got seven layers. It's a seven-layer bean dip thing. So Cairngorm: C-A-I-R-N-G-O-R-M. It's a mountain in Scotland. So it was just the cleanest way
I thought I could produce, and keep clean, a GUI that other people could use. So anyway, if anybody's interested in that, and I am flexible if people have better ideas than mine. That's kind of a big deal: throw it out there, see what better ideas you have, make it available, and adjust it.

Under consideration: I'm also looking at different ways to do messaging and queues for the clustering, possibly even RDF. It's kind of an enterprisey thing to do, but a lot of the ontology could be formalized using other tools. In other words, I'm thinking about this as, I want to come in here and do some different kinds of work, but there are other approaches you could take. If the ontology of all the algorithms that we can combine is big enough, and we really get into this and have a fun time and make a great scientific community out of this, or a great machine learning community out of this, RDF might be a way to get to the information quicker. But we'll see.

This is brought to you by a lot of people smarter than me. I've combined a lot of things. I've written a lot of classes and figured out, I think, how I can safely give you what you want without asking you to deal with the details differently every time. So: Units is just a cool little gem. RSRuby binds R. I went with DataMapper, and I'm not sorry at all. I really like that framework. I need to persist a lot of things that we do here. There are a lot of details I haven't shown you that I need to keep, just so that if I need to go back and retrace my steps, if I've got an error or whatever, or I need to defend it, say, in my dissertation when I get to that point, I know what I did. We use God and Vlad. I think I'll go with Vlad, and maybe Capistrano. I was working on that actually a little bit last night.
I got tired. RGL, Octave, ruby-fann, others.
Resources: right now the one to go to is the Google group, Groups.google.com slash group slash to. What I'll do is I'll announce there when I think the code is good enough that you should play with it, probably this weekend. You can just get it on Git when I give it to you. What I want to do is at least go through, fix what I broke today, and go through all my to-dos and just make sure none of them are critical anymore. And it'll be alpha for a while, 0.1. It's big. It's an ambitious project, but I think it's justifiable. I kept beating myself up about how monolithic this is becoming, but I think the idea is that if you can feel comfortable that you understood what you did, and that you could use new ideas that you didn't sit down with, then it justifies a monolithic framework. You don't have to know the framework to use it, and I'm trying to increase the usability quite a bit. So anyway, that's the idea. I will tell you on the Google group when I'm ready to go.
IRC: I've asked for the Tigu dash anything. We'll see if they give it to me on Freenode. I'm sure they will. I set that up on my computer so whenever I boot my computer, that opens up. Obviously, nobody knows about it but you guys; this is brand new stuff. So you guys know, and my girlfriend knows, and my dog knows, right? And that's it. But the reason it was important for me to do IRC is that I'm a big proponent of pair programming. If you're having a hard time: I spend two or four hours every day on this, and so I can commit some of those to working with you, training. We can go in and build the features together, so you feel like it's comfortable. It doesn't have to be a lone-wolf problem.
Tigu hub is where everything's going to go, and then Git; I'm using GitHub for that. So I think that's it. Thank you. Any questions? How well did I do at that?
Did I explain anything? Make any sense? I've got a few blank stares and a few excited eyes. Yes. Yes, there are, actually. I'll tell you some of the exciting... okay. The lab book is a good thing. In the lab book there's going to be a lot of data on purpose, and even a lot of duplicate data, just so you can slice and dice on what you did and what you didn't do, and justify your analysis. So that's one part of the human elements. But there's this other really exciting thing; I didn't even talk about this.
Okay, once upon a time, World War I, we're trying to figure out how to fight. They bring Thomas Edison into the War Department back then, and they say... Am I doing good on time, guys? Do you guys want to start signaling me? I'll shut up. It really felt like three hours up here. So they bring him into the War Department and say, how are we going to fight this war? He says, well, this is what I do in my lab, and he helped out a lot. World War II, we forget all of that, and RAND Corp., the Air Force has got the same problem: how are we going to fight this war? The RAND Corporation evolves out of that, and they say, this is a methodology for studying complex problems, and there are ways of sharing information.
You guys have heard of brainstorming, which is kind of cool but chaotic: write down an idea, right, and somehow you're going to come up with a better way to do things. Well, there are actually better ways to get ideas out of domain experts. And so the human elements is big on that. There's the Delphi system, the Delphi method. There's something called ICM. These are basically ways of gathering ideas about how a domain looks and feels. So if you're studying something, say you're studying the weather, and you know a little bit about this part of the weather, well, you could bring 10 or 15 people into a room and basically say: what do you know?
What do you think? And in human terms, in domain language, in the way that you would talk about it, you're going to be able to gather the nodes, the elements of the system. And then at that point you start playing with dynamics: well, this touches this, and this touches that, and this touches that. Okay, does it really? Run it against the data. Make these into models, build a model for each one of these, and convene again in a week. Let somebody go and play with the data and say, you know, it really doesn't fit that idea, or it really fits that well, but look, we've got a bottleneck here. And you can start to visualize and compare and bring people into a problem in different, new ways.
I've mostly talked about the analysis core, because that's what I've been working on, making sure we have a core where we can get the work done. But the idea is, not everybody likes programming; not everybody gets excited about algorithms. So the human elements is... yeah, I'm really excited about that stuff. I'm absolutely committed to making it good. But this has got to be really good first.
So, other questions, thoughts, concerns, complaints? Okay.
I think this weekend, once I go through all the to-dos, I'll feel comfortable with saying, yeah, go for it, it's not going to break your system. Yeah, you can run quite a bit with it. The other thing I wanted to do for sandboxing this: there are some data sets I can make available, so that you don't have to come up with your own data. You can just say, well, huh, what would this do against that? And you can just start playing with it and feeling comfortable with it. Give me ideas on how to make it more natural. I think the class structure is defensible. I'm sure I can improve it, but I think it will do the job. But the DSL, I've got a DSL there, and, you know, how natural is it?
We need to iterate through that and make sure that it basically, as you're thinking about a problem... let's give it some air. So I'm going to go to Tigu hub. I'm not on at the moment, just because I shut everything down so I wouldn't break the computer. I was a little nervous about that. About last week I had to reload my operating system, for anybody who's ever had to do this on a Mac, and that got me a little nervous right before I came out here. So everything's off, just in case I've got something funky. I don't write, I load all kinds of funky libraries on this thing, just seeing, what would this do, could I use this with Tigu? Some of the stuff's old code, and I think what I did is I brought something in that was just kind of nasty. Yeah, it's okay. It's not a well-formed presentation.
That would be very exciting. I've been having some conversations just this weekend about some of that. There's a guy, back in April, out of Stanford Medical, that is doing this in Ruby, and we're collaborating. He's got some good ideas and he wants to know my ideas, and he's excited to get together. So I think that's a great idea.
Well, actually, that slide, the ecosystem slide, was almost pushed on me. It's down in here. There's a guy I got to know here at the conference, and he was saying, from that perspective, we need to be thinking about this, try this, try that. I'm thinking, great. I'm just thinking, how do I get a good answer? So, yeah. Where's that?
Yeah, and basically, whatever environment you're working in, what can you do and share? At that point you would have to find a way to do that.
But I would absolutely like to get it back out. Exactly. That's, I think, very swappable. What you would end up doing, and I hope we do this in plugins as well, is you're going to write a reader for that, that's going to do some transform, or you're going to write an extract-transform-load, a full formal thing. And you can use that, and it's very swappable. Basically, you would write in the console "use this" and you name the class, and there it is. And then if you put that into a class, you could share it and people could get the same thing. Extract-transform-load is going to be a big deal. Once you're using your data, you're doing the same thing over and over, and it's kind of common per domain how it looks and feels.
I had looked a lot at, and I'm not quite sure I'm happy with my decision, I left the idea of: there's Active Resource and then Active Resource ETL, and that's an ActiveRecord-based gem, or two gems, that's been around for about two years, that does really well with what you're talking about. But the support and the community around that seems to have petered out. I've been following that mailing list for the last year and a half or so. And I like DataMapper. So I think we're going to end up writing our own readers and writers and sharing them. And that is something that the workflow manager is very aware of. You name a data type, something obscure, "CSV the way we do it in our domain", and you can assign a default reader, a default writer, and a default partitioner for it. And now it's in the namespace, it's in the ontology, and anybody can work with what you do, if somebody's like you. So, I'm serious. What?
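The idea above, naming a data type and giving it a default reader, writer, and partitioner, could be sketched roughly like this. It is only a minimal illustration: `DataTypeRegistry`, `SimpleCsvReader`, `SimpleCsvWriter`, and `ChunkPartitioner` are hypothetical names made up for this example, not Tigu's actual classes, and the comma splitting is deliberately naive (no quoting support).

```ruby
# Hypothetical registry: maps a data-type name to its default reader,
# writer, and partitioner classes. These names are assumptions, not Tigu's API.
class DataTypeRegistry
  Entry = Struct.new(:reader, :writer, :partitioner)

  def initialize
    @entries = {}
  end

  # Register a data type together with its default collaborators.
  def register(name, reader:, writer:, partitioner:)
    @entries[name.to_sym] = Entry.new(reader, writer, partitioner)
  end

  # Look up a data type; raises KeyError if it was never registered.
  def fetch(name)
    @entries.fetch(name.to_sym)
  end
end

# Reads simple comma-separated text into an array of hashes keyed by the header row.
class SimpleCsvReader
  def read(text)
    header, *lines = text.lines.map(&:chomp).reject(&:empty?)
    return [] if header.nil?
    keys = header.split(",")
    lines.map { |line| keys.zip(line.split(",")).to_h }
  end
end

# Writes an array of hashes back out as comma-separated text.
class SimpleCsvWriter
  def write(rows)
    return "" if rows.empty?
    lines = [rows.first.keys.join(",")] + rows.map { |row| row.values.join(",") }
    lines.join("\n") + "\n"
  end
end

# Splits rows into fixed-size chunks, for example one chunk per worker.
class ChunkPartitioner
  def initialize(size = 2)
    @size = size
  end

  def partition(rows)
    rows.each_slice(@size).to_a
  end
end

registry = DataTypeRegistry.new
registry.register(:domain_csv,
                  reader: SimpleCsvReader,
                  writer: SimpleCsvWriter,
                  partitioner: ChunkPartitioner)

entry  = registry.fetch(:domain_csv)
rows   = entry.reader.new.read("station,temp\nA,10\nB,12\nC,9\n")
chunks = entry.partitioner.new(2).partition(rows)
puts chunks.length
```

Because the registry only stores classes, swapping in a different reader for the same data type is a one-line change at registration time, which is roughly the "very swappable" property described in the talk.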
Definitely. What I do have, and it's just a monkey-patch hack at this point, is you can go to any class and ask it for its DSL, and it gives you a list. It helps me keep track, but it's going to be better than that. And the wiki is going to have to be that too. There's going to have to be some sort of community-driven documentation that says, this is an obscure algorithm, but it's really got a simple interface. Because what we're really talking about is using algorithms like Legos, right? Well, how crazy is that?
And the one buffer you do have is that, with the cluster I've got, if you send me your repository or your plugins, I'll publish them, but at least I'll run your unit tests first, and I'll run them against some other data sets. So there'll be some benchmarks. So you'll have some fuzzy idea: you've got a benchmark of how it ran against some kind of data that you might be familiar with. It might give the workflow manager and you an idea that this might be good. But that's not anything we'll ever get precise on, I don't think, unless somebody smarter than me says, this is how we do it. But definitely, at least it'll be available to be able to do this.
This all seems very interesting to me, but I don't really know much about
There is a book coming out for Hadoop; one of the guys in there is writing one. I think he said it was O'Reilly. It's going to be a, what do they call that with O'Reilly, the work-in-progress books? Rough Cuts. Yeah, it'll be a Rough Cuts book. You can watch it as it goes, and it's coming along, and he'll deal with a lot of that. There is an old O'Reilly book called High Performance Computing. I'm not as confident with all that.
Yeah, it's like 10 years old now. So there are some things there; you might want to at least see it at the library. I've found that Springer, the publishers, they're amazing. Well, what's the other part of the name? Yes. I just remember Springer because I can remember that. Yeah. Basically, you'll never go wrong with that stuff. And they've got a whole set. If you go to a university library, I would try around QA 295, 297. There are going to be huge sets of volumes from Springer, and anything you want to know about this stuff is there, basically. Not everything, but it's just pretty amazing. Look it up. I'm there a lot, just trying to find new ideas and learning what I can. I think I'm getting laughed at here. Am I just too nerdy, or too wrong? That's one of my shorthands when people say, "Systems science, what is that?" QA 295, go look it up. But 297 is where you start getting more into the data mining; it's down the way.
So anyway. Yeah, I've closed those libraries. I'm out of Portland, but I grew up in Salt Lake, and one of the bigger libraries there is at Brigham Young University. That's where I got my master's degree, or one of my... I'll get it. I have an MBA. So, related, but it didn't give me the meat I wanted, so I went back. So I go back there, and it's kind of funny, BYU, because they're big on rules.
They're really structured, very, very conservative. So when it's time to leave the library, you leave the library. I've been hanging out in that section a lot of times when they're closing it down, and they put on some really, really loud music and just get you out of there quick. This is the rule and you're going to have to do it. But anyway, yeah, hanging out in libraries.
Questions, thoughts, ideas? I wanted to ask... I wanted to make it more interactive, but I couldn't get over my nervousness. I'm a nervous guy in general.
Python. I haven't used Sage, but I know that Python is excellent at this stuff. They kick our butts in some ways. What's that? I'll have to look more into that, at least for ideas. One of the big libraries I'll be working with in Portland, they did everything in Python. I've been trying to convince the professor: let's switch to Ruby, because the guy that's been maintaining it is leaving and I'll be maintaining it, and let's do it right. So let's do it with Tigu. So right before the summer I said, this is this great thing, and it's going to be announced in Austin, and there will be support in the open source community, and it will reduce your workload by so much if we switch this. So hopefully he'll let me throw away his stuff, and I'll just use the C++ libraries and build from there. Python is excellent with this stuff.
I looked at Haskell, and I kind of decided that I wanted things my way, and I was a little timid.
There's so much I'm learning; I thought OCaml was enough for now. Haskell's really good at keeping, you know, some of this. Erlang too. And I spent probably three months making sure I wasn't being silly, and I probably am. There are a lot of qualitative decisions that were made, but I feel comfortable with where I'm at so far. And I've been asking a lot of people, and maybe people know. It would be cool if we need to extend this with some of these languages, or rip out major sections. How fun would that be? And how easy would that be? I'm pretty draconian sometimes: throw away a lot because I don't like it anymore, and make it perfect. Or better.
Anyway, I was going to say something; I forgot what it was. Yeah, I was going to ask who's excited about this, who would find that this would have something to do with their regular workflow or work life? Okay. If you guys want to get involved, or help me out, or tell me where I'm dumb, or pair program, or anything. I just kind of wanted to see faces, so that if we did happen to start collaborating, I've got a possibility of knowing who you are. I'll give you up to four hours of my day. I still have to make a living, and I still have to learn something at school and get passable grades. But if you want to just get in together and rub shoulders and write something, especially because there are going to be so many algorithms that maybe you're already using or you need to have.
It's pretty minimal to write up a job. The interface is: you've got run, map, and reduce, the three methods. If you're going to run in Hadoop or in a MapReduce, you've got to have map and reduce; otherwise just run it. And then you've got to define a signature, which will have any parameters and things you need. Those are the four things you need, so you can take whatever classes and objects you're already using, just subclass a Tigu job, give me those four things, and it'll run. But the signature is really, really important.
Well, they've got pipes. Well, they've got streams, which becomes pipes. Uh-huh. Yeah. There's probably a lot we need to look at. If I can't do it in JRuby... I wanted to do it in JRuby because I was getting frustrated. There's something called streams, which is a socket interface that you can bind to. So it's getting into your Java through streams, and then pipes is the binding, so I can bind to Ruby pipes and then just play in that interface. But what I'm hearing a lot is that there are some limitations with what they let you do through that interface. And so, naively, I wanted to consider JRuby. That's partly why. It's got to be done basically in the next month. I mean, you're really busy, but it's got to get done. But I'm naively hoping that with JRuby I might be able to just get the real, pure Hadoop.
There's two things. There's pipes, because, see, you're running the jobs through pipes, or pipes to streams to Hadoop. And then there's a distributed file system, HDFS, that they have, and you can configure that, and that's pretty nice. But that one I've heard is also limited, but they've got a binding to that as well.
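The four-part job interface described in the talk (run, map, reduce, plus a signature) might look something like the sketch below. This is an illustration under assumptions, not Tigu's real base class: the names `Job` and `MeanRating`, the choice to express the signature as a hash of parameter defaults, and the default `run` that wires `map` and `reduce` together in a single process are all made up for this example.

```ruby
# Hypothetical base class for a job; the real Tigu class may differ.
class Job
  # The signature: parameter name => default value. Subclasses override this.
  def self.signature
    {}
  end

  attr_reader :params

  def initialize(params = {})
    unknown = params.keys - self.class.signature.keys
    raise ArgumentError, "unknown parameters: #{unknown.inspect}" unless unknown.empty?
    @params = self.class.signature.merge(params)
  end

  # map: take one input record and return a list of [key, value] pairs.
  def map(_record)
    raise NotImplementedError, "#{self.class} must implement map"
  end

  # reduce: take a key and every value emitted for it, return one result.
  def reduce(_key, _values)
    raise NotImplementedError, "#{self.class} must implement reduce"
  end

  # run: single-process fallback that groups map output by key and
  # then calls reduce once per key. A subclass could override it entirely.
  def run(records)
    grouped = Hash.new { |hash, key| hash[key] = [] }
    records.each do |record|
      map(record).each { |key, value| grouped[key] << value }
    end
    grouped.map { |key, values| [key, reduce(key, values)] }.to_h
  end
end

# Example job: mean rating per movie, skipping ratings below a threshold.
class MeanRating < Job
  def self.signature
    { min_rating: 1 }
  end

  def map(record)
    return [] if record[:rating] < params[:min_rating]
    [[record[:movie], record[:rating]]]
  end

  def reduce(_key, values)
    values.sum.to_f / values.size
  end
end

ratings = [
  { movie: "A", rating: 4 },
  { movie: "A", rating: 2 },
  { movie: "B", rating: 5 }
]
p MeanRating.new.run(ratings)
p MeanRating.new(min_rating: 3).run(ratings)
```

Keeping `map` and `reduce` free of any shared state is what would let the same class be handed to a MapReduce backend later, while `run` keeps it usable locally; how Tigu actually dispatches to Hadoop is not shown here.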
So we can just go that route if we have to, stop at that point, and that might be simpler, easier, or we can take on something else. But there's definitely a way to get it done, and it shouldn't be more than a weekend to have an alpha something done on that. And see, that's going to be critical if you're going into the EC2 space, to have both Rinda and Hadoop available. Rinda, I think, basically because you can run a monolithic process of any type, whatever your libraries are, wherever you're coming from, just run it as a service and you're good. And then if you can get it down to a MapReduce, you can probably use your resources more efficiently. And then if we want to go with a different parallel system as well, I mean, there's probably a lot of other things outside of that. And if some of you guys are kind of locked into an approach, let me know. I think it shouldn't be too hard to take what we've got and at least find a way to get it done. So we'll see.
I don't know. I want to predict the next president. Just because that lends itself so well: you get the poll results, and you can build models and layer things and noodle about why the dynamics are going the way they are. And then you could ask things about not just how is this platform working with this population, but how is the negative... I mean, you could come up with models and ideas about, you know, keeping people from the polling places.
So I would love to do that. And I think I can at least have a more informed opinion than CNN would give me. Maybe; I'm hoping to get that far. Especially if it's a close one, that would be a lot of fun. And the other one is, I'd love to put my hand in the Netflix competition and see if I can do any good. If anybody else wants to do that, that might be a lot of fun. Yes, I got that. Yeah, so we're coming down to, basically, I've got to go through all my to-dos. We'll release this, if you guys want to play on your own or with me. But yeah, I would love to do that, because that's a very broad spectrum of algorithms and ideas and classifications and comparisons, and just to see how... let's take this thing for a drive. Netflix this month, the election next month. World domination, is that...? No.
Yeah, there's going to be, with the genetic stuff, I think I'll be right hand in hand with that, and I think this is solid. I was approached by a professor up there in Portland that said Wells Fargo needs some of the analysis we do at Portland State, and they're hiring us, and we might be able to get Tigu to work on that, at least with some of his algorithms. And I'm going to go back: I used to consult for another bank right before I went to school. I was consulting at a data warehouse group, and I'm going to hit them up, see if they'll hire me to do some things. Their guy has a master's degree in applied mathematics, and he's big on analysis, and he wants to do more and more of that. He loves small, agile, and that's the thing too about this: you can just do a small, agile little analysis without reinventing the whole world. So, yeah, maybe designs will want to do some interesting things, but where I'm going with that is, yeah, trying to break ground first. Yes. Oh, yeah, that's probably a great thing. Yeah, I'll be using the algorithms I need to use on that.
There's something called reconstructability analysis, which is just a way to cut down the problem space. It uses information theory to figure out what might be good models, and then I'll run my models. And I wanted, just for sanity's sake, if anybody's interested, non-monotonic reasoning, which is basically a non-probability-based reasoning system that says: this is typically that, and that is typically this. A bird typically flies, and penguins typically don't fly, and roasted birds typically don't fly, that kind of thing. So it's just a basic classification. But to be able to put that in there, because how fun is that, to be able to say, well, all my models are possible models; let me give them a non-monotonic connection, and then it can infer from there what other things it does or doesn't know, and help me manage the model space better. So that would be a fun thing to do. And I think it's just going to run; that, and probably dynamics and some of the other things, will just run like any other job. It happens to work with lots of models, but it's just another job, another thing to do. So those are some of the areas I think would apply well, especially something like the presidential election. I can sit down in English terms and say, here's a classification of how it might work, and then test and see.
So, yeah, lots of neural nets, lots of Bayes, lots of other classifiers, lots of things. Who knows? I don't know. Lots of frustration and Coke.
Other thoughts? I guess... are we good on time? You want to go? That's six minutes. I started 10 minutes early, so you got a four-minute bonus. All right. Thanks, you guys.
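The bird and penguin example of non-monotonic reasoning from the talk can be illustrated with a very small default-rule sketch. This is a toy under stated assumptions, not how Tigu or any particular reasoner does it: the `DefaultRules` class and its method names are invented here, and "most specific kind wins" is just one simple way to resolve conflicting defaults.

```ruby
# Toy default reasoning: each kind may state what is "typically" true,
# and the most specific kind that says anything about a property wins.
# DefaultRules is a hypothetical name used only for this illustration.
class DefaultRules
  def initialize
    @parents  = {} # kind => more general kind (or nil)
    @defaults = {} # [kind, property] => typical value
  end

  # Declare a kind, optionally as a special case of a more general kind.
  def kind(name, is_a: nil)
    @parents[name] = is_a
    self
  end

  # State that members of a kind typically have this property value.
  def typically(kind, property, value)
    @defaults[[kind, property]] = value
    self
  end

  # Walk from the specific kind toward the general one and return the
  # first default found, or nil when nothing is known.
  def infer(kind, property)
    current = kind
    while current
      key = [current, property]
      return @defaults[key] if @defaults.key?(key)
      current = @parents[current]
    end
    nil
  end
end

rules = DefaultRules.new
rules.kind(:bird)
     .kind(:robin, is_a: :bird)
     .kind(:penguin, is_a: :bird)
     .kind(:roasted_bird, is_a: :bird)
rules.typically(:bird, :flies, true)

# With only the general rule, a penguin is assumed to fly.
before = rules.infer(:penguin, :flies)

# Adding more specific knowledge retracts that earlier conclusion,
# which is the non-monotonic part.
rules.typically(:penguin, :flies, false)
     .typically(:roasted_bird, :flies, false)
after = rules.infer(:penguin, :flies)

puts "penguin flies? before: #{before}, after: #{after}"
puts "robin flies? #{rules.infer(:robin, :flies)}"
```

The same shape could in principle be applied to models instead of birds, with kinds standing for model families and properties for what is typically expected of them, though that mapping is speculation beyond what the talk describes.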