All right, in order to keep things roughly on schedule, we'll keep going. Our next speaker is Chris De Vries, if I've pronounced it correctly, talking about his experiences with the K-tree project, which was part of his work for his Masters here at QUT and has continued on into his PhD, is that correct? Yeah, that's correct. Excellent, excellent. So without further ado, there you go.

So I'm here today to talk about my experiences releasing my research as open source, and I'm going to try to convince anyone else doing research to do the same. I've been working on the K-tree project, which is a machine learning project; I started it in my Masters and I'm continuing on with that research. Firstly, I'll do a quick overview of the project and some of the technical details, then I'll talk about research and open source and why I think they're a good fit, and then about where the work is going in the future. There have been three contributors to the project so far: my supervisor, Associate Professor Shlomo Geva, who originally came up with the idea; me, continuing work on it; and Lance, who is somewhere in the audience here today. I'd also like to thank QUT for providing scholarships so I don't have to work too much while I work on this stuff. There's a little bit about me up here: I'm a PhD student here at QUT, I've worked in industry in different roles before, I've been using Linux for a few years, and you can find me on the internet at all those different places.

So the K-tree algorithm is a scalable approach to clustering data: you don't know what's in the data, and you want to find something interesting. The way you do this is by grouping similar objects together, so you usually optimize some measure of similarity. What's interesting about the K-tree is that it's fairly computationally efficient; it's inspired by the B+-tree.
These kinds of data structures are used in file systems and databases. The difference is that the K-tree is multi-dimensional, whereas a B+-tree is only one-dimensional. It uses the popular k-means algorithm to perform splits in the tree, because it's harder to say where the middle is with multi-dimensional data, and it effectively approximates the k-means algorithm by making lots of little local decisions. Where this has mainly been applied is document clustering, where the idea is to automatically find topics in documents. Each document is represented as a multi-dimensional vector: all you do is count the words in the document, and each term becomes a dimension. This is supported by the distributional hypothesis, which says that words used in similar contexts tend to have similar meanings. So if documents are close to each other in the space, they're probably talking about the same thing.

What we've also done with the K-tree, and found to be effective, is something called random projection. A fairly recent realization from machine learning and statistics is that high-dimensional data usually lies on a low-dimensional manifold. There might be a million dimensions in the data, but it can really be explained by as few as, say, a thousand dimensions. There are other, computationally expensive processes for doing this, such as singular value decomposition and various signal processing approaches, but what's nice about random projection is that you just generate a random matrix and multiply by it: it has linear complexity and projects the data down to a smaller number of dimensions that explain what's going on. In this case you'd have n documents with, say, a million terms, and you want to project that down to a thousand variables that explain what's going on. The result of multiplying by the random matrix still has each document represented as a row, but in the reduced-dimensionality space. So this is a figure of the K-tree here.
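As an aside, the document-vector and random projection steps just described can be sketched in a few lines of Python. This is a minimal illustration with numpy and toy dimensions, not the actual K-tree code; in practice you might project a million terms down to around a thousand dimensions.

```python
# Sketch of the representation described in the talk: count the words
# in each document (each term becomes a dimension), then randomly
# project down to a smaller number of dimensions. Illustrative only.
import numpy as np

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
]

# Term-document matrix: each document is a row, each term a column.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(docs), len(vocab)))
for row, d in enumerate(docs):
    for w in d.split():
        A[row, index[w]] += 1  # simple term counts

# Random projection: multiply by a random matrix R. Gaussian entries
# approximately preserve pairwise distances (Johnson-Lindenstrauss),
# and the whole operation is linear in the size of the data.
k = 4  # reduced dimensionality (tiny here; ~1000 in the talk's example)
rng = np.random.default_rng(0)
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(len(vocab), k))
reduced = A @ R  # each document is still a row, now with k columns

print(reduced.shape)  # (3, 4)
```

The random matrix needs no training and never looks at the data, which is why this scales so much better than singular value decomposition.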
The nodes are represented by the boxes, and each of the circles is one of the vectors that are indexed. The data points you insert into the tree are along the bottom, and the other data points summarize the data below them so you can find the original data. This is what happens when you cluster some data in 2D: in this diagram there are two normal distributions, and the one at the top right has a smaller standard deviation than the one at the bottom left. What you can see here are the lowest, most fine-grained clusters found by the algorithm. The points in the middle of those tiles are the cluster centers, and they represent all the other points you can't see on the diagram. So another way to look at what clustering does is that it summarizes data.

This research has been going on for quite a while now. Shlomo originally started it in the mid-90s and published a paper at a small conference in Sydney. He found there wasn't much interest in it then, and then I came along and thought it was pretty interesting. I started in my Masters, we started a project on SourceForge, and there was more interest in it this time, and I think that's because the source code was available; it wasn't just some abstract description of an idea. So I think if you're doing research, particularly because most research is funded by government organizations, releasing it so people can use it is the right thing to do. When you release it, it increases the value for yourself and for everybody else. It makes it easy to go from research to industry: if you release your library, a software engineer who may not fully understand what you're doing, but likes what it does, can download it and use it in their product, and I think that's a good thing. The other benefit is that if you have something that's usable, people are more likely to cite it. I've found this particularly true in machine learning.
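As an aside, the 2D picture just described, with cluster centres standing in for the points they summarise, can be reproduced with a plain k-means loop. This is a numpy sketch for illustration only; the actual K-tree makes many small local k-means decisions in a tree rather than one global clustering like this.

```python
# A plain k-means loop on two 2D Gaussian blobs, illustrating how
# cluster centres summarise the underlying points. Illustrative sketch;
# not the K-tree algorithm itself.
import numpy as np

rng = np.random.default_rng(42)
# Two normal distributions, as in the diagram: one tight, one spread out.
tight = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(200, 2))
wide = rng.normal(loc=[0.0, 0.0], scale=1.5, size=(200, 2))
X = np.vstack([tight, wide])

k = 2
# Deterministic initialisation for this sketch: one point from each blob.
centres = np.vstack([tight[0], wide[0]])
for _ in range(20):
    # Assign each point to its nearest centre...
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # ...then move each centre to the mean of its assigned points.
    centres = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
        for j in range(k)
    ])

# The two centres now summarise 400 points with just 2 vectors.
print(centres)
```

After convergence each centre sits near the mean of its blob, which is exactly the "clustering summarizes data" view: you keep the centres and can discard the individual points.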
If you look at libraries like SVMlight, they've got a lot of citations from creating a piece of software that was efficient, easy to use, and had some research aspects in there as well. The scientific community encourages sharing of data and results, and I think it's a good idea to do the same with software. There's a well-known blogger called Daniel Lemire who blogs about research; if you're a research student, his stuff is always interesting to read. He had five points to make on this topic, in response to people questioning him about how it would affect your competitiveness as a researcher. Obviously some people think you shouldn't release your code, because somebody else will come along, steal it, and take your idea. I agree with Daniel's point of view that that's just not the case; it actually helps you. His points were: sharing can't hurt the small fish, so if you're not a giant researcher who's known worldwide, people probably aren't going to steal your stuff, and if somebody does steal your work, it probably means you're doing really good work and you should be proud of yourself anyway. As I've said before, sharing your code makes you more convincing. Not every field has this luxury: in many fields you can't just upload your experiment and have somebody else run it, because it usually requires a human rather than a machine. So I think computing researchers should definitely take advantage of this good opportunity we have. The source code also helps spread your ideas: as I've said before, a paper can only describe so much, there are a lot of implementation details, and making them available makes it easy to see what's going on beyond the paper. And as I said earlier, it raises your profile in industry: a software engineer out there who wants to implement something doesn't have to reimplement what's in your paper, he can just download it and use it.
And the last point was that if you share your source code, it's likely to be better, because you're going to take pride in it knowing other people will look at it. There's a journal called the Journal of Machine Learning Research, which split off from the Springer journal Machine Learning. It's an open access journal, and I've seen it cited as one of the top ten journals in computer science for impact. They think releasing implementations of machine learning algorithms is important enough that you can get a paper into their journal just on an implementation. That goes along with their machine learning open source software project, and that's all listed at their website there at MIT.

So I'll talk about the impact this had on my project. My supervisor thought that nobody was interested in this work: he published a paper, and nobody published any continuing work. But once I released it as open source and people started downloading it, this changed his perception. So I think releasing your work as open source is a great idea, and it's just another way to get it out there. I've actually had reviews of papers come back where people said they liked the fact that there was open source software associated with the research; it increases the confidence of your peers when they review you. I've been using Google Analytics to track visits to the K-tree homepage, and people from all around the world have come and had a look, downloaded papers, downloaded the source: pretty much everywhere except Africa. Oh, there's someone in South Africa who's come. And yeah, not much in Canada; the one on the left there is probably Vancouver, and there's Toronto as well by the looks of it, up near New York. So every researcher who works in computing and writes software has this conflict: should I be writing papers or should I be writing software? I think there's a bit too much focus on writing papers.
Everybody has this publish-or-perish mentality, and I think good papers are supported by good software. As I said earlier, software can help improve your impact, and there's been a big change in research recently: people are focusing more on impact, and you'll see people quoting the acceptance rate of the journal they've published in. So I think releasing the software is important. I've heard that maybe some people don't want to release their software because they think it's messy or unusable. I don't think that really matters if it's a good idea and as long as it works; if it's not quite what they need, people can always reimplement it, and that's easier when there's some working software there already. Especially for continuing research, if somebody wants to build on top of what you've done, they can just take that and do something new; they don't have to reimplement your abstract description of what you've done.

So what would I do differently? If I was starting now instead of two years ago, I'd write and release code more often. I'd do more releases, and I'd implement stuff that may not even be worthy of publication and still release it as part of the project. I'd provide an end-to-end solution: at the moment my project has no parser, so even though we're focusing on the information retrieval side of this, you can't parse documents; you need to do that with something else. I'd like to make it easy to download and use. I haven't included any of the evaluation software, so if you've got a different algorithm you want to test against mine, you have to implement all of that yourself; I want to put that in there. And I'd like to make all my research entirely automated, so instead of reading the paper and seeing the results, you can download the software, run the experiment, and see the results yourself. Hopefully I'll be able to finish all of this by the end of my PhD. For future work in the project, we're going to do another release in Java.
We're going to do a future release in C++. I know it's not the nicest language, but it's useful for what we want to do, and we can expose it with a C interface so it's easy to bind to your favorite scripting languages. I want to implement other people's algorithms that are out there, to make a whole toolkit, so it's easy to do a comparison of different approaches. I want to integrate all of my existing research that I haven't put into the package yet, and integrate it with a search engine, so it has an actual use rather than just being some machine learning algorithm by itself. So that's the end of the talk. Thank you all for listening. Are there any questions?

Okay, just wait for the microphone to come over so we can get it on the recording. Thanks. I've also released some software as open source, and what I found is that the feedback you get from users really helps in improving it and also finding bugs throughout the code base. What was your experience?

The only bug that's been fixed is one I found myself. I have had other people use it, and it's been pretty good so far; there may be a few other edge cases I can fix there. I think at the moment there's a bit of a barrier to getting into my project because, as I said, you can't just point it at a bunch of documents and get it to cluster them. I think once I do that, I'll see more of that kind of feedback, because what it actually does I understand pretty well, and I don't think there are too many problems in there. But as it gets bigger, I think I'll see that.

I'm not sure, are you from Australia? Oh yeah. Yeah. One of the trends I've noticed here in Australia, particularly in universities, is that when students start, they end up having to sign an IP non-disclosure agreement basically at the start of their PhDs, particularly if they're in receipt of scholarships like the APAs and stuff like that.
And these agreements basically assign all of the IP that you as a student generate to the university for their exclusive use, which means that when you leave the university, you cannot take the code with you and you cannot release it to anyone. The university retains the rights to that code, and indeed anything you generate as part of your project, which kind of sucks. A colleague at the place where I work actually ran into precisely this problem: they couldn't take any code with them, so when they came to work for us they had to basically rewrite all their stuff again from memory. How did you get around this little problem? And do you have any suggestions about how people can, because it is becoming a massive problem?

When I initially started, I wasn't on a scholarship; I was on a funded position, so I didn't pay the HECS fees or whatever, so I hadn't signed an NDA then. I did sign one later, an IP agreement or whatever it is. But what I've actually found is that having an open source implementation has increased the interest from people in industry. We have had people approach us, and it hasn't stopped things from going forward.

So I completely agree that it's a great thing to do, and you're preaching to the converted here, really. I'm just wondering: this is a real problem that students are having now, that they have to sign these things, and a lot of students are on scholarships from the word go. They're forced to sign these IP assignment agreements before they start their research project, which on the face of it means they've got no way of releasing stuff as open source down the track if they want to go down that route. Do you have any feedback for the students in those positions, or any tactics they might be able to use?

I wouldn't encourage them to break the law, but I'd say just go ahead and do it.
I'm working at a university and I'm familiar with this problem. One of the best things you can actually do, if you get permission while you're under the NDA or the IP assignment or whatever it is, is to release code in a way that can't be retracted. So make sure, if you are under one of those agreements, that you get permission, release your code under the GPL, and just keep releasing it under the GPL; then at the end you can simply take your code from the GPL version and get around the assignment that way. Students can also challenge the default terms of those assignment agreements, so make sure, if you have the opportunity, to do that at the outset: you can say, look, I want to be able to release this for the public good down the track and work with it. Remember, of course, that with the GPL you can always assign your copyright to the university and release under the GPL under their copyright, which would probably meet the terms.

I just want to make a quick observation. Both universities I've been involved with in Perth have actually not had this problem: you keep copyright and are able to re-license, so it's not a universal thing, and I think it's probably as much an education problem as anything with the university hierarchies. If you can actually make the case, and it's not terribly hard to get it approved, you've just got to actually make the case.

Okay, well, it sounds a lot like the BSD license, because isn't that, you know, it's all under the whatever the Berkeley thing, and that's what they'll release. Okay, we'll make this the last question.

It's more of an observation. What you're doing here is part of a much wider concept: there's a guy called David Donoho at Stanford University who's been putting forward this idea of reproducible computing, which is basically making sure that people can get things reproducible down to the bit level. So it's making research more transparent and open.
So it's a great thing, and I just want to make sure you were aware of his work. No, I'm not aware of what he's been doing, so I'll take a look at his stuff for sure. Thanks. Okay, so thanks to Chris for his talk.