Welcome back to 098. This is Zane Gilmore, and that's Ben Warren; he'll talk later, and they're going to switch over partway through, which will be very exciting. Let's welcome them.

I'm Zane Gilmore, as you've just heard, and Ben Warren will be up here later on. Today we're going to be talking about how we use open source software in the science we do at Plant and Food. But first I'll talk about who Plant and Food is, what we do and why, and then we'll go through some examples.

We're a Crown Research Institute; the Crown is the Government. There are several others, and I think there might be one that doesn't quite call itself a Crown Research Institute. There's AgResearch, ESR, Scion, GNS Science, Landcare Research and NIWA. AgResearch does animal stuff, ESR does CSI for New Zealand, Scion is the forestry people, GNS are the geologists, Landcare Research are the ecologists, NIWA are the weather people, and we do plants and food. We had a revenue of about 120 million last year, which is small bickies next to the likes of Red Hat and Google, but we're a research institute, and we're supposed to earn half of our revenue from private work. There are about a thousand-odd people, two-thirds of which are actual scientists and research people, and about two full-time-equivalent programmers; I'm the only full-time programmer in the IT department. We've got sites around New Zealand, a couple in Australia, and one in Davis, California.

So what do we do? We do research on plants and food, which is fairly obvious. We do breeding for things like kiwifruit, apples, peas, potatoes and strawberries, and we make money from creating new cultivars of all these things. We also help farmers farm their crops more efficiently, and all the rest of it.
We try to find cures for diseases and insect problems, and we also do research on food and nutritional health, things like gut health: if you eat a kiwifruit, how does it improve your gut health? We do nutrient analysis, which I'll be talking about later, and we also help food manufacturers do their thing better. We also inherited the seafood and fishing stuff, so we've got a few people in Nelson, and just recently there was a big thing on TV about how these guys had developed a new fishing net which let all the big fish out and all the small ones out and just kept the middle-sized fish, the idea being to help preserve the fishery. So that was all kind of cool. There are other things we do as well; electrospinning is the one that springs to mind. Electrospinning is where they extrude nanofibres, a couple of nanometres across, onto a charged plate, and make acoustic tiles and things with it. But even then they're using proteins made from wheat, so it's still in our wheelhouse. So that's what PFR does.

It's a research institute doing biology, and we're striking some problems. We're finding that as the technology gets better and better for scientists to do science, the amount of data they produce is going up faster than Moore's Law. When the original Human Genome Project finished, about 2003, it had cost about $2.7 billion to sequence the first genome. Within about 11 years, it's come down to around $5,000 to sequence a human genome; the cost has just plummeted. And that grey line there is the Moore's Law line. So there's no way IT can keep up with the amount of data being produced by the new technologies that are coming up.
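To put rough numbers on that gap, here's a back-of-the-envelope comparison using the figures above and the usual two-year doubling period for Moore's Law:

```python
# Sequencing cost drop versus Moore's Law, roughly 2003 to 2014.
cost_2003 = 2.7e9    # first human genome, about US$2.7 billion
cost_2014 = 5_000.0  # about US$5,000 eleven years later
years = 11

fold_cheaper = cost_2003 / cost_2014
# Moore's Law: capacity per dollar doubles roughly every two years.
moore_fold = 2 ** (years / 2)

print(f"sequencing got {fold_cheaper:,.0f}x cheaper")               # prints: sequencing got 540,000x cheaper
print(f"Moore's Law alone would predict about {moore_fold:.0f}x")   # prints: Moore's Law alone would predict about 45x
```

So data-per-dollar outran the hardware curve by roughly four orders of magnitude over that decade, which is exactly the squeeze being described.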
So scientists are finding that they've got all this flash new equipment, and then they say, "I want to store it on the shared drives at work", and we don't have a hope. They're producing terabytes and terabytes, starting to hit large fractions of petabytes, and we just don't have the means to store the data they're producing. That's becoming a problem for us. But it's not just the storage of the data, it's also the processing. A soil scientist used to have some trial plots out in the field somewhere and go out every day to take readings from whatever they were doing. Now they can set up data loggers that take a reading every minute. So the amount of data they're producing has gone up by many orders of magnitude, and on top of the storage problems, it means they can't use Excel to process it any more.

Just recently I had a young scientist working on one of these data-logger projects. The equipment was producing a line of CSV data every minute, and this was around about November; it had been logging every minute since April, so there were about two or three hundred thousand lines of data in the file. She tried to open it in Excel and it brought her machine to its knees. So all of a sudden she had to learn a new approach: I went and put it all into a Postgres database, and she had to learn what SQL was for a start, and she also got a bit of help from some of the old fellas who could do R and Python and all that sort of thing. In the end they managed to get some good stuff going, but it was touch and go there for a while.
Open source is obviously helping there. R is a statistical programming language; people in this forum have sort of heard of it, and heard that it's a crappy language, but it's actually quite a good language for scientists doing statistics, and there are lots of good tools around R. Then there's Python and Perl; bioinformaticians love Perl, but Python is rising in prominence nowadays. So drinking from the fire hose is definitely one of the problems they're having: with the amount of data these guys are producing, they can't process it fast enough a lot of the time.

A big part of what makes science tick is what we call reproducible research. A scientific paper describes what a scientist did to produce a certain result, so that when somebody reads it they can say, "that looks like a really interesting result, let's see if I can do the same thing and reproduce it". And when you're putting huge amounts of data through lots and lots of complex manipulation, you need to keep track of exactly what you've done, and IPython is da bomb for this sort of thing; I've got scientists who just sing the praises of IPython to a huge degree. This is a screenshot from what we call the rain shelter, a trial where they've got a 20-by-30-metre plot with a great big roof they can roll over the top when it starts raining. They track soil water capacity, how much water is going on, and how long it takes to evaporate given this much sunlight, and then they're graphing it. These guys are only taking samples every quarter of an hour, but it's about ten streams of data every quarter of an hour. So yeah, IPython: a wonderful piece of technology which our people love.

Being able to open-source what we work on can sometimes be a problem, because we're primarily a biological research institute, and biologists as a general rule are not good computer people;
geneticists are; genetics and computers have gone together since the start of computing, and genetics can't happen without computers. Biology, not so much: an entomologist just needs a microscope and a piece of paper and they're away, and the same goes for a plant physiologist or a food scientist, so it's only recently that computers have even become useful to them in any significant way. And all the old fellas who control the chequebook at a research institute are these old biologists who never really used computers in their early careers. So when I go and talk them into spending a few hundred grand on developing a piece of software, and then say "now I want to GPL it", they say "you want to give it away?". There are battles I have to fight at every step of the way along there. However, they are scientists, and scientists do listen to data; being intelligent is sort of part of the job spec for a scientist, and they do respect data as a general rule. So we are making progress: I'll be telling you about an application I've been working on, and I've got another potential client for that particular application here in the room, so I'm hopefully going to be GPLing it soonish. The CRI funding model is a bit of a problem, because we only get about 50% of our funding from the government.

Nutritional analysis is a big part of my job: I look after the New Zealand Food Composition Database. We track over 2,600 foods, so if you want to know how much fructose there is in a kiwifruit, or how many polyunsaturated fatty acids there are in a steak, we do the analysis, and we publish the results at www.foodcomposition.co.nz; anybody in the world is allowed to download that particular data set. We've got good data: we track over 300 nutrients in foods in some of the data sets, and that's with about 120 different fatty acids, every one of them named, so there are all the
polyunsaturated fatty acids, the monounsaturated fatty acids, the omega-3 fatty acids, all of them, and about a dozen sugars; but there are also shortened data sets that are actually useful to most of us. So if anybody wants to write a really cool app with that data, go for it; it's quite a good data set. Trouble is that right now we're using SQL Server to store it. I've been given the okay to rebuild it, but we're still in the process of going through all that; everything happens over long time periods at a biological research institute. We produce these data sets for the Ministry of Health; they give us most of the money to do it, with the Food Safety Authority contributing to a certain extent. I'll be doing the rebuild on a Django and Postgres stack, hopefully in the next little while.

To give you an idea of the complexity of the system (because I'm looking for somebody to help me build it): there's an attribute calculator and a recipe calculator, the two calculation engines in the FCDB. The attribute calculator calculates attributes within a food. The example of that is how much energy a food has: it will look at over 100 other attributes to calculate how much energy there is in a food; all of the fatty acids, all of the sugars, organic acids, alcohol (you'd be amazed at how much energy there is in alcohol). You put in all your formulae for calculating things like energy, and things like PUFAs and MUFAs: a PUFA is a polyunsaturated fatty acid and a MUFA is a monounsaturated fatty acid. Then we've got the recipe calculator, because we've got foods in there which are defined only by their recipes, so we have to calculate what we call the nutrient line of a food from its constituent ingredients. And when you apply a cooking process to the recipe, it changes things; vitamin C, for example, is destroyed by heating, so
it gets real complicated doing all that stuff. They're called retention factors, and it's really expensive data to get; we thank the American FDA, I think it is, who produce a lot of that data. And then there are recipes of recipes. A good example of that is a meat pie: a meat pie has meat stew as a filling, surrounded by pastry, so we've got a recipe for the pastry and a recipe for the meat stew; you do your calculations for those parts, put them together, and then you bake it, and that's how we calculate how much vitamin C there is in a meat pie. As it happens, very little: there'd be a little bit from the onions in the meat stew, probably, but not much would survive, I wouldn't think. It gets really complicated really fast. Right now we do a full recalculation every night and it takes five hours; we should be able to do it faster, and that's why I'm rebuilding it, but we're looking for people who can actually do that complex stuff.

Moving right along; how much time have we got? We also do plant breeding, and here's a good example of the scale we do it at: there are about 5,000-odd little seedlings there, and in this particular case they're trying to breed a red-fleshed apple. Don't ask the obvious question, which is why, but that's what they're doing. These are all thousands of seedlings from different crosses, and they take one tiny little piece of leaf off each seedling, put them into test tubes, and then put them through a laboratory looking for particular genetic markers; in this particular case it's the genetic marker that produces the red flesh. People like Ben, who's going to talk to you later, sussed out which bit of the genome says it's going to be a red-fleshed apple, so they go and look for it, and any seedling that doesn't have it is killed. It means we don't have to bring hundreds, well, thousands of seedlings up to the point
where they can produce fruit. You can't do that with pen and paper, so we built a database application called Kea; it's not an acronym, it's just named after the bird. It does all that for the scientists: they can get their printouts, go out into the glasshouses and collect their leaf samples, go through the laboratory processes, and when they finally get their results back they can say, that seedling dies, that one lives. That's what we do a lot of: we do a lot of killing of plants, but we're very careful about which ones we kill. So that's Kea. I'm hoping to have it open-sourced. I'm told that if there's any New Zealand organisation that wants to use it, they can have access to it, but I would like to see it properly open-sourced, and hopefully that's going to happen in the near future. It's a Django application against Postgres, and we're running Elasticsearch indexes. We're getting up to over a quarter of a million samples in there, and the scale of it is starting to get a little bit silly, because there are just thousands of rows linking to thousands of rows linking to thousands of rows.

The other stuff we're working on: I was telling you about the rain shelter, and that's a picture of it there. It's about 20 metres across, and that's all barley plants and wheat plants and things like that. They've got sprinklers, where they measure exactly how much water goes on, and radiation meters measuring how much sunlight each plant is getting, and when it starts to rain, that thing there is actually on rails and just rolls over the whole top and keeps the rain off. It's an awe-inspiring sight, and when it was built in the 80s there was apparently a lot of money around. That thing there is a lysimeter: the same sort of thing but smaller scale, and those are half-buried buckets where they measure how much water they sprinkle on. They also have a mixture of artificial urine that they also
sprinkle on, to see what that does as well; so it's all very odd sometimes. But yeah: these guys are sampling every minute and these guys are sampling every quarter of an hour; the rain shelter has got 1,600 sensors sampling every quarter of an hour, and these guys have got 160 sensors sampling every minute, so the data is getting really big. They've also got an old mass spectrometer, and those things are spitting out ever larger amounts of data all the time as well. Basically I'm getting a request from a scientist every week for some sort of database to track what they're doing, and I spend a lot of my time saying no to scientists; it's a bit of a shame, really. But now it's time for Ben to tell you a little bit about genetic science, bioinformatics in fact.

Thank you, Zane. Yep, as Zane said, my name is Ben. I'm in the Plant and Food Research bioinformatics team, but I also do quite a bit of software engineering on the side. So no, we don't make half-dog, half-birds; maybe Zane thinks that's what I do (he made this slide), but thanks for the pretty pictures of DNA, which is what these actually are. Moving along, I'm going to talk a little bit about the genetic science and why we use FLOSS. In the bioinformatics team we do omics, and there are quite a few different species of omics; mainly we do genomics and transcriptomics, which are the study of DNA and RNA respectively. To give you a bit of background: DNA is a double-stranded molecule, the genetic code we all hear about in the papers. RNA is what's made from DNA: DNA is like a template, and RNA is an instance of that template. There are molecules in cells which know which bits of the DNA to take to make into RNA, and then there are other molecules which take that template and turn it into a protein, which is a functional unit that will actually do something. So we need to know which bits of the DNA make the bits we're interested in, like making the apples red. So one of the problems is we
need to assemble the genome first, because the sequencing technologies, the second-generation sequencing technologies at least, shear the genome up into many small pieces. An analogy would be: take a number of copies of the same textbook, possibly different editions, so maybe just slightly different, maybe quite different; cut them all into strips; put them in a pile like you see here; and reconstruct the original text. Now, if you've only got a small book, that's quite doable, but the human genome is three-point-something gigabases, from memory (I don't know the human genome very well), and plant genomes are often three- to six-fold larger than that. So this becomes quite a big computational problem: it's huge for the human genome, so it's huge for plants, and we need compute power.

So one of the main things we need software for is computation. We have OpenLava, which is an open source job scheduler; we use it to assign jobs to appropriate nodes and priority queues, and that allows us to utilise our cluster in the most efficient way. We also have PowerPlant, which is something we developed in-house, with most of the credit going to Eric, our systems administrator, who's in the audience. This is basically our compute cluster: a shared data store of around one petabyte, plus virtual compute nodes and physical compute nodes, some of which have up to two terabytes of memory available for those algorithms that need a lot of data held in memory to run efficiently.

We also need software for visualisation, and that's really for the scientists. Visual representations of data enhance our understanding of it, and we all see things in different ways and get new ideas. Ensembl allows us to visualise genomic data. It's written in Perl, it's a joint project from the Sanger Institute in the UK and EMBL-EBI in Europe, and it can incorporate user data very easily, which is important so that users can put their own data in without needing to come to the bioinformatics
team. And it's extendable and customisable, which means we can add our own plugins and our own analysis pipelines, and also share those back with the Ensembl community. So here's a screenshot of Ensembl showing us the wine grape genome at three different magnifications. At the top level we have the whole of chromosome 7 of wine grape; the little square here in red is represented in the next level, which shows the functional genes that have been identified in wine grape. Then there's the detail view: this small square here blows up to this, and here we have single changes in nucleotides, the A's, C's, G's and T's of DNA. So it gives us fine-grained control over the resolution of what we want to view, and it's easily shareable, which is really good for our scientists to collaborate.

We also need software for reproducible research; Zane mentioned this earlier, and reproducible research is necessary for science to work. We call a workflow a recipe describing how to get from your input data all the way to your results, with all the steps in between, including the versions of the software you used and, probably most importantly, why: your intent is just as important as what you did. A well-documented workflow allows the process to be reproduced exactly. That's important for transparency, because the scientific community wants to know how you did stuff; for verification, for the same reason; and for sanity, so you can remember what you did and how you did it, and if you have to do it again, knowing exactly what you did makes it easy. We use a program called Moa, which was developed by an ex-PFR employee, Mark Fiers. It allows extendable templates based on common workflows, and there's this quote from Mark: "Moa hopes to make the meticulous organisation of a command line project much less of a burden, leaving you to focus on the fun parts." We like the fun parts, so we like Moa. It integrates with Git, which is
awesome: we can version-control our workflows. It integrates with OpenLava, so we can send jobs to the cluster without having to worry about which nodes we're sending them to; it does that for us. Reproducible research is assisted even further by Git. We use Git to store our workflows, we use it to template workflows, and we branch them to make instances of a workflow for a particular project; that allows us to remember exactly what we did for every project we ran a workflow on, and to be able to go back to it. And GitHub allows us to share all this for collaboration.

We also need software for scientists. As Zane mentioned, some scientists are a little bit slower on the uptake of computing; they don't tend to like the command line, they like GUIs. So we use Galaxy, a GUI for command-line tools, which we provide for scientists on the PowerPlant application server. It gives you a history of everything you've done, which again helps with reproducibility; it allows you to construct workflows from your history and run those workflows on fresh data; it integrates with the job schedulers, so again we don't have to worry about where these things run; per-user management allows fair, controlled use of disk space; and it has extendable tool suites, which is probably the most important point, because we can write our own plugins and wrap any command-line tools or scripts we want into this GUI framework. That lets the scientists use them without having to plug away at the command line and accidentally do something they didn't intend to do.

As an example of Galaxy, we have here quality checking of one of these input data sets (I'm not sure which one it's showing). This is a box-and-whisker plot of what we call the quality of a read at every position, which is basically how well the software thinks the machine read the A's, C's, G's and T's: how accurate was that base call, or was it not. And the quality drops off quite badly at the end of these reads here,
which is actually typical for this type of sequencing. We also have here the history, so this is all the input data, these are the quality control checks, and I've done some trimming of the data (oops, pressed a button), and over here we have all the tools currently available in the Galaxy suite. So it makes things quite nice for scientists to use.

So, why FLOSS? There are quite a lot of reasons we like free/libre open source software in science and at Plant and Food. The first is openness: it's a similar philosophy to scientific research, or at least I believe it should be. Scientists like to share, they like to be open, they like to see what others are doing, and that enables the verification and the other things I talked about earlier. It's current: it keeps up with the scientific community, partly because the scientific community is the people writing the software half of the time, and partly because it doesn't take as long to get things going and get things updated as it does in the closed source world. Community, collaboration and knowledge sharing are a big part of science and also of open source, so that fits very well. It's flexible: we can adapt it to our own related, slightly different problems, because we have the source code; we can see what's being done, fix little bits up, switch things around to make it work how we want, and then push those changes or new functionality back to the community and see how they can make use of them as well. And I think the last point is the most important: trust. I've heard this a lot this week: don't trust what you can't read. Scientists certainly do not trust what they cannot read or understand; you can't give them a black box and expect them to use its results in a publication if they don't know how they came about. I was recently asked how DESeq works, which is an R package that calculates differential expression between genes. It was easy to find out:
the documentation was actually quite good, so that was a good start. It showed the statistical calculations done at every step, and we were able to verify that the results were what we thought they should be, and that it was using the correct statistical methods to get there. So this helps us evaluate our science as well. And that's why we love FLOSS at Plant and Food. Here are my references if you're interested in any of the software I've talked about today, or please just come and have a chat with me or Zane. So, are there any questions?

Hi Zane, have you considered using compression for the data logging? You're saying you're running out of space, but maybe compression is an option; I assume the data is reasonably easily compressible.

The data-logger stuff is very compressible, and that helps, but it just buys us a couple of years, really; the volume is going up exponentially, and compression will only get you so far, so it's only going to give us a couple of rounds. So yes, there are places where we are doing compression, of course. It gets a bit funky, though: you can compress a genome down real small, but when you're trying to do searches, BLASTs and all those strange things the genetic scientists do, they have to be able to read that compressed data, and then it just adds more time and more CPU use. At the moment, though, storage is much more expensive than CPU for us, so compression is still a valid technique.

Zane, you mentioned that you often have to say no to scientists. Is that because of not enough developer time, or is it because you've got insufficient storage or compute?

Sorry, can you repeat that please?

You said you often have to say no to the scientists' requests. Is that because you don't have enough time, or storage, or compute?

All of the above. There's only one of me, really; there's another couple of part-timers in the IT department who do a bit of development as well, but it just comes down to this: I can't clone myself, really.
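That compressibility claim is easy to demonstrate on logger-style output; the sample lines below are invented, but the repetitive structure is typical of data-logger CSV:

```python
import gzip

# A month of invented minute-interval logger lines: highly repetitive text.
lines = [f"2014-04-{day:02d}T{hour:02d}:{minute:02d}:00,probe_1,0.31\n"
         for day in range(1, 31) for hour in range(24) for minute in range(60)]
raw = "".join(lines).encode()

compressed = gzip.compress(raw)
print(f"{len(raw)} bytes -> {len(compressed)} bytes "
      f"({len(raw) / len(compressed):.0f}x smaller)")
```

As the answer says, though: a constant-factor saving only buys a couple of rounds against exponential growth, and compressed genomes still have to be decompressed (or use specialised indexes) before tools like BLAST can search them.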
And that's the problem: there's a lot of time and effort in producing another application. It's a real shame; it'd be nice to have a few minions.

If you were a plant, we could clone you.

Yeah.

In Australia we've got exactly the same problem, and I think it's pretty common to science in general. I'm really interested in any techniques you've found to actually convince scientists of the value of open source specifically, as opposed to just whatever tools you happen to be giving them.

I think a big chunk of it is metaphorically nailing their feet to the floor and then actually explaining it to them. The big thing, as I said before, is that the ethos of science is the data, so when you actually present them with the facts and explain everything from first principles, they usually get it. There are a few exceptions, but in my experience scientists usually do listen to reason, unlike a manager; scientists have to respect the data, otherwise they're not a blooming scientist. That's sort of what it comes down to, in my experience. You sit down, and often you do have to explain things from first principles; you can't assume the knowledge of most of us in this room, who know all about the arguments for FLOSS, and Moglen and Richard Stallman and all those people. All that knowledge is assumed among most of us, but an entomologist or a plant physiologist hasn't got a clue. Well, not always; I'm generalising and stereotyping madly here, but yeah.

You've got a lot of infrastructure on the premises. What's stopping you offloading this to NeSI or AWS for the storage side of things?

We're actually looking at options for working with NeSI in the future; I'm not actually sure of the details, that's all I really know. We do have to give NeSI money, though. Our main issue is really portability: moving data from us to NeSI and back and forth. The huge
amounts of data that need to be shuffled around are, from my point of view, probably the biggest issue between us and using, say, Amazon. Yeah, it's just the sheer volume of data; trying to move data from one place to another can be overcome, but to move a petabyte of data from one city to another, you'd probably need a forklift and an 18-wheeler truck; that's probably the fastest way to do it.

Are you from REANNZ? Right, cool, come and talk to us.

Over here. When you were talking about the difficulty of releasing the code you write as open source, it sounded like you're having to have the same argument over and over again with different managers. Is there not an institute-wide or ministry-wide policy on that sort of thing?

I'm glad you asked that question. No; different branches of science have different people who sign the cheques. But yes, as a general rule, once you've got a precedent... anyway.

I think we've just run out of time, so thank you very much everyone. On behalf of the team at LCA and all of the attendees, we'd like to present you with a small gift, just as a thank you; you can share it between the two of you. And I think that's all: another huge round of applause for these guys.