I've got, what, five minutes, so this is going to be pretty quick. I'm only going to talk about one of the data capture projects we've got running at Melbourne, because at the moment we've only got one running; a whole lot of others are about to start. So I thought the best thing to do is talk about the one we've been running for all of this year and focus on some of the things we've learned from that project. The title we gave to ANDS was Enhanced Metadata Capture for Sustainable Management, Sharing and Reuse of APN Histopathology Research Data. Now what does all of that mean? APN stands for the Australian Phenomics Network; I'll get to that shortly with my only slide for the day, which explains the context in which we're working. The main thing I want to do is describe the way we've gone about the project, the things we've learned through that process, and how we're going to use those learnings in the data capture projects we'll be rolling out over the next 12 to 18 months. I should mention that a major partner in this project has been VeRSI, and it's been important to have another view, another form of expertise and another set of skills to bring into a project like this. I'll keep an eye on the time. In 2007 Oliver Smithies, of the Department of Pathology and Laboratory Medicine at the University of North Carolina, won a Nobel Prize. His Nobel lecture, Turning Pages, which I recommend you look up, was the combination of 60 years in the lab and 130 dense, comprehensive laboratory notebooks which he kept religiously. The overheads for his lecture were all images of those laboratory notebooks, dense with data. The lesson from that is probably that he is the only person who could actually understand that data, but it is an incredibly valuable archival source, for him and for the people who carried on the work that led to his Nobel Prize. Part of his work over those 60 years was on gene targeting technologies that enable you to do specific things with specific genes, much of it working in animals and in cells. Around 1985 that was combined with the emerging embryonic stem cell capability, which had been developed independently and which meant you could grow mice, as an example, from single cells. When you brought gene targeting and embryonic stem cells together, it meant you could make animal models of human genetic diseases; that was around 1985, quite a long time ago. That work was published in 1987, and by 1988, through collaborations with others, there was a general approach for producing mice of any desired genotype. This was a fundamental breakthrough, and by 1991 there was the first mouse model of cystic fibrosis. Basically it means you can reliably take a specific strain of mouse, alter it, produce a mouse with a disease that mimics a human disease, and then run tests, drugs and other things against that mouse to see what sort of effect they have.
Now that's a whole area of medical research that has blossomed over the last 20 years, and essentially that research area is captured in the Australian Phenomics Network, the community that brings this sort of data together. At the University of Melbourne we have a service that is part of that phenomics and mouse-modelling network: the histopathology group. Basically, a researcher will come to them, from the people who provide mice with particular genetic alterations, and say: we've got this mouse, we've run these tests, we now need the mice killed, sectioned and analysed for their pathology. That's what the Histopathology Service does at Melbourne, and it's a service that has evolved over the last five to ten years. It had evolved in a way that was a bit manual, still working out exactly how the workflow processes should run, so there were bits of manual handling and bits of technology in use, but it was all fairly clunky. Within the framework that had emerged nationally, PODD, the Phenomics Ontology Driven Data project, which is a NeAT-funded activity, they were looking at how what was happening in the histopathology service at the University of Melbourne could feed the data about these mice into the national framework in a much more systematic way. The data capture project was a really good model, because we had an established workflow and data already being collected, including digital data plus slides, samples and other things. One minute? So we had to look at how we could bring productivity gains to that point. We've had a modeller and some technology building, and I suppose the key learning was that there are many variables in the model. The most important work we've been doing is modelling the true work environment and all the variables involved, so that whatever we build will be able to deal with all the variables we have seen in the past and can anticipate in the future, and so that the model we build of the data flow does not unnaturally constrain the work that has to be done; for example, the sorts of things that occur in double-blind trials. Anyway, we've made a lot of progress, but that's all I want to say. I'm directing the data capture projects at Monash University with the funding we've been given, and we've got eight sub-projects. What I'm hoping to do in this session is give you some background on the methodologies we adopted at Monash and a whirlwind tour of each of the eight sub-projects, so it's going to be pretty quick. Talking about methodologies, the things that are important to us at Monash are, first of all, that the solution has got to be researcher driven. Also agile software development methodologies: we're finding that the old waterfall technique just doesn't work, this idea of analysing up front, having a good idea of what they want, then building it and delivering at the end. You really have to build it as you go; researchers work in a very different way from standard businesses.
We also work within the research discipline's community: if they've already got something in their environment, and some of the products I'll talk about later are part of their environment, we have to work in with that. We can't go and build a separate, very different solution; it has to fit in with their current and emerging environment. Also collaboration with other institutions, getting the ideas we've got here at Monash out; we've already started to do that with people at VeRSI and the Synchrotron, and we'll talk about that in a minute. Down the bottom of the slide it's already starting to come up, but the basic fundamentals in each of our sub-projects are: first of all, as I said, work within the environment and select the appropriate data management solution, whatever that might be. One of the solutions we're working with is a product called OMERO as a data management solution; with the funding we've got, we really can't build a separate data management solution for each project. Then build the appropriate capture infrastructure and integrate it with the data management solution, build the reuse infrastructure and add that in, and finally link it up to ANDS. That's fundamental to each of the sub-projects I'll talk about now. Climate and weather: this is about modelling, actually simulating, rainfall in the urban area, and the reason this group wants to do it at the moment is that their particular output is to help with the design and adoption of stormwater harvesting, as you can see in the far diagram, which shows one example of stormwater harvesting. They've got huge simulations, some taking quite some time to run, and they want to put that data out and make it available to other researchers. Down the bottom you can see the emerging solution; NetCDF is coming out as a very important aspect (there's a sketch of the NetCDF pattern below). As I said, it's about working in with the researchers and what's important to them, and Simon, who's in the audience, and whom I invite you to talk to later if you're interested, has started to develop a solution in that area along with our senior business analyst Nigel Holgate. We'll talk about that more later. Ecosystems measurements: around Australia a number of towers are being built or established that collect measurements of water as well as carbon, and the idea is to monitor those over time. These have been going for quite some time; they're part of the OzFlux network, which is part of a broader flux network around the world. With funding that has come out of TERN, which is an NCRIS-funded initiative, the OzFlux project has pushed forward, but they're missing the data management solution. What we're helping them with is the capture side leading into a data management solution, which in their case is a number of different databases, and then making that available to others around the world through ANDS and other initiatives.
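Since NetCDF came up as the emerging format for the climate and weather sub-project, here is a minimal, hedged sketch of what writing one of those simulation outputs might look like in Python with the netCDF4 library. The file name, dimensions and variable names are illustrative assumptions, not details of the actual Monash solution.

```python
from netCDF4 import Dataset
import numpy as np

# Create a NetCDF file for a hypothetical urban rainfall simulation run.
ds = Dataset("rainfall_run_001.nc", "w", format="NETCDF4")
ds.createDimension("time", None)   # unlimited, grows as timesteps are written
ds.createDimension("y", 100)
ds.createDimension("x", 100)

rain = ds.createVariable("rainfall", "f4", ("time", "y", "x"), zlib=True)
rain.units = "mm/h"
rain.long_name = "simulated rainfall intensity"
ds.title = "Urban rainfall simulation (illustrative example)"

# Write one fabricated timestep of gridded rainfall.
rain[0, :, :] = np.random.gamma(2.0, 1.5, (100, 100))
ds.close()
```

The self-describing units and attributes are what make a format like this attractive for handing simulation runs to other researchers.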
Molecular biology. A lot of these projects, some of them, are actually quite progressed now; this one is doing quite well, and we're hoping the others will do as well as this one. It's an initiative where we've worked with VeRSI and people at the Australian Synchrotron, and we're continuing to work in that space. In the past our researchers would run an experiment at the Synchrotron: they'd take their large hard drive, go to the Synchrotron, then come back to Monash or whichever institution they were at, and access their data there. The problem is that these hard drives get used over and over again, so their reliability drops, and if you've grown a crystal, which can take quite a bit of time in itself, and then done the experiment, which doesn't take long now on the Synchrotron, it would be quite a major setback to lose that data. So this solution was turned on very early this year and is now spreading its wings to other institutions: we know La Trobe has an instance of it, and I think the University of Melbourne is either establishing one or almost there. Once again, we're trying to make the data capture side easy and the management side easy, building on the work we've done in MyTARDIS, which is a relatively generic solution. At the moment it's been applied to molecular biology, but it can spread to other areas, as we're finding with other groups such as RMIT. Multimedia collections and ARROW. We've got this large repository at Monash, the ARROW digital repository, which is quite capable of storing and providing research data, and we're providing another means of making research data available through it. We started this about two or three years ago with molecular biology, when we were looking at protein crystallography and trying to make their data available through ARROW. The point here is that there's a lot of simple data out there that we can pick up easily if we make a very generic way of putting it in. The photos you're seeing are from Kashgar; the photographer, John Gollings, wants to make his collection available, and it's quite a valuable set of research data. As far as we're concerned, it has a lot more generic uses than just the Kashgar collection. History of adoption: we've got a researcher at Monash University, Marian Quartly, who wants to collect stories about people and how they were adopted, and make them available. In the past the way they would have collected these stories was to hire some people to go out and collect them from the community, which wouldn't necessarily give a large or diverse sample. Now, through the internet, they can capture more of those stories. So we're using technologies such as Confluence to help them capture those stories, and Confluence is good because it's run through the Faculty of Arts; they've got a lot of experience with Confluence now, and they can do a lot of the moulding of the product we give them. We're really providing them with the underlying infrastructure and letting them mould and style the solution they use. So we're capturing those stories from the community.
We're actually storing them in Mediaflux at the moment, because Mediaflux gives us the ability to make the appropriate connections with metadata, and then we're making them available, once again to the community and to researchers, through Confluence. The intention is finally to connect up to ANDS. Two minutes? Yeah, thanks. Interferome. This is about a researcher in oncology who's quite well known, Professor Paul Hertzog. He's trying to collect all of these interferons: interferons are proteins that help regulate genes that are important for the immune system. He's got this Interferome database, and he's looking to us for help to capture more information from the community. At the moment they're entering a lot of data themselves; I think they've got, is it 40 at the moment? Something like that. But they'd like the community to contribute a lot more, so Paul is making connections with other researchers and telling them about the solution. Down in the bottom right-hand corner you can see the extra areas where we're hoping to get extra bits of this data put into the warehouse; the data warehouse is broader in scope, but Interferome is looking more specifically at the interferon collection. And as you can probably see, there are others we've got to work in with in that environment: as Gavin showed, you can see GO, another system that's very common in this particular space. So Interferome is getting established there, and we're also hooking it up with ANDS; that's pretty much what this project is about, making those particular bits of information available to the broader community. Microscopy collects a whole lot of TIFF images, and you can see a sample of them in the top left-hand corner. What they're finding is that it's very difficult to manage the research data: their students are collecting a whole lot of different images, it's very hard for them to find anything, and they can't search across the whole collection. What they'd really like is a holistic collection with the appropriate metadata, so that when they see a particular abnormality in, say, a cell, they can data-mine the collection, find out where else it has happened, and trace it back. One minute or less. So, once again, a very exciting project, working with the OMERO system that's coming out of OME, the Open Microscopy Environment; you can see an example of it down in the corner. We're modifying it to suit what we want to do here, and working on having some of the capture happen automatically: Vishya Garak, who's in the second row, is doing some interesting work out of David Abramson's group to automate that and make it very easy for the researcher to collect lots of data.
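To make the microscopy search idea concrete, here is a small, hedged sketch of pulling the embedded tags out of a TIFF image with Python's Pillow library, the sort of metadata that could be indexed for searching across a whole collection. The file and description are fabricated so the example is self-contained; the real Monash/OMERO pipeline may work quite differently.

```python
from PIL import Image
from PIL.TiffTags import TAGS

# Fabricate a tiny TIFF so the example runs on its own; a real pipeline
# would read the students' microscope images instead.
Image.new("L", (4, 4)).save("example.tif", description="hypothetical cell section")

def tiff_tags(path: str) -> dict:
    """Map numeric TIFF tag ids to readable names and their values."""
    with Image.open(path) as img:
        return {TAGS.get(tag_id, tag_id): value for tag_id, value in img.tag_v2.items()}

# These key/value pairs are what a collection-wide search index would ingest.
print(tiff_tags("example.tif"))
```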
The final project: each of the other ones looks at a very discipline-specific area, and we wanted something that would focus on the long tail of researchers and could be moved from one environment, one institution, to another. So this is the project targeting that, building an environment we can easily share with other institutions. Our code name for it at the moment, and it may change over time, is I Share My Research. What it's looking at is taking the work we've done in MyTARDIS, tailoring it and making it available to other research institutions, so that if they don't have a discipline-specific solution at the moment, they can apply this. And finally, the team: most of them are here today, except for Ray, who's unfortunately sick. I'll leave it at that. Thanks very much. My name is Oli. I'm from VeRSI, the Victorian eResearch Strategic Initiative, and I would like to talk about the metadata capture project we are doing at the Australian Synchrotron and, later, at ANSTO. The MeCAT project is a joint venture between the Australian Synchrotron and ANSTO in Sydney, aimed at improving the metadata management of all the experimental data captured at the Synchrotron and at ANSTO. The Synchrotron is a light source providing X-rays across a range of wavelengths, with different experiments running on nine beamlines at the moment and producing two terabytes of raw data a year; at ANSTO there are seven experiments using neutrons. We want to provide services to researchers so they can manage their experimental data better, and give a broader community the opportunity to search the data and access the raw data. We are focusing on developing an extensible, web-based search catalogue for these data in an ARCS-compatible data repository, and on facilitating the harvesting of the public metadata by the Australian Research Data Commons. The aim is to increase the visibility and value of these experimental data by improving scientists' own data and metadata management, providing more efficient use of beam time through reuse of already existing research data, and enabling other research groups to validate already published results by reproducing their experiments. We have a few challenges to face here, mostly technical, with a little human engineering as well. First of all, we need to design a standardized and automated workflow within the research practice of the scientists, so that the metadata is captured as the data comes off the beamlines; we need to interface this with the control systems. Once the data is captured and stored, we have to have several access controls in place, to make sure the data is not public before the publication is out, and also to meet the facility's requirement that data becomes public three years after it has been collected (there's a small sketch of this rule below). We have to have data management tools in place for versioning of the datasets and derived datasets, and to handle annotations. We need to create persistent data handles, so that if data is moved to different storage or just to a different location, the handles point to the new location. And we need to provide download facilities for the raw data, linked to the metadata. All of this needs to be interfaced with the already existing infrastructure of the facility: there is a proposal and booking database, for example, and local storage, which we need to interface with.
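As a concrete illustration of the access rule just described, here is a minimal sketch in Python of the kind of check such a system might apply: private until publication, public three years after collection. The function and argument names are assumptions for illustration, not the real MeCAT implementation.

```python
from datetime import date, timedelta
from typing import Optional

EMBARGO = timedelta(days=3 * 365)  # the three-year rule described above

def is_public(collected_on: date, published: bool,
              today: Optional[date] = None) -> bool:
    """Public once the related paper is out, or once the embargo
    from the collection date has elapsed (illustrative rule only)."""
    today = today or date.today()
    return published or (today - collected_on) >= EMBARGO

# Example: an unpublished dataset collected in 2005 is already public.
print(is_public(date(2005, 6, 1), published=False))  # True
```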
As the base technology for the metadata catalogue, TARDIS has been chosen. TARDIS is a federated diffraction image publication repository developed at Monash by Steve Androulakis et al., and it's archiving and sharing X-ray diffraction images at the moment from the protein crystallography community here within Australia. We plan to adapt it to all the other experiments conducted at the Australian Synchrotron. It uses the Core Scientific Metadata (CSMD) schema to describe experiments, and it's already deployed at the protein crystallography beamline at the Synchrotron; as the data is created there, the metadata is collected within the facility, as Anthony just mentioned. It allows researchers to sift through their files and associated metadata and then download parts of an experiment, or the entire experiment, to their institution or personal hard drive. It's already interfaced with the proposal and booking database of the facility, so it collects the information from the different databases, brings together the raw data and the metadata, and makes them available to the community. TARDIS itself is based on Django, an open-source web application framework in Python which aims to ease the creation of complex database-driven websites. Yeah, that's already it. Thanks, Marvin.
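Since that talk closed on Django, here is a minimal, hedged sketch of the pattern it refers to: data models declared in Python that the framework maps to database tables. The model and field names are illustrative assumptions, not the actual MyTARDIS schema, and the snippet belongs inside a Django app rather than running standalone.

```python
from django.db import models

# Illustrative models only; not the real MyTARDIS schema.
class Experiment(models.Model):
    title = models.CharField(max_length=200)
    collected_on = models.DateField()
    public = models.BooleanField(default=False)  # e.g. an embargo flag

class Datafile(models.Model):
    experiment = models.ForeignKey(Experiment, on_delete=models.CASCADE)
    path = models.CharField(max_length=500)
    # Free-form metadata captured at the beamline, e.g. instrument settings.
    metadata = models.JSONField(default=dict)
```

Declaring the schema this way is what lets a framework like Django generate the database, the admin screens and the query layer behind a catalogue of this kind.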
I just should mention that unfortunately our colleague from Physics couldn't make it today, so I'm filling in. The e-research office manages the projects, but we don't actually do them, so I'm going to have to be very careful about what I say, because I have no idea what I'm talking about. Never mind. The Centre for Materials and Surface Science at La Trobe is a major laboratory hosting a lot of equipment, and we have very similar issues to the Synchrotron: the equipment produces a hell of a lot of data, it has to be managed, and there are a lot of people involved in using it, from lots of universities and other organisations. So the problems are similar: we have data management and data access issues, there are legal issues, and the data has to be curated so it can be usable by other people, which is what ANDS is here for; it's a lovely overlap with the requirements of ANDS. The biggest problem they've found is dealing with the data from the moment of creation to some endpoint, which up till now has typically been a CD or a DVD or something like that. That pipeline has a whole series of steps to it: not just analysis but adding metadata, trying to understand the proprietary formats, which is a major issue in this laboratory with the equipment it's got, separating out all those elements, and even understanding what defines what is typically called the golden master of the data, the original that has to be kept in the repository and from which all other data can be derived. The other problem is that because of these proprietary formats the international field is actually held back from collaborating properly, and that's been a major issue at the various conferences where people of like mind meet. How do we deal with the outputs, and allow the companies that produce these instruments to feel secure enough about their IP and their profits that they can give up some element of their data formats to allow proper collaboration to continue? That's a very critical point for the lab, and they see themselves, and I'm talking about the lab at La Trobe, very much as leading in this area, especially with this project. So the ANDS project has come along at a perfect time to provide the framework for a lot of this to occur. We've talked about MyTARDIS at the Synchrotron, which I'd have to say is not the only possibility for dealing with this data; really, any database that can handle metadata and so on is theoretically suitable. So we have to be very careful about choosing something that has long-term value, accessibility and use. Obviously this space is a growing and changing field, so we have to watch it very carefully, but we're going to have to make decisions now, so a guess is as good as anything in some of this. The other thing is that many of the people who use the facilities at La Trobe have IP and commercial-in-confidence issues; they're using it as a commercial facility. So whatever is produced is almost certainly not a single-instance repository that's just available to everybody; it will have to be restricted, and in fact it will almost certainly have to be a multi-instance database, to make people comfortable that their data is not going to be viewed by somebody else, by their competitors and so on. But a major part of the data is meant to be out there, publicly accessible, particularly the metadata components. There's another aspect to the public data: the whole field of surface science has this issue of being able to compare what are called standard spectra, so that if you have the surface of a particular material, there is a world standard spectrum for that surface, somewhere you can go to use it in your analysis, which perhaps enables you to work out what the surface of your material is based on those worldwide standards. Those standards will only be produced through common goals, through working together and collaborating with other people. So the standards are obviously a difficult component, but the harvesting of the data from the machines to get to that point is what we're dealing with at the moment. VeRSI is helping us a lot; thank you very much for that. We've still got a long way to go, though. The end result, though, is that apart from having these databases that can be accessed, which is interesting in itself, being able to have the analytical tools online as well, so that people can get in and do what they need to do in a standardised way, will also be interesting. So I think a repository is actually only one step in that direction; we really need to think about how the analytical tools can be used as well. That's really about it.
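On the standard-spectra point above, here is a small, hedged sketch of the kind of comparison that becomes possible once spectra are shared: scoring a measured surface spectrum against a reference standard. Real comparisons would need calibrated energy axes and background subtraction; this is purely illustrative.

```python
import numpy as np

# Score how closely a measured spectrum matches a reference standard,
# here via a normalised zero-lag cross-correlation. Illustrative only.
def similarity(measured: np.ndarray, standard: np.ndarray) -> float:
    m = (measured - measured.mean()) / measured.std()
    s = (standard - standard.mean()) / standard.std()
    return float(np.dot(m, s) / len(m))  # 1.0 means identical shape

energy = np.linspace(0, 10, 500)
spectrum = np.exp(-(energy - 5.0) ** 2)  # fabricated example spectrum
print(similarity(spectrum, spectrum))     # ~1.0 against itself
```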
Good morning. I don't have any overheads, probably because even though I work for the e-research office one day a week, the other six days I operate as a harassed university professor. But hopefully the information I give you will be useful and informative. RMIT is in the final stages of approval of a data capture project called Data Capture from High Performance Computing in Multi-User Environments. Its collaborators at this stage will be the NCI National Facility, VPAC and possibly Monash University through MyTARDIS. The basic motivation for the project is that RMIT has a number of very strong groups researching computational condensed matter physics and computational materials science, whose work is basically in simulation and property prediction of materials. They have a significant publication record and a very strong ARC grant track record, and because of that, the data they generate comes under the Australian Code for the Responsible Conduct of Research. That document was produced in 2007, and it talks about university responsibilities for the curation of data, so the ANDS project came around in a very timely fashion. Because RMIT makes very strong use of HPC facilities, both at the NCI National Facility and at VPAC, we've identified a number of software programs that these researchers use: four, possibly five. These programs, which are used to do computational simulation of matter, are also used by a number of other institutions around Australia; I think three of the four we've identified are used by seven other institutions, and by a number of research groups within those institutions. So although it's important to RMIT researchers that we develop software tools to curate the data generated on these HPC facilities, I think the tools we develop will be of interest and use to a number of other institutions around Australia. The other institutions we've identified that use these programs are the University of Sydney, University of Melbourne, University of Queensland, CSIRO, Monash, ANU, Newcastle, Curtin University and UTS. The purpose of the software tools will basically be to interface to these four or five common simulation programs: they'll interrogate the output the programs generate and extract the relevant metadata, so we'll have a metadata wrapping, and then the metadata and the data itself will be stored in an appropriate format for long-term storage. We expect these tools will be developed from available open-source software, maybe with some shell scripting around them, and once developed they'll be freely available to all researchers, stored in some kind of open-source repository like SourceForge or Google Code. As an institution we have also committed to storing a number of data collections, using these tools, in common software and data repositories. One of the other things we'll probably do with the code we develop is build in some kind of time-trigger function, so that if the data is stored in a national data repository, after a certain period of time it would be opened to the wider community; we still have to negotiate exactly how we do that. But the basic idea is, as I say, to develop software tools which will interface with the common software available on these state and national supercomputing centres, interface with the output data, create metadata, and store everything in an appropriate format for storage. So that's the project we hope to embark on shortly.
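To illustrate the shape of those tools, here is a hedged sketch in Python of scanning the text output of a materials simulation code for recognisable quantities and wrapping them as metadata. The patterns, field names and sample text are illustrative assumptions, not the output format of any particular package or of the RMIT tools themselves.

```python
import json
import re

# Illustrative patterns for quantities a simulation code might print.
PATTERNS = {
    "total_energy_eV": re.compile(r"TOTAL ENERGY\s*=\s*(-?\d+\.\d+)"),
    "n_atoms": re.compile(r"NUMBER OF ATOMS\s*=\s*(\d+)"),
}

def extract_metadata(text: str, source: str) -> dict:
    """Interrogate program output and wrap recognised values as metadata."""
    meta = {"source_file": source}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            meta[field] = match.group(1)
    return meta

# Example run over a fabricated output fragment.
sample = "...\nTOTAL ENERGY = -123.456789\nNUMBER OF ATOMS = 64\n..."
print(json.dumps(extract_metadata(sample, "run01.out"), indent=2))
```

The metadata wrapper produced this way is what would be stored alongside the raw output, and the same place is where a time-trigger field for delayed public release could be recorded.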