So thank you. What I'm going to talk to you about today is a project that I think quite nicely illustrates how the e-science infrastructure being developed by ANDS and other organisations such as NeCTAR is having a real impact on making data-intensive research easier for researchers. I'm Jeff Christensen, as was probably mentioned. I'm a senior business analyst at ANDS, and I have a background in data science and biology. So what I'm actually going to start the talk with is not data but science, or biology.
This project is based around cancer research. Cancer is caused by unregulated cell growth: a cell loses inhibition, grows out of control and ends up forming a tumour. What actually causes this unregulated cell growth is the buildup of many DNA mutations in the chromosomes of a single cell. So this shows some DNA here, and some of these little A, C, G, T bases get mutated. This DNA is packed into chromosomes, shown here on the left, which are really structures that hold the DNA and pack it into the nucleus. As I said, there can be many, many mutations across all of these chromosomes. We have 23 pairs in humans.
The next slide shows what has happened now that whole-genome DNA sequencing has become cost-effective, so that it's possible to sequence the entire DNA content of a single cell or a group of cells, such as those in a cancer. What this is showing is quite a nice visualisation of all the mutations in one type of cancer. It's actually data from a breast cancer cell line, so it's not from a tumour; it's from a cell line that behaves like a tumour. The plot here is called a Circos plot. On the left we have those chromosomes all laid out nicely, one, two, three, four and so on, and here they're just represented going around the circle in clockwise fashion.
So we have one, two, three, four and so on around the circle. What whole-genome DNA sequencing has shown is that there are a lot of mutations in these tumours. There can be single-base changes, as I described just before; that's where, let's say, an A is changed to a T or a G is changed to a C. But there are also variations in the copy number of genes, and structural rearrangements as well. Some of these structural rearrangements can be quite extreme, and they're shown here by all of these red lines. This line here, for example, represents that half of chromosome 1 has been attached onto chromosome 8 in this particular cell. So basically, whole-genome DNA sequencing has shown that these tumours are actually much more mutated at the DNA level than we thought they were.
The project I'm going to talk about is part of an international cancer genome sequencing effort. There are 47 projects currently on the go across 15 countries. A lot of them are in the USA and various European countries, but there are other countries in there such as Mexico and Saudi Arabia. In this effort, Australia is involved in sequencing two types of cancer: ovarian cancer and pancreatic cancer. This is a really very large effort. They're going to sequence 21,000 tumours across cancers from a whole range of organs, and you can see a variety of those here: liver, lung, bladder, blood and so on. What they also need to do is compare the DNA sequence from the tumour tissue with the sequence from non-tumour tissue, so they also need to sequence DNA from matched non-tumour samples, and that's a lot of data.
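
The tumour versus matched non-tumour comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming each mutation is recorded as a chromosome, position, reference base and altered base; the `Variant` type, the `somatic_variants` function and the example values are hypothetical and are not taken from the project's own pipeline.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """A single-base change (hypothetical record layout for illustration)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_variants(tumour: List[Variant], normal: List[Variant]) -> List[Variant]:
    """Return variants seen in the tumour but absent from the matched normal."""
    normal_set = set(normal)
    return [v for v in tumour if v not in normal_set]


# Made-up example data: one variant is shared with the matched normal sample,
# so it is inherited rather than tumour-specific.
tumour = [
    Variant("1", 1000, "A", "T"),
    Variant("8", 5000, "G", "C"),
    Variant("2", 42, "C", "T"),
]
normal = [Variant("2", 42, "C", "T")]

somatic = somatic_variants(tumour, normal)
for v in somatic:
    print(v)
```

Real pipelines also have to deal with sequencing errors and with tumour samples that contain some normal cells, which a plain set difference like this ignores.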
So as I said, the aim is to sequence DNA across all those types of tumours. What's shown here is another pair of these Circos plots, with some data from a melanoma versus a lung cancer, and you can see just at this level that there are a lot of differences in the mutations between this particular cancer here and this particular cancer here. There may also be some common ones; for instance, this line across here might be the same as that one. So by looking broadly across a lot of different tumour types, the aim is to identify mutations that are common across all cancers. Are there mutations in there that will effectively make a cancer cell? But because there will also be a lot of information about tissue-specific cancers, it may also be possible to identify mutations that are specific to a particular type of tissue. And ultimately, what all of this information is being generated for is to inform therapeutic treatment. Hopefully better and more directed cancer treatments will be able to be derived from this information.
So back to the Australian component. There are two groups that are responsible for sourcing the tumours. The pancreatic cancers are coming from Sydney, from the Garvan Institute of Medical Research, and Professor Andrew Biankin is responsible for that. He's very much a clinician who is also doing biology research. The ovarian cancer tumours are coming from Melbourne, from the Peter Mac Cancer Centre, and Professor David Bowtell is responsible for that. The other part of the Australian component, which is very important, is actually doing the DNA sequencing and performing the bioinformatics, so gleaning information from the DNA sequence. Professor Sean Grimmond at the Queensland Centre for Medical Genomics, at the University of Queensland in Brisbane, is responsible for that.
So in this Australian component there's the derived data. This is the mutation information; it's not the raw DNA sequence, it's derived information. Part of the international effort is a requirement that that information is released through a data portal that is internationally accessible. There's a screenshot here showing some of the data for the pancreatic cancer, and you can see the Queensland Centre for Medical Genomics here. There's some derived data here, so this is showing, for a particular gene (it doesn't really matter what gene it is), that four out of 67 tumours sequenced have a copy number alteration or variation in that gene. If one goes in here, sees this information and clicks on it, there's some further information. It lists the ID of the donor and various information about the mutation. So here we can see that in this particular donor and this particular tumour there's a copy number of two, for instance. And once again, if you click on the donor, there is some information about the person that this came from: this is a 69-year-old male from New South Wales, and there's a little bit of information there about the type of patient.
All of that derived data is released through the international data portal, but access to the raw data is controlled, and that's because of privacy issues. Apart from things like somebody's name, almost the ultimate in identifying a person is their entire DNA sequence. So access to the raw data is controlled: only bona fide researchers who are doing collaborative research with these groups are permitted to have access to it.
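
A portal summary such as "four out of 67 tumours" amounts to a count over per-donor records. The sketch below shows that arithmetic only; the donor IDs, the `copy_number_altered` field and the `alteration_frequency` function are made up for illustration and do not reflect the portal's real data model.

```python
from typing import Dict, List, Tuple

# Hypothetical per-donor records for one gene: 67 tumours, 4 of them altered.
records: List[Dict] = [
    {"donor": f"DONOR_{i:03d}", "copy_number_altered": i <= 4}
    for i in range(1, 68)
]


def alteration_frequency(records: List[Dict]) -> Tuple[int, int, float]:
    """Return (altered tumours, total tumours, fraction altered)."""
    total = len(records)
    altered = sum(1 for r in records if r["copy_number_altered"])
    return altered, total, altered / total


altered, total, frequency = alteration_frequency(records)
print(f"{altered} of {total} tumours ({frequency:.1%}) have a copy number alteration")
```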
Okay, so the sort of broad mutations that I showed a couple of slides back, these ones here, are called by an algorithm that just runs through the sequence and predicts certain mutations of certain known types. But what the scientists working on this really need is to be able to analyse the raw data themselves, to identify other mutations, various rearrangements and so on. And really it's the wet lab scientists or the clinicians who would like to be able to analyse that raw data. Traditionally this would definitely have required a bioinformatician, because a lot of the analysis of the raw DNA sequence has to be done with scripts and things like that. But this is not really going to be the case any more if a virtual laboratory is used.
One of the other research infrastructure developers in Australia, NeCTAR, is developing a virtual genomics lab, and this ANDS project with the DNA sequence of the tumours is very closely aligned with the NeCTAR Virtual Genomics Laboratory project. So what is the NeCTAR virtual genomics lab? Well, it's basically a system that's going to allow DNA sequence data and analysis software to be held on the research cloud in Australia, with the analysis done on the research cloud as well.
One aspect is data integration, so that the data is accessible for analysis. The data integrated into the virtual genomics lab, the VGL, is going to include the latest human genome reference sequence. There's another very large data set called the 1000 Genomes data set, where the full genetic information from a thousand individuals has been sequenced, so part of that information is also going to be held in the VGL. And then Bioplatforms Australia, which is an NCRIS capability, is generating various framework data sets of importance to Australia. These are wine, because wine is obviously a large export crop; melanoma, which is a serious health problem in this country; and soil, because Australia's soil is not the best for agriculture and forestry and things like that, so a better understanding of what microorganisms live in the soil is very important. The other data set that Bioplatforms Australia is generating is for wheat. Again, wheat is of course a significant cash crop for Australia, and to grow it in the poor soils of Australia, a better understanding of the genetic makeup of the different types of wheat can help to increase yields.
So that's the data integration side of the virtual genomics lab. The other part is visualisation and analysis. The software for visualisation that will be running in the VGL is something called the UCSC Genome Browser. This is the University of California, Santa Cruz genome browser, and it's very widely used across biology for visualising genetic information. It may be slightly too small to see here, but this is just showing one chromosome, and this little red line here is blown up into this view. What you can do is have many, many tracks associated with it. These can be selected here, and they go way down the page; you can literally have thousands of tracks aligned to the reference sequence. So whilst those Circos plots I showed are very good for showing the variation within one particular type of tumour, the UCSC Genome Browser is great for showing variation across many different samples or tumours. So the UCSC Genome Browser will be running on the virtual lab.
The other aspect that's very important for the VGL is an analysis package, and that's called Galaxy. Galaxy is again widely used in biology; a group would generally have a local instance of it. What it is, effectively, is a workflow generator for web services. Each of the web services does a very small and specific bioinformatics task, but what a user can do is make incredibly complex workflows out of these by saying do this process first, and then do this process, and so on. So these very complicated workflows can be generated to take input data, run it through many processes and generate output data.
So that was a brief introduction to the virtual genomics lab. We're funding another project, which is called the Cancer Genome Linkage project, and what it is going to do is allow wet lab scientists and clinicians to analyse complex cancer genomics data using the VGL. So this is wet lab scientists like Andrew Biankin, whom we discussed before. The software development for this particular ANDS project will be done by the Queensland Facility for Advanced Bioinformatics, based at the University of Queensland. The development of this software will be closely aligned with that of the NeCTAR VGL, which is being developed by Dr Mike Pheasant, at the University of Queensland as well.
The data integration aspect of this project is going to be to include the very large raw pancreatic data set in the NeCTAR VGL, specifically so that Andrew Biankin and groups like his at the Garvan can access the data. The project is also going to develop those Galaxy workflows that we saw, so these will be reusable by the clinicians. What it will allow is much easier mutation searching and analysis of the raw sequence by the wet lab scientists and clinicians.
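
The "do this process first, then this process" idea behind a workflow can be sketched in plain Python as a list of small steps applied in order. This is not Galaxy's own API or workflow format; the step names, the record fields and the quality threshold are all hypothetical and chosen only to show the chaining.

```python
from typing import Any, Callable, Dict, Iterable, List

Step = Callable[[Any], Any]


def run_workflow(steps: Iterable[Step], data: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        data = step(data)
    return data


# Three small, single-purpose steps (illustrative stand-ins for tools).
def drop_low_quality(calls: List[Dict]) -> List[Dict]:
    return [c for c in calls if c["quality"] >= 30]


def keep_chromosome_8(calls: List[Dict]) -> List[Dict]:
    return [c for c in calls if c["chrom"] == "8"]


def count_calls(calls: List[Dict]) -> int:
    return len(calls)


# The workflow itself is just the ordered list of steps, so it can be reused
# on another data set without rewriting anything.
workflow = [drop_low_quality, keep_chromosome_8, count_calls]

calls = [
    {"chrom": "8", "pos": 100, "quality": 50},
    {"chrom": "8", "pos": 200, "quality": 10},
    {"chrom": "1", "pos": 300, "quality": 60},
]
result = run_workflow(workflow, calls)
print(result)
```

Keeping each step small and keeping the ordering separate from the steps is what makes such a workflow reusable by someone who did not write it.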
What we'll also be doing is minting digital object identifiers for the Galaxy workflows. A lot of the Galaxy workflows are very, very complex, so it's a great idea to have them reusable, and to have them citable as well, so they can be rerun on the NeCTAR Virtual Genomics Lab. What it also does, I think, is allow the workflow to be properly identified and rerun in a standardised way, so that users can make sure they're using exactly the same version of the workflow that was described, according to the digital object identifier. And then ultimately the software that's developed around the pancreatic data set will be redeployable for other groups, and obviously the first will be the groups studying ovarian cancer at the Peter Mac in Melbourne.
So that's really what I wanted to discuss today. I think it's a very nice example of a lot of the e-research infrastructure that's been put in place coming to fruition. There are a lot of connections, people and institutions involved in this project. Obviously there are the funders: the Queensland government, the national government and the NHMRC have all funded generation of this data or of the infrastructure that is being developed. Both NeCTAR and ANDS have been taking a role in ensuring that the VGL will be up and running. There are a lot of institutions generating data: the Garvan Institute of Medical Research, which is affiliated with the University of New South Wales, and the Peter Mac, which is affiliated with the University of Melbourne. And then we have UQ, which is doing the bioinformatics development along with QFAB and QCIF as well.
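
The point about a digital object identifier pinning one exact version of a workflow can be illustrated with a small checksum comparison. This is a sketch under stated assumptions, not how the project or any DOI service actually works: the DOI string is a made-up placeholder, and the `registry` dictionary, the workflow layout and the `matches_doi` function are hypothetical.

```python
import hashlib
import json
from typing import Dict


def checksum(workflow: Dict) -> str:
    """SHA-256 of a canonical JSON rendering of the workflow definition."""
    canonical = json.dumps(workflow, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def matches_doi(doi: str, workflow: Dict, registry: Dict[str, str]) -> bool:
    """True only if this workflow is exactly the version recorded for the DOI."""
    return registry.get(doi) == checksum(workflow)


# Hypothetical workflow definition and placeholder DOI (not a real identifier).
workflow_v1 = {"name": "mutation-search", "version": 1,
               "steps": ["align", "call", "filter"]}
example_doi = "10.9999/example-workflow-v1"
registry = {example_doi: checksum(workflow_v1)}

# A later edit to the workflow no longer matches the identifier it was cited under.
workflow_v2 = dict(workflow_v1, version=2,
                   steps=["align", "call", "filter", "annotate"])

print(matches_doi(example_doi, workflow_v1, registry))
print(matches_doi(example_doi, workflow_v2, registry))
```

Checking a rerun against the recorded checksum is one simple way for a user to confirm they have the cited version rather than a later revision.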
So what all of this is ultimately allowing is for data to be generated, for research to be conducted on that data, and for knowledge to be produced from that research. Of course bioinformaticians and clinicians are involved in all of this, but ultimately it's a very nice example of this infrastructure actually being used to help people, the patients, and to inform potential therapies. So thank you.