Good morning. Good morning. There we go: first session, first day. It's fine, you'll get the gist of it eventually. So good morning. We have a wonderful first talk, first cab off the rank: Alan McCulloch, who's going to be talking about (this is really interesting) DNA sequence data, some infrastructure, and rivers, using Python and R. So let's make him feel welcome.

Thank you, Katie. Yep, I'm Alan McCulloch. I'm a bioinformatics software engineer at Invermay, over the hill. What's a bioinformatics software engineer? It sounds a bit pompous and made up. It means that I'm not a very good systems administrator, I'm not a very good bioinformatics analyst, and I'm not a very good software developer, but I do a bit of all three. That's until my boss sees this talk and then maybe pulls me into his office.

My talk starts with a mystery that I faced in my sys admin role last year. I was dealing with a user with a bioinformatic computation which was causing trouble for our servers. This is the performance graph showing that user's RAM use, and it's looking like it's going to crash the server. Why? This is actually a small computation. What it's doing is assembling bits of DNA: joining them together, finding joins, but not very many of them by the standards of what we usually do. It shouldn't be behaving like this.

Just a digression, before I go through and show how we solved the problem, to give you some idea of the context in which we're running this. At Invermay we have about 1500 CPU cores and a bit over a petabyte of ZFS-based storage for our science HPC, with an HTCondor management system.
We're still running CentOS 6. The reason I'm digressing is that one of the things I like about Kiwi PyCon is the mixture of sys admins and developers, which you don't often get, so I'm hoping a few of you might be vaguely interested in some of the infrastructure aspects of this. We do centralised config management via Puppet, and all the Puppet stuff is in Mercurial. So even when someone requests storage for a project, that's all set up in Puppet manifests. It gets rolled out by Puppet so that all of the nodes get the mount points set up and the autofs updates, and we don't just go randomly creating storage on a server; it's all centrally configured. It works really well. It also means that we can capture some better data about the projects people are on: they go to a SharePoint site, fill it in, it generates an email, we update Puppet, and it rolls out. It all works pretty well.

So yes, our aspiration on the systems admin side, and especially of the previous systems admin, who I partly replaced and who was much better at it than I am, is everything RPM-packaged into a yum repo and deployed by Puppet, because what we want is a consistent environment across all our nodes, so that when somebody runs something it behaves in the same way. This is an example of one of our in-house packaged utilities: a bioinformatics utility, a gene prediction program. We just get it as a tarball, but we wrap it in an RPM.

That's the theory. But one of our big infrastructure challenges, one that beats me about the head regularly (some other people get beaten up by things like funding enough storage), is this: how do we balance providing a secure, standardised environment, so that all that software works safely and reliably and performs well (that's the RPM packaging and the tightly controlled Puppet), against being responsive as well?
We often get these requests for one-off software stacks, and they conflict with some of our libraries; we can't just overwrite system libraries. There are various virtualisation approaches, virtualenv and maybe Docker. We haven't started using Docker yet because we really need to go to CentOS 7, virtual machines and so on. But that's a real challenge for us. On the one hand we get hammered for not being responsive; on the other hand we've got to provide a robust, reliable environment.

Some of the approaches we use to try and get a bit of responsiveness: environment modules are great. Most of the modules that you see listed here, available on one of our servers, have some Python component, and those ones there are pure Python. I'll move on quickly from that slide in case you look at some of the version numbers, some of which are a bit out of date.

Another thing I'm doing lately, instead of RPM-packaging everything and rolling it out: one nice thing we have is a shared file system. All our nodes mount the same file system, so I can install something just on that shared file system and then roll out an RPM package which, basically, all it contains is a module file, which sets up the user's environment to point at that shared file system install. This package here is for rolling out our Python 3 environment. You can module load Python 3, do some stuff, then module unload it and get back to the boring old Python 2.6, which is required to make some other things work. That avoids people completely contaminating their environment, which they sometimes do, and then nothing else works.

Okay, back to what this talk is supposed to be about. What we tried with this: we couldn't figure it out.
We tried running it on a big computer with one terabyte of RAM. No joy. I did look at some of the runtime options and found a couple of mistakes, and I thought fixing those would fix it, but it didn't. We tried reducing the amount of data, just using one of the files: didn't fix it. Take a small random sample of the data, what happens then, does that fix it? I stayed at it all weekend; I thought surely that last step was going to fix it, but no, that was the result.

So if there's nothing wrong with the computers, maybe there's something funny about the data. What is the data? It's fungal sequence data, DNA data. Some facts about fungi: they're very interesting, a whole kingdom of organisms in their own right. They're not plants and they're not animals, but they're far less studied than plants and animals. Really interestingly, like some animals, they have chitin in their cell walls. Chitin is mostly what insects, crabs, and parts of squid and fish are made of, which is really fascinating. There are about five million different species of fungi, and they're very diverse: mushrooms, yeasts, moulds. A very common one would be Aspergillus, which we're probably all breathing in right now.

Why we're interested, though, is because of some fungal endophytes associated with grasses which are important to our economy. The picture on the right there shows some plant cells with little fungal hyphae growing through them, those little wiggly lines. Those fungi produce secondary metabolites, chemicals in the plants, which are often actually good for the plants, but they can be toxic to herbivores such as sheep and cattle. So there's some interest in trying to make a ryegrass endophyte, as they call them, which is still good for the grass but doesn't poison the stock.
So that's why we're interested in that. We think there's something wrong with the data, maybe. The theme of this talk is throwing away information: if you look at DNA sequence data, there's a lot of it, and the theme of this talk is throwing away information to find some structure in the data.

Before we tackle DNA sequence data, we could look at a simple data set, the lengths of major North American rivers, to see how we typically throw away information, and what sort of approaches from the usual methods can be applied to DNA sequence data. This is a toy data set which comes with R: the lengths of the North American rivers. We can do something Pythonic there and throw away most of the information, summarising it all by one number, the average length of those rivers. That's the R code there; R is a bit more prosaic, as a statistical language, so of course it does things like means fairly succinctly. But summarising all that data by one number is maybe throwing away too much information.

So let's throw away a bit less information. We'll do some binning: we'll get some bins of river lengths and count how many fall into each bin. We can do something Pythonic there using my favourite Python module, itertools. We divide each length by a hundred to get it into bins, and then group by that. And we can also do something interesting, which I hadn't actually tried before: mapping the print function. The itertools groupby function returns a series of iterators, so you iterate over iterators, and each sub-iterator iterates over just that group.
So to get the frequency of each bin, you just convert each of those iterators to a list, and then do a little bit of Pythonism to make a string consisting of bar characters, so you can do a quick and dirty visualisation of that. I hadn't tried mapping print across an array like that before, and when I tried it first it didn't work. Of course, that's because print is a statement, not a function, so you have to do that little bit of magic there: from __future__ import print_function. Then you can map print across an array, which is quite handy if you're into that sort of thing. Of course, R is quite prosaic about this. It just sighs and says: look, that's a lot of nonsense, you're doing a stem-and-leaf plot, let's just do that. And that's how R does it.

Another example of throwing away information to get what you want, of course, is screen scraping. I wanted to look at New Zealand rivers, so I went to Wikipedia. The longest rivers were fine; they were in a table that I could copy and paste into Excel. But the semantic web is not quite here yet, because the rest of the rivers were in amorphous blocks of text. With a bit of Pythonism, using my second favourite Python module, the regular expression module, I could quickly, in one minute, get out the rest of the lengths of the rivers of New Zealand, and I could apply the same information reduction. So we've thrown away information to find out that New Zealand rivers are much shorter than North American rivers, and also there's a suggestion that they have a slightly different distribution; maybe it's a bit bimodal. That could make sense.
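The binning described above can be sketched in a few lines. This is a minimal reconstruction rather than the actual slide code, and the river lengths are invented sample values, not the real data set that ships with R:

```python
# A minimal sketch of the binning idea: divide each length by 100 to
# get a bin key, group with itertools.groupby, and map print across
# the rows. The lengths below are invented sample values.
from __future__ import print_function  # makes print a function in Python 2
import itertools

lengths = [135, 202, 210, 250, 301, 320, 360, 392, 450, 735, 870, 1205]

# groupby only merges *adjacent* equal keys, so sort first.
binned = itertools.groupby(sorted(lengths), key=lambda x: x // 100)

# Each group is itself an iterator over just that group's members;
# convert it to a list to count it, then draw a bar of '*' characters.
rows = ["%4d: %s" % (key * 100, "*" * len(list(group)))
        for key, group in binned]

list(map(print, rows))  # quick and dirty histogram
```

In Python 3, map is lazy, hence wrapping it in list to force the printing; in Python 2, the `__future__` import is what makes print mappable at all.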
We're a small country, but a long skinny one. Maybe there are some rivers that flow along the length, like the Waikato, which flows from the volcanic plateau to just south of Auckland, and then there are other ones that go sideways, like the South Island rivers on the West Coast. It could be a North Island versus South Island thing, or it could be just that this is a pretty pathetic analysis and there is no such bimodal distribution. Another common way of throwing away information is to replace data with ranks. It's not particularly useful here, but I will use that later to help analyse this DNA data.

So let's get on to the DNA data. A variation of the binning approach, which is sometimes used with DNA data to throw away some of the information we don't need (well, we might need it for some things, but we want to find some structure), is a kind of binning where we look at the frequency of little words. On the right-hand side there, there's a series of short six-letter words, and what we do is go through our big sequence file and count how many of each of those words is in there.

Yes, those letters are nucleotide bases in the DNA molecule, and one of the main things they do is code for protein: every three letters codes for an amino acid, in protein-coding regions. So parts of it translate into protein, but parts of it do a whole lot of other things, and most of it, to be honest, we don't know what it does. So I can't see anything there immediately.
If you work with DNA data a lot, you get to know it really well; people might see, oh, there's a little repeat. But usually you have to search against a database or run another tool to find out what it is. The "sequence one" there is a DNA read, and that unit of information is partly an artifact of the sequencing technology. A chromosome would be "sequence one" followed by millions of letters; here it's "sequence one" and a much shorter run of letters. That's an artificial break-up, just because the way the technology works is that it sequences little bits, and that's why we were having the problem at the start: you have to join all those little bits back together to get your complete chromosomes.

Like I said, I'm not a very good bioinformatics analyst. What we end up doing is essentially converting that big file (we can forget about the labels of the words) to a vector of numbers, each number a count of how many of one of those little words is in there. Obviously this contains less information than the DNA itself.

So what are we doing? We've got a misbehaving computation, maybe not the computers; we're going to look at the data, and we're going to use these "spectra" to characterise it. Here's a visualisation of the spectra (well, I've called them spectra; they look a bit like spectra) of sixteen files. Each column is one of those big long vectors of numbers calculated from one of the files, and each row is one of the words. We haven't been able to label all the words, because with six letters there are 4,096 potential words, so obviously we can't label them all. There are 4,096 tiny little slivers of colour there, because I have plotted all of them. So, have we succeeded in finding any structure?
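The word-counting reduction described above might look something like the following sketch. The reads are invented, and k=3 is used instead of the talk's six-letter words so the vector has 64 elements rather than 4,096; the mechanics are identical:

```python
# A sketch of reducing sequences to a vector of k-mer counts. The
# reads are invented; real input would be millions of reads from a
# sequence file.
from collections import Counter
from itertools import product

def kmer_counts(sequences, k):
    """Count every overlapping k-letter word across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

reads = ["GATTACA", "ACAGATT", "TTTTGAT"]
counts = kmer_counts(reads, k=3)

# Forget the word labels: fix a consistent ordering of all 4**k
# possible words and emit just the vector of counts.
words = ["".join(p) for p in product("ACGT", repeat=3)]
vector = [counts[w] for w in words]

print(counts["GAT"], sum(vector))  # prints: 3 15
```

Each file reduces to one such vector, and sixteen files give the sixteen columns of the heat map he shows.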
Yes, we have, because this is a heat map, this visualisation of those vectors, and it's been able to cluster them by finding the distances between the vectors. It has correctly allocated each of those files to the species of fungus they are, so it's finding structure in there. We're standing back from this data now and we can see some patterns.

What I've done here as well, rather than plotting the frequency and working in terms of frequency, is take a measure of information: minus log of the frequency, the count divided by the total number of words. This is a Shannon information idea. The idea is that if you've got a word there's heaps of, it's very common, then it gets a low information measure, because it's not very surprising to see it. If there's hardly any of a word, it's very surprising to see it, so it gets a high information measure. Those white ones there are low information measure, and there's lots of those. That's good, because it agrees with some results some colleagues of mine found a couple of years ago: the Epichloë fungus has MITEs, miniature inverted-repeat transposable elements, which are basically little DNA words scattered all through the genome. There's lots of them. So it's finding some structure.

There's a funny bit of structure down there, where we see these MITEs, but it's not segregated by species. We're seeing, for example, that some of the Acremonium files have lots of those little words, but some have few of them. That doesn't really make sense, so it's kind of an interesting feature that I hadn't really noticed. Now, I did this with a Python script called kmer_entropy, okay?
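The information measure described above, minus log of each word's relative frequency (its Shannon self-information), can be sketched like this; the counts are invented stand-ins for 6-mer counts from a real file:

```python
# A sketch of the information measure: common words score low
# (unsurprising), rare words score high (surprising). The counts
# are invented.
import math

counts = {"AAAAAA": 9000, "GATTAC": 900, "TTTGCA": 90, "CGCGCG": 10}
total = sum(counts.values())  # total number of words counted

# Shannon self-information of each word: -log2(relative frequency).
info = {word: -math.log(n / float(total), 2) for word, n in counts.items()}

for word in sorted(info, key=info.get):
    print("%s  %5.2f bits" % (word, info[word]))
```

With these numbers the very common word scores about 0.15 bits and the rarest scores nearly 10 bits, which is why the abundant repeat words show up as the low-information (white) bands in his heat map.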
So, should I have written the script? There are C programs that can do analysis of sequence data into these words quite quickly, and most software shouldn't be written, because there's always something that does it already; that's true. But the C programs I'm thinking of are a little bit more tricky to use; you have to do multiple indexing steps. And there were a few things I wanted to do with this. I don't want just frequencies; I want this information measure to be output. I want to be able to use multiprocessing to speed it up: with this script I can ask it to use ten processes, and each process will process some slices of the file. The other thing is that I want to be able to random sample the file, because often I have a huge file, and to get a picture of what's in it I don't need to analyse all of it, so it's often a good idea to random sample it. And finally, I want to be able to run this program and say: please process all files in that folder and give me a table. Whereas with some of the other methods, you process one file, you process another file, and then you have to do a whole lot of data combining. So yes, I think there's a use case for it.

The heat map, though, wasn't done in Python. That was done with the heatmap.2 package from a thing called Bioconductor, which is an R project. It's got 1,211 packages, and heatmap.2 is one of them. As a data analyst you look at that number and you think "yay"; as a systems administrator I look at it and go "oh no". The heatmap function does that internal clustering you saw. There are a few tips and tricks with heat maps. I couldn't have labelled all those 4,096 rows, so how do I label every nth one? I can set every, say, tenth row name to blank, but then they get reordered by the clustering. So how do I make it come out right? The
trick is to draw the heat map twice. The first time, you throw the image away, but the heatmap function returns an object which has, for example, the indexes of the clustering that was used. So from the heatmap object returned from the initial call, you can munge around and blank out exactly the right row names, and finally draw the final heat map so that it all looks reasonably nice.

Here's another information reduction of sequence data, using a different approach. Here we're not doing the words; we're doing what I might call more of a semantic approach. We're taking all the sequences. This is actually not fungal data, to be honest; it's another data set, and we want to see what's in there. It's supposed to be all pig data. We search each sequence against a database, we look at the sequence it hits and what species that was, and then we try to infer what species we have in our sample, visualising it in a similar way: we're again reducing each file to a vector of numbers, just with the numbers meaning a different thing. So we're able to see from the heat map that in fact there are some samples there, segregating, that don't appear to have any pig sequence.

One of the things about these heat maps is that there are a whole lot of arcane options, and it can take a long time to get them right, I've found. You may feel that one thing I've got wrong there is that my row names aren't large enough, and that's certainly true; there's one of those arcane options back there which I need to fix.

The other thing you might notice (well, you probably don't, because of the small row names) is that apparently we've got some Cape elephant shrew sequence in this data. Now, that's extremely unlikely. So why do we see Cape elephant shrew there, and what is a Cape elephant shrew?
This sequence data could not have any Cape elephant shrew contamination in it. So with that sort of approach you have to watch out. The word of caution there comes from Charles Darwin and Alfred Wallace, who discovered the tree of life. The reason we get hits like that is because all of us, shrews, people, sheep, goats, deer, et cetera, have a common ancestor who lived, breathed and enjoyed life about a hundred million years ago, and a lot of the DNA sequence that was in that ancestor is still in us. The other reason is that when we search these databases, the hits we get depend on what's in them, and that depends partly on what research projects have been funded. If there's been a whole lot of Cape shrew sequencing done in the last ten years, there will be a lot of those sequences in the database and we'll get hits to them. So this semantic approach does have its problems.

There's a contemporary view of the tree of life; most of it is bacteria. We're in that little green sliver down the right-hand side there. And that is a sengi, which is an elephant shrew. If there's a Python conference prize for the cutest slide, then I hope I'll win it with this one. They actually have to be called sengis now, not shrews. They might originally have been called elephant shrews because they eat insects like shrews and they have that thing reminiscent of a trunk, but in fact, ironically, their DNA shows them to be much closer to real elephants than to shrews. So the name is now sengi.

Okay. So we saw a heat map where we'd summarised the little six-letter words, and a heat map which summarised hits to a database. I won't go into this notation, for obvious reasons, but we can abstract from that: we're applying a reduction operator across each word, and we're yielding a vector of numbers, for that first analysis. And in the second
one, we're doing the same thing, except that now the counting operator (the little i there) is operating over a series of organism names. So t1 might be elephant shrew, or sengi, and t2 might be pig, and we're again generating a vector of numbers. So we can abstract from that and see that there's this thing we do to sequence data a lot, which is to throw away information and get columns of numbers.

The scripts that I've developed have been engineered into a kernel, plus clients of that kernel, and I've used a functional programming approach, which is a bit different to the way you sometimes do these things. The kernel doesn't know anything about file formats, which are a big problem in bioinformatics. It does know about multiprocessing. It does have some default binning and parsing services, but they can be overridden by callback functions supplied by the client. It does a whole bunch of things like collating results into a table, and persistent data summaries, so that we can point the program at a file and add new files later.

I will quickly keep going. That's the sort of data file we're parsing, and this is an example of the functional programming approach. This is client code: create the kernel, and the client defines these methods, my_top_hit_provider and my_hit_provider, which are basically generator functions, and registers them with the kernel. That means the kernel can get on with processing just a data stream, without having to worry about file formats. This is the way the multiprocessing is done, slicing up the file. I won't go into that, but it does use itertools, my favourite module, to slice up the file.
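The record-slicing pattern can be sketched with itertools.islice. This is just the slicing idea, under the assumption that records arrive as an iterable; the real kernel's multiprocessing machinery isn't shown in the talk:

```python
# A sketch of round-robin slicing of records among worker processes:
# with 10 workers, worker 0 handles records 0, 10, 20, ... and
# worker 1 handles records 1, 11, 21, ... islice selects lazily, so
# no worker materialises the whole file.
from itertools import islice

def my_slice(records, worker_index, num_workers):
    """Yield every num_workers-th record, starting at worker_index."""
    return islice(records, worker_index, None, num_workers)

records = ["rec%d" % i for i in range(12)]  # stand-in for file records

print(list(my_slice(records, 0, 10)))  # worker 0 gets rec0 and rec10
print(list(my_slice(records, 1, 10)))  # worker 1 gets rec1 and rec11
```

Because each worker walks the same cached file with a different offset and stride, the slices partition the records with no coordination between processes.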
So one of the processes will process records one, eleven, twenty-one and so on; another process will process records two, twelve, twenty-two, and so on. And because of our large-RAM machines, files are cached, so they're really just reading from memory, and that's pretty fast.

So, what about the problem we haven't solved yet? What I ended up doing was, instead of reducing the data to get vectors, I found a way to generate matrices, and I did that by not only counting the words but also including their ranks. That gives us a kind of interesting structure here, which brings out more structure in the data. We can see that the slope of these rank-versus-frequency plots corresponds with the taxonomy of what's in the file. Once again, it corresponds nicely with what we saw before. But there's this funny turning point here, which comes out really strongly now.

What we've rediscovered here is essentially Zipf's law. Those previous plots were plotting rank against frequency, and as Zipf discovered a while ago when he looked at rank versus frequency of words in English text, there's a power-law relationship: you plot log rank against log frequency and you get that slope. Our slopes are the other way around because, rather than log frequency, I'm taking minus log frequency divided by the total, so I'm using an information measure, but it amounts to the same thing. So the mystery was solved by that analysis: it found that there was contamination of the sequence data.
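The Zipfian rank-versus-information view can be sketched as follows. The counts here are synthetic, generated from an ideal power law, so they produce one clean slope; a contaminated sample, as described in the talk, would instead show a kink (a turning point) where the two populations of words meet:

```python
# A sketch of the Zipfian view: rank the word counts, then relate
# log(rank) to the information measure. Counts are synthetic, with
# frequency proportional to 1/rank (an ideal Zipf distribution).
import math

counts = [int(10000 / r) for r in range(1, 101)]

total = float(sum(counts))
points = []
for rank, n in enumerate(sorted(counts, reverse=True), start=1):
    info = -math.log(n / total, 2)       # information measure, as before
    points.append((math.log(rank, 2), info))

# Estimate the slope from the endpoints; for counts ~ 1/rank the
# log-log relationship is linear with slope 1.
(x0, y0), (x1, y1) = points[0], points[-1]
slope = (y1 - y0) / (x1 - x0)
print(round(slope, 2))  # prints: 1.0
```

A real file would give one such slope per organism, which is why a second, foreign slope (or a kink between two slopes) flags a contaminant without needing any database search.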
So The graph based algorithm is finding links between all these sequences But because there was this common contaminator in almost all the sequences It was finding links between almost all of them and and basically giving us like an exploding graph structure So I went on like a dog with a bone I went on to apply this um this new um zipfian view to lots of things User came to me a month later with a similar problem and I just ran the zipfian analysis And found that we observed a similar quite interesting sort of Slope the the slope that of this plot of rank versus frequency of these words correspond. This is bacteria now To corresponds to the organism so we can distinguish the organism and there we see the contaminator I straight back to the person said nothing wrong computer. Sorry. It's your data um I was able to help them Honestly a data cleaning which they went on to do um So that's the original data got nasty turning points Some cleaning done, but the database used to do the clean wasn't complete. It's improvement, but not good And finally the final result was really good So that's quite cool because actually to find out whether the cleaning you've done on this data has worked It's it's quite it's quite tricky. How do you know whether there's still some contaminates contaminants in there that you didn't know about Well, you can see it if you do a information reduction like this and plot this because If if if you've got that contaminate, it'll show up these non biologic non biological turning points We can use it for qc of taxonomy. 
This sample was supposed to contain only deer, but we can see that the data segregates into these two slopes, so we can go back and say: no, it must contain something else. And it did: it contained some cattle, and the lab confirmed that it contained cattle. It's handy to have this, what I call a non-semantic, approach, because, as I mentioned, with the semantic approach, where you take your sequences and search a database, you get hits to elephant shrews, or sengis; you get hits to all sorts of things, and it can be quite difficult to interpret, whereas counting DNA words is unambiguous. Another application of it: this one has various problems, and those are DNA repeats. Those are not actual non-biological contaminants; organisms have these repeat sequences littered through the genome, like those MITEs I mentioned, and these weren't supposed to be in the sample, but they were.

So, future work. Software engineering quality: I don't know if anybody else will find this useful, but I would like to at least make it distributable. I'm still working on that, and on things like unit tests. I'd quite like to do some Python-based graphics instead, so I'll look at matplotlib. And I'm thinking about extending what I've called the reduction abstraction. We started off by generating vectors; then we used this other sort of template to generate matrices, and we couldn't use heat maps to visualise those, but we could do these plots. But we can extend that. We had a ranking and a counting operator there; maybe we could include a ranking, a counting and some other operator in that reduction. What would that look like?
And for example, we could apply those operators over more than one set of things. In the cases previously, one-to-N was the DNA words in one example, and in another example one-to-N was a set of organisms. What if we took a cross product of organisms and words and did reductions like that? That could be quite interesting. I don't know how we'd visualise it, though. And that's it. Thank you for listening.

Thank you for that. We have about a couple of minutes for questions, actual questions, if we have questions. Do we know how to ask questions? I think we know how to ask questions. I'm just going to go over here.

I just wondered why, given that R has some quite nice graphics, you were interested in getting rid of the R and making use of matplotlib?

It's probably a bit strong to say getting rid of; I'd just like to explore it. Getting those plots right can be really time-consuming, and I suspect it's really time-consuming no matter what you do. I guess if I could generate the plots all in one thing, it would maybe make my code more easily distributable, instead of having to distribute some R code and some Python code. But it's not a high priority; it's just something I thought I'd explore if I got time.

Did we have any other questions? Obviously, being a New Zealand-based scientific, what do they call it, research agency organisation, you've got this body of work: how much do you share with other international organisations, and how do you go about that? Because obviously there must be, say, an organisation in Australia or the States doing similar kinds of things.

Yeah, there is a lot of sharing going on.
So, that project there is on our GitHub site, and it's a public repository, so it's shared in that sense, and we do collaborate as well on other things. So yeah, there's a reasonable level of sharing. Occasionally you do run into IP-type issues, because some of our research is privately funded, but not too much in the software area, really, generally.

Another question: if I understood what you were saying, you'd written a kernel to run this across your clustered computing environment. Have you considered how that could be shared, so that I could do some gene sequencing on my computers at home? In effect, if I understand it, some of your technicians are writing some Python code that plugs into your kernel to do the investigation on their data. If I want to run that same sort of setup, I don't have the thing that your kernel runs on; can that be generalised?

The vast majority of software that we use is not written by us, and it is mostly open source, so generally speaking you can get the resources you need. And most of what we do, if we do something, is on our GitHub site, generally. So if you do manage to get yourself a sequencing machine, you'll be right.

Got time for one more. There we go. The University of Otago archaeology department actually came to me the other day and said, you know Python, can you help us with some DNA sequencing? I was like, I have no idea what you're talking about, but, you know, yes I do. So my question is: where should we get started if we want to learn a bit more?

In Dunedin there is NZGL, which does sequencing and supports sequence analysis and bioinformatics. Do you mean learning, or getting resources, or...?
Yeah. Yeah, there are quite a few people at Otago, like in Biochem, or NZGL, or give me a ring at Invermay and we can talk more.

And we're out of time. Let's all thank Alan again for a really good talk. Thank you. Thank you. So we have about ten minutes now.