Okay, is that working? Great. So my name is Camden, I'm from UCI in Dr. Mortazavi's lab, and today I want to talk about my software, SOMatic, which is used for making self-organizing maps, a way to visualize the high-dimensional data sets that ENCODE uses. I'm going to give a little bit of background: what is a self-organizing map, and why would you use one? I'll do an in-depth description of how you train a self-organizing map, so those of you who are bioinformaticians will know what the algorithm does. I'll talk about how to use the software to build your own self-organizing map, and then I'll do a tutorial on the self-organizing map viewer, which basically lets you supply a website to anybody who wants to look at your data, so they can explore your data set at their own leisure.
So the basic problem is that data sets above three dimensions cannot be visualized easily. Humans are pretty good at looking at graphs or other visual representations of their data and making sense of them, but computers are not the greatest at it, so in order to analyze very high-dimensional data sets, there should be a way to visualize them so you can look at them yourself. There's already a system for doing that called PCA, which attempts to reduce the dimensions of your data set. The problem with PCA is that it assumes a lot about your data set, namely that it's a linear space, and every time you drop a dimension from your data set you lose a lot of spatial information. We'd like to be able to do that without losing any spatial information. That is done through self-organizing maps, which are a kind of non-linear PCA, and I'll talk about how to use them. So SOMs are sort of like a 3D stack of images.
Each slice of this stack represents a different experiment, or a different dimension of your data set, and each slice is divided into a bunch of hexagons. These units each represent a cluster of genomic segments, or genes, or GO terms that have the same profile across all of the different experiments that you ran. For this particular SOM, we're running it on a toroid, so the whole thing is edgeless: if you go off the bottom, you come back in at the top, and if you go off the left, you come back in on the right, so you don't have to worry about any weird edge cases. These SOMs can be used to mine for interesting results. In the human and mouse ENCODE papers we found some interesting results, and Mortazavi in a recent paper also found some interesting results using self-organizing maps.
So I'm going to talk very briefly about how the algorithm works. The first thing you have to do is build a training matrix for your data. In the future, my software will include ways of segmenting your genome automatically, but for now you're going to have to figure out your own way of segmenting your genome. You can use ChromHMM, you could just segment your genome every kilobase, however you want to segment it. You want to build a matrix that has your segments as the rows and, as the columns, the RPKMs of each experiment in that particular genome segment. There are a bunch of different ways you can segment the genome, and I could talk about that later.
The algorithm will initialize a toroid, which is how it's edgeless, with your genome segments at random. Then for each time step it'll take a vector from that training matrix, find the unit in the map that it's closest to in terms of profile, and pull that unit and the units around it closer to that training vector.
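The single training step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SOMatic's actual C++ code: it uses a square grid instead of hexagons, squared Euclidean distance between profiles, and a Gaussian neighbourhood, and the names `toroidal_distance`, `best_matching_unit`, `train_step`, and `init_som` are hypothetical.

```python
import math
import random

def toroidal_distance(a, b, rows, cols):
    """Grid distance between units a and b when both axes wrap around."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    dr = min(dr, rows - dr)  # going off the bottom re-enters at the top
    dc = min(dc, cols - dc)  # going off the left re-enters on the right
    return math.hypot(dr, dc)

def best_matching_unit(som, vector):
    """Return (row, col) of the unit whose profile is closest to vector."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(som):
        for c, unit in enumerate(row):
            dist = sum((u - v) ** 2 for u, v in zip(unit, vector))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_step(som, vector, radius, learning_rate):
    """Pull the best matching unit and its neighbours toward vector, in place.

    radius must be positive. The Gaussian falloff is an assumption made
    for this sketch; the talk does not say which neighbourhood is used.
    """
    rows, cols = len(som), len(som[0])
    bmu = best_matching_unit(som, vector)
    for r in range(rows):
        for c in range(cols):
            d = toroidal_distance((r, c), bmu, rows, cols)
            if d <= radius:
                influence = math.exp(-(d * d) / (2 * radius * radius))
                unit = som[r][c]
                for i, v in enumerate(vector):
                    unit[i] += learning_rate * influence * (v - unit[i])
    return bmu

def init_som(rows, cols, training_matrix, rng):
    """Fill every unit with a copy of a randomly chosen training vector."""
    return [[list(rng.choice(training_matrix)) for _ in range(cols)]
            for _ in range(rows)]

if __name__ == "__main__":
    rng = random.Random(0)
    matrix = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.9]]
    som = init_som(4, 5, matrix, rng)
    print(train_step(som, rng.choice(matrix), radius=1.5, learning_rate=0.5))
```

Because the distance wraps on both axes, a unit in the top row counts as a neighbour of one in the bottom row, which is what makes the map edgeless.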
Oh yeah, and at every single time step it'll also reduce the radius and the learning rate. So over time, the map will train to become a representation of your experiment space.
In order to let people make their own self-organizing maps, I created a tool called SOMatic. It's built to be very general. It'll work for any sort of coordinate system you put into it, so if you've got genes or microRNAs or microarrays or whatever, you can put them into the segmentation column, the first column, and it will just work. I also built the tool to be sort of hackable, in that all the output files are there, so if you'd like to write a script to throw your own overlays on top, or if you want to make your own custom maps, that's all supported by the tool. It'll also automatically build a website, so that you can view your maps visually instead of combing through reams of data like you would with any other SOM tool out there.
There are some requirements for using SOMatic. I wrote SOMatic in C++ to try to make it as fast as possible. Obviously, over time, I'm going to make it more efficient, but as it is, it's much faster than any tool that's out there right now. You need to have G++ version 2.8.2; you can check that by running that command in your terminal. SOMatic has been built and tested in a Linux environment, but I've heard that some people have been running it on their Macs and it works okay, so that's great. Just know that I'm only testing it for use on a high performance cluster. And the SOM viewer itself needs to be placed on a web server.
You can actually set up your own web server locally on your own computer by downloading Apache and running it on localhost, but the best way to use it is to put it on a web server for your lab or on your own high performance cluster, which allows you to share it with all of your colleagues and lab mates who want to explore your data set. There's one other little thing that's required for this to work: the Apache server has to have directory listings turned on, so that the program can look through all the maps that you've created and build a custom website every time it's run.
You can download the latest version off of our server at crick.biota.uciddu. If you go to the SOMatic part of that address, there's also a very simple page that describes the previous editions that have been released, the feature set, and what's coming soon. That will constantly be updated, especially over the summer, as I add more and more features. You want to make sure that your GCC version is high enough and is loaded correctly by running the g++ version command there. Then you can untar the SOMatic folder that you just downloaded, go inside the bin directory, and run make, and it should build everything for your current system.
There are two files required for running this program. The first one is the training matrix that I discussed earlier. This is a tab-delimited file. The first column is all of the segments that you're running it on. They don't have to be genome coordinates like in this example; they could just be gene names, if you're working from an RNA-seq data set, say. Then all the RPKMs, which are the data you're running it on, follow in tab-delimited format. You can have as many columns as you'd like, as many experiments as you'd like.
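To make the layout of the training matrix concrete, here is a small Python sketch that reads a tab-delimited file of that shape. It is an illustration only: the function name `read_training_matrix`, the example segment names, and the RPKM values are made up, and it assumes the file has no header row, which the talk does not specify.

```python
import csv
import io

def read_training_matrix(handle):
    """Parse a tab-delimited training matrix.

    Returns (segments, values): the first column of every row as a string,
    and the remaining columns of every row as a list of floats (the RPKMs).
    Assumes no header row (an assumption of this sketch).
    """
    segments, values = [], []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        segments.append(fields[0])
        values.append([float(x) for x in fields[1:]])
    # Every segment needs one value per experiment.
    if len({len(row) for row in values}) > 1:
        raise ValueError("rows have differing numbers of experiment columns")
    return segments, values

# Hypothetical two-segment, three-experiment matrix.
example = "chr1:0-1000\t0.5\t12.1\t3.0\nchr1:1000-2000\t0.0\t8.4\t2.2\n"
segments, values = read_training_matrix(io.StringIO(example))
print(segments)
print(values)
```

The same function works unchanged if the first column holds gene names rather than genome coordinates, since it treats that column as an opaque label.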
Right, so there's an example training matrix inside the examples folder in SOMatic if you're not sure what the format is supposed to be. There's another file that you need, which is your sample list: basically the experiments that you ran, which correspond to the columns of the training matrix. There's an example sample list in the examples folder too. Oh, and be careful when you're naming your samples, because these are the names that the website shows to people, so you want to make sure they're in a human-readable format.
Then you just run this script that I've created called build site. It's inside the scripts folder, and it takes a bunch of different options. You have to give your SOM a name. You have to tell it where the training matrix is. You have to tell it how many rows and columns you want for your neural network, and those numbers depend on how complex your particular experiment is. For a very simple one, with only a couple of dimensions, you might want a smaller map, but for a super complicated one, like a 96-dimensional single-cell RNA data set, you might want a larger one, maybe 30 by 50 or something like that. There are a bunch of suggestions out there for how big your map should be, but because it's a neural network, the right size depends on your data set, so you want to try a few. Then you give it the location of the sample list file, and you say how long you'd like to train your SOM. We typically run it for about four million time steps, but that can depend on how many segments you're using; you want more time steps if you have more segments in your genome. And because it has a random initialization, sometimes the SOM can get stuck in a local minimum, so you want to run the SOM three or four times, and it will take the one with the best score at the end. That's the number of trials.
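The idea of running several trials and keeping the best one can be sketched as below. This is a hedged illustration: the talk does not say how SOMatic scores a map, so the sketch assumes a mean quantization error (lower is better), and `quantization_error`, `best_of_trials`, and the stand-in trainer `toy_train_once` are hypothetical names, not part of the tool.

```python
import random

def quantization_error(units, training_matrix):
    """Mean squared distance from each training vector to its closest unit.

    Used here as an assumed score; lower means the map fits the data better.
    """
    total = 0.0
    for vec in training_matrix:
        total += min(sum((u - v) ** 2 for u, v in zip(unit, vec))
                     for unit in units)
    return total / len(training_matrix)

def best_of_trials(train_once, training_matrix, trials):
    """Train `trials` times from different random seeds and keep the best map."""
    best_units, best_score = None, float("inf")
    for seed in range(trials):
        units = train_once(training_matrix, random.Random(seed))
        score = quantization_error(units, training_matrix)
        if score < best_score:
            best_units, best_score = units, score
    return best_units, best_score

def toy_train_once(training_matrix, rng):
    """Stand-in for real training: picks two random rows as the map's units."""
    return [list(v) for v in rng.sample(training_matrix, 2)]

if __name__ == "__main__":
    matrix = [[0.0], [1.0], [10.0], [11.0]]
    units, score = best_of_trials(toy_train_once, matrix, trials=4)
    print(units, score)
```

Each trial starts from a different random seed, so a run that lands in a poor local minimum is simply outscored by a better one.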
This program will run five or six different scripts that train the SOM, score it, generate your maps, and generate a summary, which is a sum of all your maps. Then it creates the website out of a TGZ that's also in that folder.
There are a couple of overlays that you can optionally add on top of this, which require some additional files. The first one is a gene overlay, if you're using genomic coordinates. For each unit you'll be able to see what genes are inside the unit instead of just the genome coordinates. For that you run this script called get genes. It should be get genes.sh, but I must have messed it up on the slide. It's also inside the scripts folder. It takes a couple of options, one of which is a gene annotations file. You also need to tell it which GREAT method to use, because it uses GREAT to calculate the genes associated with each genome coordinate. And sometimes a GTF file names its chromosomes in an unusual way. I know the mouse GTF file, for instance, doesn't have chr before the chromosome, just the chromosome number. So there's an option to add chr to those names when it's doing all the comparisons. Otherwise it won't be able to match them, because 2 is not the same as chr2. So here are some directions on how to run this program. You can get this particular GTF file from Ensembl and open it up, and then you just run the get genes script on that GTF file using the example that we have. Again, this is a mouse example.
The next thing you can add is a GO term overlay. The GO term overlay is a little bit in beta right now, but it works okay, so go ahead and give it a shot. You can see the GO enrichments that are in each unit instead of just seeing the genes. So you could see, oh, this area is rich in regulation of heart contraction or something, right?
To do that, you run this particular script. You need to get a couple of files for this script also. One is the gene2go file, which you get from NCBI, and the other is a gene_info file, which you can also get from NCBI for your particular organism. I've also included in the README file, if you don't...
So, a question: what exactly is the output of your program? Is it a text file?
It's a website.
It's a website?
Yes.
All right, cool.
I'll go over what the output looks like at the end.
Cool, thanks.
But the website itself has a bunch of output files inside it. If you go inside the data folder in the website, you can see all the maps, all the genes, and all of the GO terms, and run your own scripts on those. So I built it to be very hackable and open for people to use.
Right, where was I? Yeah, so this is for the GO term enrichment. In the README file for this program, I've included instructions in case you want to make your own gene_info file. If your organism is not supported by NCBI, you can make your own gene_info file pretty easily. The README tells you which columns my program uses, so you can build your own file.
Okay. So I have an example website up on the SOMatic website for you guys to look at. It's at example website. There's also a link to it from the SOMatic HTML page if you want to follow along. So let me go to my webpage here. All right, this doesn't look fantastic because the resolution on these projectors is terrible, but when you start up the website, it'll show you the summary map, which is basically the sum of all of the units from all of your different experiments. So you can see which areas are highly enriched across all of your experiments and which areas are lowly enriched. And because it's a toroid, you can use these arrows on the side of the map to scroll it around, and it will keep that the same across all of the maps you look at.
So whenever you're looking at maps side by side, you know for a fact that the hexagons line up with each other. Say, for example, we wanted to look at H3K4me1 in whole brain. This particular data set was ChIP-seq data from a bunch of different mouse organs and cell types, for four different histone modifications. On the right here, we can see all of the areas on the SOM that are enriched in whole brain for this particular histone mark. So that's kind of neat. You can also compare that with another whole brain map, like maybe H3K27. You can look at these two side by side and see that there are areas that are high in one data set and low in the other.
Say, for example, we're interested in what's happening here, right? We can click on that unit and see exactly what the enrichment level is for that unit. We can see all of the genome segments that have landed inside that unit, and this file, which is up in the URL, is just sitting on your web server. So if you're interested in doing some more statistical analysis on that, that's totally up to you. You can also view all of the genes that are in that unit, and these are all genes that have the same profile across all of your different experiments, which is pretty cool. And you can view all of the GO terms that are enriched in this unit, meaning they occur at a higher rate than average. The first value is the p-value, which is Bonferroni-corrected, and then you've got all the different GO terms that are in that unit.
Another cool feature is that you can go to the GO terms tab, which will download the maps for all the GO terms in your system, and you can look for a GO term that you like. Say, for example, we're interested in regulation of neurotransmitter levels, right? So we can click on that one.
We can go down to the map, and here are all of the units in your map that have that GO term, so you can look at them. Again, the resolution is sort of messing with my mouse clicking a little bit. You can see that it's higher in H3K4me1 in these areas and lower in H3K27 in these areas, and that might mean something to you for your particular experiment, right?
There's one more cool feature, which is these groups. Let me see if that works with the GO term or not, I'm not sure. Anyway, you can put in a name for your group, and this will group up all of the SOMs you have currently loaded. Say I want to load up the other brain SOMs, like that, and then I go to the groups and I put in whole brain, right? And I add the selected ones to the group. Now I can activate and deactivate that whole group of SOMs with just one click. I can also set a minimum and maximum for the scale bars. Currently they're set automatically based on the enrichment in each particular data set, but if you want them to be the same across all of the maps, you can just set a minimum and maximum and then hit set. So let's set a minimum of, like, 0.0025 and a maximum of 0.022 or something like that. Then you hit set, and all of the maps in that group will automatically take those scale bars. There's another cool feature where you can see the average of all of the maps that you selected by hitting the average button, and it will show you the whole brain average, which is the average enrichment over the maps you have selected. So that's kind of cool.
There are also a bunch of figure-making tools. If you want to make a figure for a paper, you can select the square and draw a square, and then you can rotate it, and it'll show up across all of the maps that you've got active. Then you can click on the X to get rid of it and start clicking again.
You can use different colors, so say you want a red triangle over these for your figure. You can just do that and then rotate it to whatever you want, right? So that is the tool.
And then I can go to acknowledgments. I want to thank my lab, especially Dr. Mortazavi, who really helped me with this project. I'd like to thank Ricardo Ramirez and Benny Zhang, who did the experiments that gave me data to run this on, and then everybody in the lab. I'd also like to thank my HudsonAlpha-led ENCODE production group. I'd like to thank the HPC at UCI, and thank you guys for listening. That's it. Are there any questions?
Yes. So many problems with the microphones today. Could we look at your toroidal heat maps again for a second, please?
Sure.
So just in general terms, how many windows wide and how many windows high is each one of those?
You mean the maps?
Yeah.
The maps are about half, or, it depends on the resolution of your monitor, right? This resolution is very small, but normally you can fit about three maps on one screen.
No, no, what I'm asking is, if you're scrolling over the surface of a toroid...
Oh, yes.
How do you know when you've explored all the space?
So this is just a rectangular representation of the toroid. If you take the top and the bottom and wrap them around, and you take the left and the right and wrap them around, you make the toroid, right? So this is the whole toroid that's on there.
Right, but what I'm trying to ask is, if there are no edges, how do you know when you're done looking around?
Well, this is the whole thing, so you're automatically done looking around. The scrolling just lets you put the whole thing where you want it. Sometimes you'll have a map where a feature is half off the screen. Sorry, my computer's a little slow right now. So say you get a map that looks like this when you first start out, right?
But you know that this here is actually here also, so you can click around and put it all in one area to look at, right?
Okay, thanks.
So I don't really understand your question. Is that the answer, right?
I'll catch you later.
Okay.
So can we do a subtraction between any two groups?
That is an upcoming feature.
Okay, and the second thing is, can we parallelize it, at least so you can run the different trials simultaneously?
That's a future feature. Yeah, multi-threading and so on is all coming in the future. As it is, the training is actually pretty quick. The thing that takes the longest is adding the GO overlay, but that should be multi-threaded also, so I should change that.
And at our place, we cannot have directory listing enabled.
Say that again?
We cannot have directory listing enabled. So is there any workaround?
I'm going to try to figure out a way around that, but sometimes web servers have problems with programs running on them and poking around; there are security issues and so on. So I have to look into that and see if it's possible. If that's a serious problem, you can actually run it locally on your own computer with Apache, on localhost.
All right, thanks.
You mentioned the source of the Gene Ontology files?
Yes.
NCBI has nothing to do with that. It comes from a consortium that I and others founded about 18 years ago called the Gene Ontology Consortium. You can go to geneontology.org.
I take your OBO files from that website. But the actual conversion between GO terms and gene names...
You get it from NCBI?
Yeah, because they've got a bunch of different organisms already created for that purpose, right?
Yeah, which the Gene Ontology Consortium has created.
Oh, okay. Sorry.
Are there any other questions? Okay, if there are no more questions, let's thank all the speakers in the session.
Oh, oh, you mean how the hexagons are connected in the map?
Yeah, so the hexagons that are near each other are closer together in the experiment space, so their profiles are more similar. That's just the standard approach: it uses the vectors from the training matrix and trains on them, right? So the training matrix itself creates the profiles.
Okay, let's thank the speakers. Thank you.