Right, YouTube says I'm live, so let's do a quick audio check. If you can hear me, then throw something in the chat so I know that people can hear me. I've got myself open on another screen as well, so I hope that doesn't play, or at least doesn't play audio. It doesn't seem so. So welcome, welcome. Today we will be talking about RNA sequencing. I did a very quick poll on my YouTube channel and RNA sequencing came out on top. I was hoping that a few more people would have voted for the D programming language, but apparently it's not as popular as expected. So let's switch to me, so you guys can see me as well.

So, building your own pipeline from scratch. I'm really excited about it. I actually had to do this three weeks ago because I moved, so instead of being in Berlin I'm now in Newcastle, at Northumbria University. Still wearing my Humboldt shirt, but that's okay. So, yeah, let's just start setting up our own pipeline. I've got a completely new setup, so I hope it's audible. "Stick to your roots." Yeah. No, I hope it's audible and it's good, because I've been fidgeting a lot with the whole setup. It's difficult, right? You have all of your equipment set up just the way that you want it, and then all of a sudden you move somewhere else, so everything has to be packed and everything has to be unpacked. So I hope everything works and that it's going to be good. If there are any problems with the audio, let me know as soon as possible, because then I can fix it.

All right, so I think the main thing today will be setting up a Linux environment, and we're going to go through it step by step. I want this to be a very useful tool for other people when they set it up. Since I'm streaming from Windows, I of course have to set up VirtualBox and these kinds of things first, but I will take you guys through all of these steps.
So we will actually be installing Linux as well, and then we will start setting up every tool that you need. There are going to be a lot of tools. So if you want to follow along with me, I've put all of the code and scripts that I'm going to use today in a gist.

"Hi, sir, I've done RNA sequencing pipelines in my Master of Science and we did it all on Galaxy, the website. I am following your course to learn how to do it with command lines and be able to value this knowledge." Yeah. No, I think that there are many different ways of doing it. I posted it on Reddit as well, and people there said, oh, why aren't you using Nextflow, or WSL 2, which is the integration with Windows. I just decided to do it this way because it allows me to guide you guys through the whole process.

That being said, I don't think that we will align a single read today, because the setup itself takes a long time. There are a lot of different tools that we need. I hope that we can finish today with indexing the genome, that would be good, and of course perhaps downloading some reads from the Short Read Archive for the next step.

Build our own pipeline. So like I said, if you want to follow along, this is the link. Let me put the link in chat as well, so that people can just click on it. So this is the link to the gist. There are three files in there, and it will allow you to copy and paste the commands, since the PowerPoint is not available yet. I will make the PowerPoint available later on, and then you guys can just click on the links in the PDF file. But for now we're just going to do it like this. That's it.

Good, so when we want to start our own Linux environment in Windows, we have multiple options. We can take things like QEMU.
We can do VirtualBox. There's the Windows Subsystem for Linux as well. But I thought it would be good to just start off with VirtualBox, because it's relatively easy and it nicely containerizes the Linux system from the Windows system, and in that sense it's also very clear for you guys.

So for today, I decided to go with Debian. Debian is a slightly unusual operating system to use for bioinformatics, I think, because generally people use either Ubuntu or CentOS. Those have many more tools built in. But since I didn't want to use any built-in tools, but wanted to download and install all of the stuff from scratch, like the title says, building your own pipeline from scratch, we will be starting with VirtualBox and Debian.

So let me actually switch my layout so that you guys can see my desktop. I hope this is clear, I hope it's visible. I've downloaded a couple of files already, just to prevent long download times. So I have VirtualBox, I have the VirtualBox extensions, and Debian. There is the net ISO, which is a very small file; it downloads everything during installation. I would advise anyone trying this at home to use the net ISO, because then you can just download all of the packages as you go. But for myself, I've downloaded this 3.8 gig image, so the installation doesn't have to use the internet and it just goes a little bit faster.

A comment in chat: "This is a new field for me, but our lab is ordering Oxford Nanopore. It's a good opportunity to learn from you." Well, yeah. If you have nothing set up, then this is the lecture for you, so to speak.

All right, so let's quickly go through the PowerPoint. We will be using VirtualBox, and I will be installing Debian. Of course, you can use any Linux version that you want, but I went for Debian because it's a little bit different. There are a couple of quirks which we will get to. But remember, every time that you see Danny somewhere, right?
If you're installing Linux, you have to choose a username, and of course you are free to use your own name. You don't have to use my name for it.

Good, so let's switch back to the desktop and we will first start installing VirtualBox. So just double-click the installer. I'm going to press yes, because it asks me if I want to run it as an administrator, and I'm just going to say next. I think everything is fine. We're just going to leave everything on the default settings and the default location. I do not want a shortcut on the desktop. It can be in the quick launch bar and it can be in the start menu, no problem. The networking features will reset your network connection temporarily, so I hope this works and doesn't break the stream, but let's see what happens. All right, and then we just press install and it will start installing. I'm hoping that it's going to automatically reconnect the stream and not cause any major issues. It seems to have no issues. At least I hope I'm still live.

All right. So once we've installed it, we get this Oracle VM VirtualBox thing, and we have to start making our own virtual machine. For that, I'm going to go to Machine and say make a new machine, and the name of it is going to be Debian, because then it will recognize that I'm going to install a 64-bit Debian. Here you can choose the machine folder. So we will do a whole virtual machine.

"Is there an advantage of dual boot over a virtual machine?"
Yeah, dual boot gives Linux direct access to your CPU. By doing it like this, you're in an environment within Windows, so it will be slower. Not that much slower, VirtualBox is pretty good, but it's not as close to the metal, so to speak. If you would do it on bare metal, it would be quicker. And not only that, but you can use all of your different CPUs. That's one of the disadvantages of using VirtualBox: I have to select how many CPUs I'm going to use, and of course I'm streaming as well, so I'll probably leave it at one or two.

All right, so recommended memory. I'm going to allocate a little bit more, so I'm going to use four gigs of memory. I have enough memory in my machine, so that shouldn't be an issue. I want to create a virtual hard disk now. And then what do I want to do? Well, let's just go for the VirtualBox Disk Image. The other two formats allow you to use the disk in other virtualization software, but in this case I think it's fine, because we're just going to use it here.

I'm going to do a fixed-size allocation. This will take a little bit more time up front to create the hard drive, but it will be faster while we are using Linux, so I think that will be good. Here we have to specify how much disk space we want for our virtual machine, and this should be as much as you can spare, but for the sake of this tutorial I will do 32 gigs. So it's going to take up 32 gigs of my hard drive. When I press create, it will start making a big file on my hard drive, which is the virtual hard drive where everything is stored.

All right, so this will take a little bit of time. This is one of the things that you will have to get used to when you do bioinformatics: everything will take a little bit of time, and all of these little bits of time will add up. It's just two minutes remaining.
It's not going to be that bad, but it is going to take a little bit of time. If you had selected dynamic allocation, it would finish instantly, but then every time you write a file in Linux, it has to make the hard drive a little bit bigger, so you get a lot of slowdown in the end. So we'll just wait for this process to finish, because it's just the way that it is. This is something that is very, very common in bioinformatics: you're going to sit around a lot of the time just doing nothing, talking to colleagues while you wait for the computer to finish. And of course in this case it is also affected by the fact that I'm streaming.

"Can I boot my PC into Ubuntu and follow?" Yeah, of course. If you are in Ubuntu, that works perfectly well; you don't have to set up VirtualBox and Debian. Ubuntu also uses the apt-get system, so it has the same package manager and everything should be perfectly fine. So while we set up the virtual hard drive and the virtual machine, just boot into Linux, or boot into Ubuntu, and follow along when we start installing programs there.

All right, one minute 39 seconds remaining. I tested it out yesterday and without streaming it was a little bit quicker, but of course now OBS is also taking resources.

"Does this installation process work with Mac?" With Mac you can actually do the exact same thing. Just download VirtualBox; you have to download the Mac version, of course. While this is running, I can pull up Firefox. So that is me, which is a little bit annoying. But if you just type in VirtualBox and go to the VirtualBox site, it says Oracle VM VirtualBox, and in the download section we can go to Downloads. Here you have the Windows host and the OS X host, so just download the OS X version. Remember that we also need the extension pack.
So the extension pack is here So you can download the extension pack from there and the extension pack will allow you to copy paste From one operating system to the other Which is really really useful. So there's there's different Different builds available. I am using virtual box 6.0 So if you just go to 6.0, then you can see all of the different downloads options here So if you are on mac, you can use the os x if you are on windows use the windows one and for all of the different Distributions of linux, there's also linux versions, but if you boot into linux Then you can just use the linux one So this is the one that we can download. Um, let me actually show you where you can get the uh, W net easel as well. So you just type in W net easel and um, it will tell you The first google link says minimal install from cd And then you can download the cd here. Um, just take the amd 64 version, which is the 64 bit version And then it's like 300 mbs of download. So it's a little bit of downloading Virtual box itself is also like 100 to 200 mb All right, very good. So let's go back to the desktop and it's almost done All right, so once it's created the virtual hard drive You can see that it actually directly switches to debium, right? So now we have to make some settings because um, I want to go to settings and I want to kind of update it a little bit So I go to advanced and I say shared clipboard I want to have this bidirectional which means that I can copy things in ubuntu or in debium and Paste it in windows or copy things in windows and paste it into the virtual box This cannot be done directly. We first have to set up some other stuff And these are this is where the guest software and the extension tools come in We go to system. 
This all looks fine. I do go to the Processor tab and say start using two processors, just so that the installation runs a little bit quicker. You can also enable PAE/NX, which will make the virtual machine run a little bit closer to the metal, but I'm not going to do this because it has an impact on the streaming performance.

After we set this, we can go to Display and say, well, use a little bit more video memory, because we are going to start a desktop. There's only one monitor, which is fine. I'm going to set the graphics controller to VBoxVGA, and this has to do with the fact that otherwise your mouse cursor might be a little bit messed up. I'm going to enable 3D acceleration, although it might impact streaming a little bit, just so that the operating system runs a little bit faster.

Now the most important part is when we go to Storage. We can click on Empty here, right? We see that there's a CD drive in the virtual machine, and the CD drive is currently empty. So what am I going to do? Well, I'm going to mount the ISO. That would be the Debian net ISO that I just showed, the 300 MB file, but I am actually going to mount the 3.6 gigabyte file because it has all of the packages. So I'm just going to say choose a virtual drive, go into my downloads, and select the DVD. When I click open, it tells me that the CD-ROM drive in the virtual machine now has the Debian ISO mounted.

All right, so that's everything for now, so we just press OK. It tells me that there are some invalid settings detected at the bottom. What is that? No, it seems to be okay. All right, no problem. So we just press Start and it will start the virtual machine, and that will take a little bit of time. So here we go.
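The same machine could also be set up from a terminal instead of clicking through the dialogs. This is only a rough sketch, assuming VirtualBox's `VBoxManage` command-line tool is on the PATH. The VM name `debian`, the disk file `debian.vdi`, the controller name `SATA` and the ISO file name are placeholders made up for this example, not something shown on stream. By default the script only prints the commands (a dry run), so nothing is created until `DRY_RUN=0` is exported.

```shell
#!/bin/sh
# Sketch: command-line equivalent of the GUI steps (4 GB RAM, 2 CPUs,
# 32 GB fixed-size disk, installer ISO in the virtual CD drive).
# VM_NAME, DISK and ISO are hypothetical placeholders.
VM_NAME="${VM_NAME:-debian}"
DISK="${DISK:-debian.vdi}"
ISO="${ISO:-debian-netinst.iso}"

# Print each command in dry-run mode (the default); execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_vm() {
    # Register a new 64-bit Debian machine.
    run VBoxManage createvm --name "$VM_NAME" --ostype Debian_64 --register
    # Memory is given in MB.
    run VBoxManage modifyvm "$VM_NAME" --memory 4096 --cpus 2
    # 32768 MB = 32 GB; "Fixed" allocates the whole file up front.
    run VBoxManage createmedium disk --filename "$DISK" --size 32768 --variant Fixed
    # One SATA controller carrying both the hard disk and the installer ISO.
    run VBoxManage storagectl "$VM_NAME" --name SATA --add sata
    run VBoxManage storageattach "$VM_NAME" --storagectl SATA --port 0 --device 0 --type hdd --medium "$DISK"
    run VBoxManage storageattach "$VM_NAME" --storagectl SATA --port 1 --device 0 --type dvddrive --medium "$ISO"
}

create_vm
```

Run as-is, this just lists the six commands so they can be checked first. Clipboard and graphics settings are left out on purpose, because the option names for those differ between VirtualBox versions; setting them in the GUI as shown on stream is the safer route.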
This is our virtual machine. So this is our computer in a computer, and here we are presented with the Debian graphical installer. We could install it via the non-graphical installer, but for the sake of completeness, or beauty, we are just going to do the graphical install.

"Can this analysis be done in the Windows Subsystem for Linux as well?" Yes, you can do this in the Windows Subsystem for Linux as well, but then you have to enable it and install it before you can use it. After all of the things that we will be doing in VirtualBox, the same commands can be used in the Windows Subsystem. Make sure that you use a Debian or Ubuntu type of operating system, because most of the commands that we will run use the apt package manager, so you have to have the apt package manager. It's not a big deal if you use Fedora or something else which uses a different package manager, but then the names of the packages might be a little bit different.

All right, so I'm going to take English as my language, because I speak English quite well. I am located in the United Kingdom, so I'm going to select that. I'm not going to use British English, I'm going to use American English, because my keyboard is American English. I have a standard US 101 keyboard layout. Of course, if you have a different keyboard, select the keyboard that you have.

All right, so then it will start scanning and doing some additional setup, and then it will continue with some more questions. The network was detected, hopefully. Yep, perfectly fine. So it's doing the configuration of the network, and this should all work out perfectly fine, because I'm just connected with an internet cable. The hostname can be debian, or you can choose whatever you want.
You can choose a different hostname, like my RNA-seq machine or whatever, but I'm just going to go with the default. The domain name, just leave it empty, since I'm not on a domain. This matters, for example, when you're inside a university and your university has a top-level domain, but in this case I'm at home, so I don't have a domain running here.

All right, so now we have to type a root password. I'm just going to go with the default password of one two three four. Oh, I have to turn on Num Lock. One two three four. In theory this doesn't matter too much, right? Just don't leave it empty, because it will complain about that. So I'm just going to take a password which is easy to remember. All right, then it asks me for the full name of the new user. Well, I am Danny Arends, so we'll just fill that in.

"Hello, I am trying to install the msa package in my RStudio, but I'm encountering an error. I've Google searched, but all to no avail. Can you please help me out?" Yes, send me an email with the warning and error that you're getting. We're not going to do it now, because we're going to set up an RNA-seq pipeline and that's going to take enough time already, but I can help you out. Just drop me an email on my Gmail.

Good, so then we have to select the username. The username danny is going to be perfectly fine, and I have to choose a password for myself. I'm going to take the one two three four password again, and I'm going to verify that with one two three four. All right, continue. Setting the clock, detecting the disks, that all should be perfectly fine, and then it will start the partitioner.

The partitioner is a big deal in Linux. Everyone likes to set up their Linux machine in their own way, depending on what their preferences are, but I am just going to say guided, use the entire disk. I created this 32 gigabyte disk and we're going to use all of it. I'm not going to dual boot. So in case you're installing Debian or Ubuntu yourself on your Windows machine, then
this is the critical part, because this is the part where you could overwrite the current Windows system that is on there. So if you're making a dual boot, then at this point you have to be very, very careful. But since we're in a virtual machine, I can just say use the entire disk. Then it tells me that there's one disk drive, which is 32 gigs big. I'm just going to say continue, and I'm going to take all of the files in one partition. Normally, if you would do a dual boot system, you probably want to separate the /home, /var and /tmp partitions and resize them the way that you want to. But in this case I'm just going to use the whole thing as one partition, and that's going to give no issues whatsoever.

Then I'm going to click on finish partitioning and write changes to disk, and I have to confirm that by clicking on yes. If I leave it on no, the default setting, then it unfortunately will go back to the partitioner. So I'm just going to click yes and press continue, and then it will start partitioning the disk and it will install the base system. So it will install a very minimal Linux on the hard drive that we just created.

All right, this will take a little bit of time. Not too long, but it's just going to take a little bit of time. And again, the more CPUs you allocate to your virtual machine, the quicker this will run. But I only allocated two, because I also need a CPU for rendering the stream and I also need a CPU for running Firefox so that I can see you guys chatting. So it's going to be a little bit of a problem. That's one of the things, of course: if you want to do RNA sequencing for realsies, so to speak, right?
Then of course install Linux on your system directly, so that Linux can use all of your hardware. When you're aligning reads, you can then use all eight cores, or 12 cores, or however many cores you have in your system. Currently, because I'm running in VirtualBox, two CPUs is the maximum that it can use, so it's going to be a little bit slower and a little bit limiting.

All right, so installing the base system. I actually notice a slowdown compared to yesterday when I was testing it out, so it is not as good as... All right, so it's installing AppArmor, which is the default security module, and then a few extra packages, and then it should all be fine.

All right. So now it asks me if I want to scan for additional media, but I already have the whole 3.6 gigabyte image, so I'm not going to scan for additional installation media. It could be that you have a two-CD or three-CD installation. Then you would click yes here and put a new CD-ROM into the CD-ROM drive, but we're not going to do that. We're just going to continue.

Use a network mirror? Yes, I am going to use a network mirror, because we do want to get the latest updates for all of the packages. Plus, by using a network mirror, it's also going to set up my eventual installation to use the network. All right, so we have to say where we are. I'm going to say I'm in the United Kingdom, and then it tells me that this one is the closest one, which is perfectly fine.
So we're just going to use that. We do not need an HTTP proxy. In case you're behind a proxy server, fill in the information here, but I'm directly connected to the internet, so I don't have to. So it's updating apt, which is the package manager, and then it will present me with a choice, and this choice is going to be: what do you want to install? It will retrieve some files from the internet to make sure that all of the packages are at the latest version.

Do I want to... no, I don't want to participate in the popularity contest. You can, of course, if you want. So I'm just going to install the desktop environment. We're going to use GNOME for that, and we're going to install all of the standard system utilities. The default settings are fine, so I'm just going to press continue, and now it will start installing Linux. It will retrieve all of the files. If you are using the net image, then this part is going to take a long time, because it needs to download all of them. In this case, most of the packages are on the DVD, so it finds all of them there, and as you can see, it should now download four additional packages, which are updates, because the DVD that I downloaded is a little bit out of date. So let's hope that this doesn't hang too long.

All right, and that has just failed. That doesn't matter. We can just press continue, and I just want to install the software, so don't worry about this part. Do I want to install security updates?

"I understand how busy you are. Can you help me with a tutorial co-expression project using plants in the acse study?" I might, but like I said, I'm streaming now, so I'm trying to focus on the RNA-seq thing. Just send me an email. I will definitely reply to all of the emails that I get, and based on my time we can see what is possible. So no guarantees, but I'm always interested.
I do not want to participate in the popularity contest. Yes, this is what I want. I don't know why it failed to download those four additional packages, but we'll just try it again. So now it directly found them, so it was probably just a very temporary internet issue, which sometimes happens, right? If you're installing and it uses the network during the installation, it might not find all of the packages in one go. It might be that there's a small hiccup on the Debian side. But at least we're installing now.

I checked it yesterday: it will take around five to six minutes. So of course we can have a little bit of music while we wait. I'm just going to start some background music and then... oh, that's wrong. That is really, really wrong. Why are you playing on the wrong audio output? You should be playing on this one. And then a little bit of music. That should be a lot better audio-wise. All right, so do your thing.

And again, like I said, bioinformatics is a lot of this. It's a lot of waiting, so make sure that you always have your favorite playlist somewhere, so you can just put on some headphones and listen to some music while you wait. This one is actually pretty nice. I'm using the Stream Deck to play music, which allows me to play copyright-free music, which is really nice. I really love this Stream Deck. I'm so happy that I got it.

We're already 50% done, so it will take a little bit more time. If you already have a Linux machine, then of course this part is going to be a little bit boring, but if you don't and you just want to follow along, then this is the time to catch up. Just give it four CPUs and it will install a lot faster than mine does. And of course, once we have everything recorded, I will probably cut out these pieces.
So when the video of the live stream goes up, of course we're not sitting there waiting for like 10 minutes for Debian to install. That doesn't really make that much sense.

So yeah, if you're in chat, let me know, we can talk about anything now. About the msa package, like I said, just send me an email. I will always reply to emails. If there are any other questions so far, just throw them in the chat, and if you're just here hanging out, then throw something in the chat as well. It really helps with the YouTube algorithm. The more people are active in chat, even just saying random stuff like "RNA sequencing", the more it helps YouTube to say, oh, this stream is doing really well, there are a lot of people interacting. And I'm trying to get monetized, and I'm almost there. So unfortunately we don't have things like super fans and all of these things in the chat yet, but I'm relatively close to monetization now, so I'm hoping that you guys can support me.

I've been very, very happy with the whole YouTube thing so far. It's been overwhelming, the amount of positive reactions that I've gotten to the lectures that I posted. That actually made me decide to do this, right, to continue streaming in my free time and teaching you guys things like this RNA-seq pipeline. And I'm still thinking about doing a D programming language course, because the D programming language is a really, really good language. I really like it, and I think everyone should learn a little bit of D. And it's nice because I'm Danny, so it's good that the programming language starts with the same letter as my first name.

Yeah, so the streaming does have an effect. When I did it yesterday, it took around five and a half minutes to install Linux. But it's still not too bad, right?
Imagine installing Windows. Windows on average takes like half an hour to an hour to install.

"What will be your advice for a plant breeder that wants to focus more on computational biology and bioinformatics at PhD level?" Well, what kind of advice are you looking for? I think that it's always possible to focus on computation, because a lot of the research groups that you will end up doing your PhD at have a need for bioinformatics. Nowadays anyone who can program and is working in biology is a very, very valuable asset. I would learn a lot of genetics, because in plant breeding genetics is key, right? Genetics is very important. I would focus on statistics, because in plants the large sample size that you have is a big advantage, so it will allow you to do some very, very fancy statistics.

"What skills will be valuable?" Well, Linux, shell scripting, a little bit of R or Python. One of the two, it doesn't really matter too much. But I would also look to learn a little bit of C and C++ programming.

All right, let's stop the music. We're done. It's installing the bootloader, so we can restart soon. But those are the skills, right? The best skill that you can have is just being able to program, and it doesn't matter if you're using R or if you're using Windows or something else. Shell scripting is really useful, as is the ability to use Linux. One of the things that is a very good addition as well is high-performance computing, so learn something like a queue submission system, like Slurm or qsub or the other systems which are in use on clusters.

All right, so it seems that the installation is complete. Install GRUB? Yes, we want to install the bootloader. We want to put it on the hard drive, and then we just press continue, and then we should be more or less done. There are still a couple of little things that it will finish up, but then we have our first Linux system running. Very good.
And of course, this is a really nice way to learn Linux, right? Putting it into VirtualBox allows you to play with it without having to risk deleting Windows or these kinds of things.

All right, so the installation is complete. We'll just press continue and it will finish and reboot. Fortunately for us, VirtualBox will actually notice that we have installed an operating system, so it will automatically unmount the installation drive. So we will just choose Debian, and the first thing that we're going to do is actually boot into Debian and then shut it down, because we now want to install the VirtualBox extensions, to be able to use higher resolutions and have drag and drop and these kinds of things working. Num Lock, one two three four, that's our password, and then we are in Debian. It looks really nice, but we are going to shut it down first. So I'm just going to say power off.

All right. So now in VirtualBox, I want to install the extension pack, right? I told you guys to not only download VirtualBox, but also the VirtualBox extension pack. So I'm going to go to File, Preferences, and then I'm going to go to Extensions and press the plus. Then I go to the software that I downloaded, and here we have the file that you have to download as well. It's the Oracle VM VirtualBox Extension Pack 6.1.38, and this has to match the version of VirtualBox that you're using. So if you're using an older version of VirtualBox, then download the corresponding extension pack. I'm just going to install it. I have to scroll all the way down to press I agree, and then it will say installed successfully. So there it is. All right.
So I'm going to press OK, and then we are going to start our Debian system again. We have now installed the extension pack in the host operating system, and now we are going to install the matching guest additions into the virtual machine as well, so that the two can talk to each other and we can do things like copy and paste, but also make the screen bigger and run in any resolution that we want.

So my password is one two three... oh, that's wrong. One two three four. That's it. All right. So here we have our Debian, and now we are going to go up here and say Devices, Insert Guest Additions CD image. Now when we go to Activities and then to Files, we see that we have an additional CD-ROM mounted, and this has the guest additions.

So now we run into one of these little quirks. Let me actually go back and show you guys the PowerPoint. The problem here is that by default, when Debian installs, it does not give us sudo rights. We are just a regular user. So we can do two things, right?
We could always say, well, I'm going to use root for everything, which you should not do, or we can make ourselves sudoers, which means that if we want, we can execute commands as if we were the superuser.

So let's just go back, and I'm going to put ourselves in the sudo group. I'm just going to say Activities and search for terminal, and the first thing that I'm going to do is add the terminal to my favorites, and then I'm going to start it up. So next time when I go to Activities, the terminal is going to be here. And don't worry, once we've got the extensions working, we can scale up this window and have a normal resolution instead of 800 by 600.

So I'm going to switch to the superuser with su. The superuser password is the one that it asks for now, which in our case is one two three four, the same as my own. So now we are root, and now we can do anything on the system. I'm going to say usermod -aG, and I'm going to add the sudo group to my own user. If I wanted to check this, I could issue another command. I forgot the command that I can use to check this, but it should be okay. So I'm going to say exit, and then I'm going to reboot the virtual machine so that it updates my user account. Because I'm logged in, it's better to just quickly restart the whole machine to make sure that my user account gets the sudo rights. And it starts quickly enough, so that's not an issue at all. It's one of these things that I love about Linux: Linux is really, really fast and really responsive in these kinds of things.

Good. So I'm going to enter my password, and now when I open up a terminal, which I put here, I can just say sudo su. It tells me that I have to think and respect. I'm going to give my own password, and now you can see that I can directly switch and be the root user. So I'm going to say exit, and we are going to go and install the VirtualBox guest additions.
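For reference, the group change above comes down to a couple of commands. This is a minimal sketch, assuming a Debian-style system where the admin group is called `sudo`. The helper name `in_group` is made up for this example, and `danny` is simply the username used on stream. The membership check that was skipped on stream can be done with `id` or `groups`, both of which list the groups a user belongs to.

```shell
#!/bin/sh
# Sketch: add a user to the sudo group and verify the result.
# "danny" is the username from the stream; substitute your own.

# Succeed if user $1 is a member of group $2.
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

# As root, the actual change would be:
#   usermod -aG sudo danny
# -a appends to the user's existing groups; -G names the supplementary group.
# On recent Debian releases a plain `su` may not put /usr/sbin on the PATH,
# so use `su -` or call /usr/sbin/usermod if usermod is "not found".
# Log out and back in (or reboot) so the new membership takes effect.

# Check the current user:
if in_group "$(id -un)" sudo; then
    echo "$(id -un) is in the sudo group"
else
    echo "$(id -un) is NOT in the sudo group"
fi
```

The script itself changes nothing; the `usermod` line is left as a comment so that it is only ever run deliberately, as root.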
So let me see if it still has the CD mounted. I'm going to go to Files. So the VirtualBox CD is still mounted, as you can see. So we're just going to go there, right? Let me show you the slide first, just to make sure. So the extension pack, we did that. So I'm just going to set up the Guest Additions. I'm going to go to the CD-ROM and then I'm just going to execute this command to make sure that inside of Linux it also installs the additional software. So let's just do that. I'm going to say cd /media/cdrom0. I'm going to do ls, and it did not mount the CD-ROM properly. Let me actually see why not. It might have mounted it under a different name. It might not be cdrom0; it might be cdrom1 or cdrom2 or something like that. Anyway, let's just move this aside and see if it mounted it under a different CD-ROM. It did, so it mounted it on cdrom instead of cdrom0. So there are sometimes little differences, and that's just because of little changes to the system. So we're going to issue the command. We're going to say sudo, so run as the superuser, then sh for the shell, and then within this shell execute VBoxLinuxAdditions.run. All right, so it will install it, and now we should be able to copy-paste from Windows into Linux. So let me quickly test that. Let me get a screen like this, and I'm just going to try to copy "code available" into this window by saying right click, and it doesn't work yet. That's a shame.
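Since the mount point turned out to be cdrom rather than cdrom0, a small check like the following could save some guessing. This is only a sketch: find_additions is a hypothetical helper, and the list of candidate directories is an assumption based on the names mentioned above.

```shell
#!/bin/sh
# find_additions DIR...: print the first directory that contains the
# Guest Additions installer (hypothetical helper).
find_additions() {
    for d in "$@"; do
        if [ -f "$d/VBoxLinuxAdditions.run" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

# Candidate mount points are assumptions; adjust them to your system.
if dir=$(find_additions /media/cdrom /media/cdrom0 /media/cdrom1); then
    echo "Installer found. Run: sudo sh $dir/VBoxLinuxAdditions.run"
else
    echo "Guest Additions CD not found under /media/cdrom*"
fi
```

The script only prints the install command, so you can still decide yourself when to run it. As the installer output says, a restart is needed afterwards before the new kernel modules are used.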
I'm going to go to Devices, Shared Clipboard. It says Disabled, so Devices, Shared Clipboard, Bidirectional. I'm not sure why it doesn't allow me to copy. Nope, unfortunately it still doesn't work. All right, but it was, oh, it says here: running kernel modules will not be replaced until the system is restarted. So let's do a restart then, just to make sure. So we're going to go here, we're going to say Power Off and restart the system. I'm just going to keep this on the side so I can see if copy-pasting works. All right, so let's log in again. All right, open up a terminal. That's there, and I'm just going to copy and paste. Yay, perfect. So that works, and one of the things that will also work now is the display settings. I want to have a bigger resolution so you guys can see it a little bit better. I think that one should be okay. All right, and then this one can go back to the other screen. All right, so now we have our working Linux. Very nice. Good, so that is part number one: getting Linux installed. Of course, if you already have Linux installed, then it's not going to be that much work, but if you're under Windows, then this will allow you to install it in VirtualBox. All right, next part. Let's go back and see what the next step is. Good, so the RNA-seq pipeline looks like this. If you're very interested in RNA-seq and all of these things regarding RNA-seq, I already did a lecture about RNA sequencing in general and about how sequencing works, so those are in my bioinformatics course. This is a slide that I took from my bioinformatics course, and I just list here all of the different things that need to be done, and then here I'm listing the programs that we need for this. So, read trimming: when we have an RNA-seq read, we need to snip off the bad-quality parts. We also need to snip off the adapters, because when you do RNA sequencing you generally ligate adapters to your little piece of DNA or RNA
sequence, and we're going to use Trimmomatic for that. For the alignment we need to use a splicing-aware aligner, so we can't just use BWA. BWA is a DNA aligner. In this case we are going to use the STAR aligner. The STAR aligner requires two different files as input: one of them is the genome and one of them is the transcriptome. So it is aware that reads in RNA-seq do not directly come from the genome, but sometimes span an intron-exon boundary. We will come to that when we are there. We need to remove duplicates. PCR is one of the steps, so we need to remove PCR duplicates from our data, which we can do using Picard tools. We need to do indel realignment and base recalibration. That means that if we have a known SNP in the genome, we need to be able to handle it, and we need to make sure that reads that have a known SNP in them do not get penalized for that. And then we're going to use bedtools to extract read counts. We can then use GenomicFeatures in R to compute RPKMs, and then we're going to use preprocessCore for normalization. And of course, all of these parts are more or less flexible, right? You could choose a different aligner. You could use a different tool to do indel realignment. You could use a different tool to extract your read counts. But this is kind of the pipeline that I wanted to set up, because this worked relatively well for me in the past. There are tools that you can swap in and out. There are many different tools which do adapter trimming or read trimming, and there are many different aligners, but that is one of the nice things that I wanted to show you guys: it's very flexible. So you can move things out.
You can move things in. It's kind of a modular system in a way. But let's get started, right, because there are a lot of tools that we still need to install. And of course, one of the things that's not on the list is the SRA Toolkit. We will use the Sequence Read Archive to download data, and for that we need the SRA Toolkit, and that will of course also be installed. So here I have a slide. It looks a little bit messed up because I clicked some of the links, but when I put the PowerPoint online, this will allow you to click on the corresponding one and go directly to the page where there's more information, or directly to the download page. So this is just a slide for you guys: when you download the PDF, you can just click on the name of the program, go there, read a little bit more about it and directly download the tools. But of course, we will also download them now. Good. So fortunately, R can just be installed via the Debian package manager. We can just do a sudo apt install r-base, which will install R. So let's do that now, right? Let's go and get the install software file. So I'm just going to say sudo apt-get install. I need to get a terminal. Fortunately, I can now copy-paste, so I'm going to just put my terminal here, make it a little bit bigger, and paste in the command. I have to give my password. It's my sudo password, which is one two three four. And it will just say, I'm going to install a whole bunch of things, so I'm going to say yes. And now it complains, right? I already had that on the slide: sometimes it will complain about the fact that the CD-ROM is not there. So we have to disable the CD-ROM, because at this point I don't want to use the CD-ROM anymore, and I'm not going to mount it again. So I'm going to change that, right? If we look at the PowerPoint, like I already said, sometimes it complains about the CD-ROM, so we can update the sources list.
So the sources list is where it will get its data from. It is just a list of, more or less, URLs and paths which it will search for software. So let's do that. Let's say sudo nano /etc. Oh, let me actually switch you guys back to here. And then it is going to be apt, which we are interested in, and then it is sources.list, so /etc/apt/sources.list. And I'm just going to update that. I'm going to find the CD-ROM line, which is here, and I'm just going to put a hash sign in front of it. I'm going to press Control-O for writing it out, and then I'm going to press Control-X for exiting. Good, so now we execute the command again, and now it should not complain about the stupid CD-ROM, because we disabled it. So it will check all of the different online repositories, it will download all of the tools and start installing them. All right, so it will take a little bit of time again, but I told you guys, that's bioinformatics for you. So the next step is to install GenomicFeatures and preprocessCore. GenomicFeatures is used to extract read counts all the way at the end of the pipeline, and preprocessCore is something that we will use for normalization. But these two packages have a lot of dependencies, so we have to install three different things. We have to install libssl, which is a development package that allows us to compile stuff which requires SSL connections. libxml2 is there so that we can read XML files, and libcurl is there, in the OpenSSL flavor, because we want to make connections to network and internet tools. All right, so let's switch back and install those packages. You can see that R was installed successfully. All right, so I'm just going to copy-paste all three commands in, which might not work properly, but I'm just going to install these three packages, and they're relatively small, so that will be really, really fast. All right, so now we can start R. Let me switch to the PowerPoint. So the next thing is installing the R packages, right?
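The sources list edit above was done by hand in nano, but the same change could also be scripted. The following is a minimal sketch under a few assumptions: comment_cdrom is a hypothetical helper, the two sample lines are made up for the demonstration, and the package names in the trailing comments (libssl-dev, libxml2-dev, libcurl4-openssl-dev) are presumably the three development packages meant here.

```shell
#!/bin/sh
# comment_cdrom FILE: put a '#' in front of every active "deb cdrom:"
# line in FILE and keep a backup as FILE.bak (hypothetical helper).
comment_cdrom() {
    sed -i.bak 's/^deb cdrom:/# &/' "$1"
}

# Demonstration on a temporary copy, so the real file is not touched.
# Both sample lines are invented for this example.
demo=$(mktemp)
printf '%s\n' \
    'deb cdrom:[Debian GNU/Linux]/ stable main' \
    'deb http://deb.debian.org/debian/ stable main' > "$demo"
comment_cdrom "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"

# On the real system this would be roughly (not executed here):
#   sudo sed -i.bak 's/^deb cdrom:/# &/' /etc/apt/sources.list
#   sudo apt update
#   sudo apt install r-base
#   sudo apt install libssl-dev libxml2-dev libcurl4-openssl-dev
```

The demo prints the CD-ROM line with a hash in front and leaves the online repository line as it was, which is the same result as the manual edit.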
So we have to start R, and we're going to start R as a superuser, right? Because we want to install the GenomicFeatures and preprocessCore packages not just for ourselves; we want to install them for all of the users of our virtual machine, in case we make additional users. All right, so let's do that. I'm just going to say sudo R, and then I'm going to use Bioconductor. So I'm going to install the Bioconductor manager, BiocManager, and it will just install into the default library. And then I'm going to install preprocessCore first, just because, why not? This will take a little while because it has to download the package and install it, but it won't be too bad. This one is relatively quick. So it will start compiling right here. You can see that it's compiling some C code, which you can see from gcc. It's gnu99 C code. Okay, it already finished. I am going to update all of the packages. It doesn't take too much additional time, and it's always good to have your software up to date, even if it's just in a virtual machine. So it's going to take a little while, and then we are going to install GenomicFeatures, which will take a little bit more time. We're just getting things ready and set up, right? R is going to be important because we are going to use R to build our whole pipeline and to call the different programs and these kinds of things. All right, byte-compiling and preparing for lazy loading. I hope the sound of my laptop speeding up is not too noticeable, because it is getting relatively warm. I am of course using streaming software and I'm also running VirtualBox. So that was one of the things that I was a little bit worried about: is it not going to collapse under the weight of all of the strain that I put on the CPU? But I think so far everything's going fine. I hope that it's okay, guys, that you don't have too many frames missing or that it's too blocky. And of course the thing that it says here is not interesting at all, right?
This is just compiling packages, compiling packages. And as long as you have all of the requirements, so as long as libssl, libxml2 and libcurl are installed, it should all work and it should not encounter any errors. But of course you never know, right? There's always a little bit of debugging that you need to do when installing these things. But I think this should be okay. If there are any helicopter noises in the background, then that's just my laptop having a heart attack from doing VirtualBox and OBS and the whole YouTube thing, because I can of course only test so much. So I'm a little bit worried that it will break down at a certain point and say, well, you're asking too much of me. All right, next package. We just have to wait for this, right? And yeah, this is just updating, updating, updating. It's a lot of additional tools that it installs that we are going to need, and this one is just for having better control over matrices. It's actually using BLAS, the Basic Linear Algebra Subprograms, so that's pretty good. And of course, if you start your virtual machine with four cores, this part will run a lot faster as well. So always make sure that you allocate as many cores as you can to your virtual machine, but don't allocate all of them, right? Because your Windows, your host operating system, the thing that is running VirtualBox, also needs to be able to survive. We don't want to burn down the computer by giving it eight cores while the computer only has eight cores. All right, so all of the packages were installed successfully. Then we need one more, which is GenomicFeatures. I'm just going to paste it in, right? It's just the standard Bioconductor installation of the package, and it's going to install a whole bunch of them in this case. All right, so let's switch back to the PowerPoint and start thinking about the next thing to install. So, installing Trimmomatic.
So Trimmomatic is a very good program. I like it a lot. I've used it in the past. It trims our reads, right? So it snips off the ends that we don't need. But it has a couple of additional requirements. So again, we first want to install Git, which is a version control system that allows us to get stuff from GitHub, and besides that we need Ant. Ant is a build system for Java, one of the build systems for Java, and it's one of the older ones. So once it's done installing the packages in R, we are going to install Git, we are going to install Ant, and then of course we are going to start making our environment, which means putting everything in the place where it belongs, right? Because Trimmomatic is not a package which is available by default for Debian, so we need to install it somewhere. And I thought that it would be good to install everything in a folder called software. So we're just going to put it in our home folder, and we are going to install this not for everyone, but just for our own user, right? So in theory another user could install a different version of Trimmomatic, or could install a different software tool altogether to trim reads. But the idea is to keep everything nice and clean. All right, so it's almost done, I think. Let's switch back. And of course we can always install multiple things at the same time, right? We could just open up another terminal by saying, give me a new window. Where's that? Terminal, New Window. And we could directly start with the other installation. So let's just do that, right? I'm going to say sudo apt install git, just to do some things in parallel. Password is one two three four. Very good. Yes, I want to continue. And it will start installing Git, and then we will install Ant at the same time. So we can double up a little bit.
Of course, we only have two cores, so it's not going to be that much faster. And we will install Ant as well. It will take 200 MB, which is perfectly fine, because it will also start installing Java. So it will also install the OpenJDK, which we need to compile all of the Java code. All righty then, so that's done. So we will say mkdir, make directory, and make a software directory where we will put all of our software, right? We can do cd to change our directory, and then we go into the software directory. So now we need to get a local copy of Trimmomatic, right? Let me switch you guys back to the PowerPoint. We can get that by doing a git clone of the GitHub repository. It's made by the Usadel lab and it's called Trimmomatic.git, and we can do this because we installed Git. So I'm just going to copy this command and put it into the terminal, so I'm just going to say paste. And then it is going to say, I'm cloning into Trimmomatic, and this is a very small tool. We will go into the Trimmomatic folder, which is with a capital T, and then we are just going to say ant, and now you will see that it will fail. This happens a lot with bioinformatics tools, and this is because we are trying to install it in Debian, and the guys that made Trimmomatic only test on Ubuntu, and Ubuntu has a slightly different version of Java. So we can fix this very easily, right? It actually tells me what the error is. It says that the target option 1.5 is no longer supported, use 1.6 or later. So let's do that. I made a little slide for that, so let me show you guys the slide first. The slide is very easy: we need to update the version of Java, since Debian has a newer version. So we have to open up the build.xml file, which it complains about, and then we just have to go to line 34 and change the source from 1.5 to 1.6, because that is the version that is installed under Debian. So let's do that. We are going to say Activities and go to our Files.
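The build.xml change can also be made from the terminal instead of the text editor. The following is a small sketch, not the way it is done on stream: bump_java_level is a hypothetical helper, the sample javac line is made up for the demonstration, and the repository address in the comments is the Usadel lab GitHub repository as I understand it. It changes both the source and the target attribute, which is what the error message about target 1.5 asks for.

```shell
#!/bin/sh
# bump_java_level FILE: change source="1.5" and target="1.5" to 1.6 in
# an Ant build file, keeping a backup as FILE.bak (hypothetical helper).
bump_java_level() {
    sed -i.bak \
        -e 's/source="1\.5"/source="1.6"/' \
        -e 's/target="1\.5"/target="1.6"/' "$1"
}

# Demonstration on a temporary file; the javac line is an invented
# stand-in for line 34 of build.xml.
demo=$(mktemp)
echo '<javac srcdir="src" destdir="build" source="1.5" target="1.5"/>' > "$demo"
bump_java_level "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"

# In the real checkout this would be roughly (not executed here):
#   mkdir -p ~/software && cd ~/software
#   git clone https://github.com/usadellab/Trimmomatic.git
#   cd Trimmomatic
#   bump_java_level build.xml
#   ant
```

Keeping the .bak copy means you can always compare against the original build file if the build still fails.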
We now see that we have a software folder, and inside of the software folder we have Trimmomatic, and then here we go to the build.xml. We say right click, open with a text editor. We go to line 34, and here we indeed see that it says source is version 1.5. I'm going to say version 1.6, and here, for the target, also version 1.6. I'm going to save it, and then I'm just going to close it and run the build tool again. So I'm just going to issue the ant command, and now it should compile it. So, no problem, two seconds taken, and we now have Trimmomatic. All right, so it says that it's building a jar file, and this is the jar file that we are going to execute. So if we want to run Trimmomatic to test it, we can just say java -jar, and then it is built in dist/jar/ and then the Trimmomatic jar, release candidate one. And then you see that it tells us that the usage is that we have to tell it to do paired end or single end, and it has a whole bunch of additional options that we can set. Good, so Trimmomatic is installed. The packages in R are still compiling, but that will finish at a certain point. Good, so we go back to our software folder and we prepare to do the next one. So the next one is, okay, so this is Trimmomatic; one more build and then it should say BUILD SUCCESSFUL. It took zero seconds the last time. Good, so the next thing that we want to install is the software called Spliced Transcripts Alignment to a Reference, also called STAR. So this is our aligner, right? This is software which allows us to take reads, take a reference genome, take a reference transcriptome, and mash those three together. So let's just install it. It follows more or less the same structure. It's just using make, which is the usual build tool for C and C++ code, instead of using Ant. There are no additional requirements for this software to install. We already have everything.
We already had Git installed, so we can just git clone the STAR aligner, and then we go into the source folder and we just type make, and then of course we want to test it in the end. So let's just do that. We can do that while it is still installing the R packages. So I'm going to clone STAR. Just paste it, and there we go. It will download the software, which takes a little bit of time, and then we can go into it and we can just compile it. I do hope that the R thing will finish soon. It's taking up a lot of CPU. I'm looking at my... oh, sorry, a question in chat: where can we find the PDFs? Yeah, I didn't put the PDFs online yet. Let me actually do that. That should be relatively easy for me to do, so that you guys can get the PDF. Not only do we have the PDF, all of the code that we are using is also available on GitHub, so that you can just copy-paste it in. Let me put that in the chat. Open up the chat. It's still running anyway, right? So you can get all of the code that we're using today here, in this gist. Let me show you the gist, actually. Let's go to Firefox. So this is how it looks. These are the scripts and other things for RNA-seq. We're not at this step yet, we're not building our genome. We are currently doing the install software .sh file. So this was giving yourself sudo rights, this is then running the VirtualBox additions, and currently we are at this step, for STAR. So here is where you can find all of the commands that we will be issuing today. Just click the link in chat and it will bring you here. And it has this section which is called install software. So we're still installing the software. And the PDFs, I will definitely make sure that the PDFs are online, probably when we are done. I will update the description, so you can find a link to the PDFs down in the description once we're done. And then there's another question: can you show again how to open the file, please? Oh, okay.
Yeah, so the file. Let me switch back. So the way that we open up the file is, we just go to Activities and we say Files here, right? Let me close this one, because I left it open. So if you go to Files, it starts in your home directory, right? So we have software, and then we go to Trimmomatic. We click on the build.xml so that it's blue. We right click, we say open with text editor, and then it opens it up. Then we just scroll down, we go to line 34, and we change the source and the target from 1.5 to 1.6, because we are just telling it that you need this version of Java, right? And the version of Java that it needs is at least 1.6, since that is supported by the version that is installed in Debian. Is that okay, Leo? So yeah, and the PDFs I will put online. The PDFs will be down below in the description once we're done with the lecture. Let's see if it finished. So it's still doing a couple of things. The STAR download finished, and it is still doing the compilation for the R package. It's a big, big R package, this GenomicRanges package, and it pulls in a lot of dependencies. So we just have to wait for it. It doesn't matter, because we can do two things in parallel. That's why I took two cores, so we can use the other core to start compiling STAR. So the way that we do that is we say cd STAR, right? So go into the folder, and then it has a source folder. So we have to go into STAR and then into the source folder. If you type ls, we see that there's a whole bunch of files here, but to compile this we can just type make, and then it will start compiling the software. There are a lot of warnings, which is a little bit worrisome if you're compiling code. Generally, if you have C code, you hope that there are no warnings. But the software is academic software, right? The STAR aligner is a relatively new aligner, so there are some, well, it's not bugs, right?
It's just warnings. But generally you want to have no warnings when you are compiling code. Good. All right, perfect. So good that that works. All right, so it's just compiling, compiling, compiling. On the one it does STAR, on the other it's doing the R package. So it'll finish at a certain point in time. Fortunately STAR is not that heavy, so to speak, but it takes a little bit of time. And again, I want to point out, I already put the link in the description, I think. A message in chat: "When I type ant in the Trimmomatic directory it shows BUILD FAILED." Okay, but then that's not the error. If you go up, there should be a line where it tells you what the error is, because this is just telling you that the compilation failed. There should be a line, when you scroll up a little bit, that has some kind of an error in there. And you were the one doing it under Debian, right? Oh, no, under Ubuntu. So yeah, scroll up and find the error, find where it says error, could not find something or could not do this. The thing that you copied into the chat is not the error. It's just the message that the compilation failed. And that's always a thing, right? So that's just a warning. Warnings don't matter, so that's also not the error, right? It says compiling 65 source files, warning: source release requires target release 1.6, which is fine, because we just set the... oh, did you update both of them, or did you only update the source? Let me show you. So if we go here, we go to Files, and we go to... okay, so you can always test it, right? You can test it by using the command, which is... let me put the command for you in chat, just to make sure that it compiled successfully. Where's my Firefox? There's my Firefox. So if it compiled successfully, you can say java -jar, and then it's in the dist folder. Let me actually get the command from here, because I just typed it, right? Here, so it's this command.
I'm just going to copy it from here. If you execute that command from the right folder, so if you are in the Trimmomatic folder, then you should be able to get the output that we just had. Let me show you that, just because STAR compiled. So STAR compiled, right? It just ends with a warning. So if we do ./STAR, then we execute STAR, and then it says that yes, you can use STAR, right? It tells me that it was compiled, it was compiled here, and it gives you some information about STAR itself and the manual. So let me actually go back to Trimmomatic, cd Trimmomatic, and then we can just use this command, which is java -jar. If you execute this command, it should just show you the usage. If it tells you at this point, file not found, cannot find the Trimmomatic jar, then the compilation did not work. Good. So R is also done, so that is really nice. We can actually test that. I'm going to say library. Oh, that's not the one that I want. I'm going to say library GenomicRanges. GenomicFeatures, sorry. All right, so that seems to load. Okay, welcome to Bioconductor. So that should be okay. So yeah, Leo, let me know if it works. If it doesn't, then just send me the whole output by email and I can help you later on. All right, so that worked, and we also want to load preprocessCore, just to check. And that also works. Good, so all of the R packages also installed successfully. All right, so we can close one of these. So let's go out of Trimmomatic, right? We see that we now have STAR installed and we have Trimmomatic installed, which means that we are now able to take reads, cut the bad parts off, or the parts with low quality, and cut the adapters off. And we are also able to take reads and align them to a genome. We still don't have a genome, we still don't have a transcriptome, but we're getting there, right? So it's going to be more software. All right.
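As a recap, the STAR steps from the last few minutes could be collected in a small script like this. It is a sketch under assumptions: the run helper and the DRY_RUN variable are made up for this example, the software folder is assumed to be in the home directory, and the repository address is the public STAR repository on GitHub as I know it. By default it only prints the commands so you can check them first; setting DRY_RUN=0 would actually execute them.

```shell
#!/bin/sh
# run CMD...: print the command; only execute it when DRY_RUN=0
# (hypothetical helper for this sketch).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Clone STAR next to Trimmomatic, build it with make, then start the
# binary without arguments to see its usage information.
run cd "$HOME/software"
run git clone https://github.com/alexdobin/STAR.git
run cd STAR/source
run make
run ./STAR
```

The Trimmomatic check works the same way: from the Trimmomatic folder, java -jar followed by the jar file under dist/jar should print the usage.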
So the next step is going to be installing Picard tools. Picard tools is a little bit different install-wise. It starts again by getting it from GitHub, so we're going to get the latest version from the Broad Institute. We're just going to clone it, we're going to go in there, and this time we are going to use Gradle. Gradle is the newer build system for Java, and it's similar to Ant, but you'll see that it's slightly different from Ant, because it will start downloading a whole bunch of additional things. But this is what we're going to do. So let's just quickly switch back and execute the commands. I'm just going to get them from here. So I'm going to copy, and I'm going to clone Picard. It takes a little bit of time to download it, although it's going relatively fast, like 100 MB per second. It's not bad. All right, so we are going to go into Picard, and then we are going to say gradlew, and what we want to build with gradlew, let me check, is called shadowJar. I don't know why it's called shadowJar, but that's the name of the target that we want to compile. So we're just going to press enter, and now it will start downloading the software. It will start the daemon, and it will start downloading all of the dependencies from online to be able to compile Picard tools. So it's still configuring, but after this it will start downloading. And you can see that it's downloading using both cores, because it's downloading two things at the same time. So it will configure the project and then start compiling all of the dependencies, and afterwards it will build this shadowJar, which is just the Picard software itself as compiled Java. It will throw some warnings, which is perfectly fine. Seven warnings in total. I think that's the total number of warnings that we are expecting. And it will execute.
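Written out, the Picard steps could look roughly like this. Again a sketch under assumptions: the run helper and DRY_RUN are made up for this example, the software folder is assumed to be in the home directory, and the repository address is the Broad Institute's public Picard repository as I know it. By default the commands are only printed; with DRY_RUN=0 they are executed.

```shell
#!/bin/sh
# run CMD...: print the command; only execute it when DRY_RUN=0
# (hypothetical helper for this sketch).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Clone Picard and build the all-in-one jar with the Gradle wrapper
# that ships inside the repository.
run cd "$HOME/software"
run git clone https://github.com/broadinstitute/picard.git
run cd picard
run ./gradlew shadowJar
```

Because gradlew is a wrapper script inside the repository, no separate Gradle installation should be needed; it downloads what it needs on the first run, which is the downloading you see on screen.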
So now it should have compiled this. Just to check, we are going to go into the build folder, and then we're going to do an ls, and then it should be in libs, I think. There is the Picard tool. So we have the Picard SNAPSHOT all jar. So this is the one. We compiled two versions of Picard; this is the current version from Git. So we are just going to test that it works. We're going to say java -jar and then the Picard jar, and it works. It tells us that there are a lot of things that we can do, right? There are a lot of different subtools in Picard. Picard is more or less this Swiss army knife to do everything related to BAM files and SAM files. It's very similar to samtools in a way. It can do things like indexing, and it can collect base calling metrics. So it's a really nice kind of Swiss army knife to have for your pipeline. It can also do things with VCF files, so it can for example fix VCF headers, and it can also add things to BAM and SAM files to fix them or to update them while you're doing it. All right, let's go all the way back to the software folder, and then let's switch again to the presentation. So after Picard, the next tool that we want to install is htslib. htslib is a very interesting tool because it is the core of a lot of packages. We already installed htslib when we were doing the R part: when we were installing these R packages, they also installed htslib for R. So it is this library which has all kinds of different tools which are used by samtools and by bcftools. That's why I'm going to do all three together. These have to be next to each other in the same directory. But first I need to have a dependency. It uses autoconf. autoconf is a C tool.
So it allows C code to be independent of a platform. The autoconf tools contain tools which will scan the system that they're working on and generate a makefile for you, so that you can just use make to compile it, and all of the hardware-specific things, or more or less all of the OS-specific things, will be fixed by autotools. All right, so let's go and install autotools first. That's going to be really, really quick, because it's a very small package, about four MB, so it's going to install like this. After we've installed autotools, we are going to clone htslib, which is kind of the core, so samtools, bcftools and a lot of other software depend on htslib being available. We are going to clone samtools as well, and then we are going to clone bcftools for dealing with VCF and BCF files. All right. So we are then going into htslib, and htslib is a little bit strange. Let me show you guys what I mean. When you go into htslib, you have to do a git submodule initialization. Git, as a version control system, has the ability to have submodules, but we need to make sure that all of the submodules of the HTS library are downloaded. So that's what this command does. It says git submodule update: update all of the submodules, do this recursively, and if they're not there, initialize them. So let's see what that does. It's very, very simple. We're just going to take this command and paste it in, and it's just going to download htscodecs, which it depends on, and that's it. It's just a single submodule, but if you forget to do this, then the compilation will fail. So the next step is to use autotools to reconfigure the system, right? htslib can be compiled under Windows, under Mac, under Linux, but it needs to know which operating system it's running on, and it's using this autoconf to reconfigure the build system.
So we're just going to say autoreconf -i. That's also in the slides and in the commands: it's just autoreconf -i, and it will look at which system I am on, am I on Linux, am I on Mac, or am I under Windows? Then it generates a configure script, and we can execute this configure script to set up all of the parameters that we need. Here it looks at all kinds of things, whether they are available or not. It looks for the compiler and checks that the compiler is working; it tests the compiler. It does all of these steps to make sure that it can compile htslib with as many optimizations as possible, right? Some things are not available, so it will not use those. Some things are available, so it will start using those. After this is done, we can just type make and it will start compiling htslib, so it compiles the C code. All right, let's go back quickly. So for htslib: autoreconf, configure, make. This will be very similar for samtools, right? After we've compiled htslib, we're going to compile samtools, our Swiss Army knife for everything SAM and BAM related. Here we do an autoheader, which generates headers. Then we do an autoconf again.
We tell it that we don't want any syntax warnings, then we configure and we make. It's very similar to the autoreconf route, so very similar to htslib. So you can see, oh no, you can't see. You can see that it's currently compiling CRAM support, so that we can use not only BAM files or SAM files but also CRAM files, which are a newer, more compressed alternative to SAM and BAM. All right, very good. So htslib compiled. It doesn't tell you that it compiled successfully, but if you do an ls, let me go in here, then you can see that it did compile something. What it compiled, which we are going to need later on, is bgzip. This is a blocked gzip. But it compiled some other programs as well: htsfile is also compiled. The ones in green are executable files. It also compiled our libhts.so, and this is the dynamic library that samtools and bcftools need. All right, so let's go back and go into the samtools folder. We say autoheader, then we say autoconf -Wno-syntax. Right, don't show us all of the syntax warnings. Oh, this needs to be a minus. This tells the configuration script that we're not really interested in all of the syntax errors that are there. Well, they're not syntax errors, they're syntax warnings. Then we do configure, so it knows which compiler we're using and which system we're on. And then we just make, and it will start building samtools from scratch. You can see here that it includes htslib, so it is using the latest version of htslib to compile our code.
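The build steps described so far can be collected into a short script. This is only a sketch under assumptions: the install location `~/software`, the helper names `build_htslib` and `build_with_htslib`, and the use of the public GitHub repositories are illustrative choices, and git, autoconf, automake, make and a C compiler are assumed to be installed already. Nothing runs until one of the functions is called.

```shell
# Sketch of the htslib / samtools / bcftools build walked through above.
# SOFTWARE_DIR and both function names are assumptions made for illustration.
SOFTWARE_DIR="${SOFTWARE_DIR:-$HOME/software}"

# Each function body runs in a subshell, so cd and set -e do not leak out.
build_htslib() (
  set -e
  cd "$SOFTWARE_DIR"
  git clone https://github.com/samtools/htslib.git
  cd htslib
  git submodule update --init --recursive   # fetches htscodecs
  autoreconf -i                             # generate the configure script
  ./configure
  make
)

# Works for samtools and bcftools, which share the same build commands
# and expect htslib to sit next to them in the same directory.
build_with_htslib() (
  set -e
  repo="$1"
  cd "$SOFTWARE_DIR"
  git clone "https://github.com/samtools/${repo}.git"
  cd "$repo"
  autoheader                # generate config headers
  autoconf -Wno-syntax      # suppress syntax warnings
  ./configure
  make
  "./$repo" --version       # quick smoke test of the fresh binary
)
```

A possible invocation would be `build_htslib && build_with_htslib samtools && build_with_htslib bcftools`, run once the `~/software` directory exists.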
And these two need to match, right? The htslib version needs to match the samtools version, and this is very important, because if they mismatch there are going to be issues. All right, samtools done. I made a similar slide for bcftools, but bcftools requires the exact same commands as samtools, because it's made by the same guys, right? So it's not that different. All right, so we go to bcftools. We do an ls to see what's there, and then we just say autoheader. I can press the up key to get back the previous commands that I ran, so I'm just going to press up, up, up to make sure that I get the exact same commands as last time, since they worked. So: make the headers, make the configuration, do the configuration, and then make, build the software. And after we're done, we're going to test them, just to make sure that both samtools and bcftools work. Leo says "make". Yeah, make, that's the build system for C. Nowadays you have CMake as well, which is kind of a combination of autotools and make. But make is the standard way of building code under Linux and under Windows, so makefiles are very generic. Again, you see that it includes htslib, so these versions need to match to make sure that everything works. All right, so let's test bcftools. I do ./bcftools to execute it, and you can see that it works. "Meant to type it in my terminal." No worries, no worries. You can be as loud in chat as you want. Actually, it helps to be loud in chat, because YouTube thinks, oh, this is a nice stream, there are a lot of people chatting, so it will suggest the stream to more people. All right, so we have bcftools. Let's test our samtools as well. I'm just going to do ./samtools.
The nice thing in Linux is that you can press Tab: you type sa, press Tab, and it will autocomplete the command for you. And that also works, right? You can see that it is version blah blah blah, it's using this htslib, and it's a tool for alignments in the SAM format. It does the BAM format as well. Good. So that's another set of tools that we need in our pipeline: we need samtools, we need bcftools, and we also need some of the tools from htslib, like the blocked gzip. So that's what that is. Let's go out of the samtools folder and move on to the next installation that we need. The next installation, and I have to switch to PowerPoint, is GATK. The GATK is a massive software tool. It contains a lot of different tools related to sequencing. We are going to use it for the recalibration around indels, and we are going to use it because it can do a lot of things with SAM and BAM files. It contains a lot of QC tools as well. So we need to download it. Normally I would always like to download the source and compile it myself. The problem with GATK is that it is such a massive tool: the whole repository is around two gigabytes in size. So I decided to just get the precompiled binary. That means we just get the software, unzip it, and that's going to be it. We can compile it from source; you have all of the requirements already installed, and it compiles the same way as Picard. You could just do a git clone of the GATK repository and then use Gradle to build it. But since it takes so much time to download the whole repository, and a lot of time to compile all of the code, I thought it would be quicker to just download it and unzip it, right, so that it's done directly. All right, so let's do that.
So let's go back to the terminal and... oh, I didn't put this code in here. That's weird. All right, so how am I going to do that? I just noticed that my script, the one that I put online, has no GATK code. I forgot to put that in. That's a little bit annoying, but we can do it slightly differently. I'm just going to get the link from online: GATK download. That way I don't have to type it in, because I don't like typing long paths. Download the latest version. I can actually show you what I'm doing as well. I go here, and here it says download the latest release. So that's what I did, right? I searched for GATK download, clicked here, clicked on the download, and here you have the zip file. So I do right click, copy link. Then we go back to our Debian and do a wget of this address, and it will start downloading. Downloading it this way, you don't get all of the history. It's still a large software package, right? But you don't get the five gigs or so of history that is inside of the project. "Can you paste the link in the chat, please?" Yes, of course. So the GATK link is here. No worries, no worries. So you're actually quite on track then, Leo, to get everything installed if you're at the same step. "Can we use conda or mamba to install packages?" You could if you want, but then you're not really doing it from scratch; then you're just using conda. We are downloading more or less the latest version of everything from GitHub, and that is why we are doing it step by step, more or less from scratch. All right, so we've got our zip file ready and we are just going to unzip it.
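The download-and-unzip route for GATK could look roughly like the sketch below. Several things here are assumptions rather than what was typed on stream: the function name `install_gatk`, the `~/software` location, and the URL pattern, which follows the naming of the project's GitHub release assets. The version string is deliberately left as an argument; it should be copied from the release page.

```shell
# Sketch: fetch a precompiled GATK release instead of building from source.
# install_gatk and SOFTWARE_DIR are illustrative names; the version is supplied by the caller.
install_gatk() (
  set -e
  version="$1"                                   # copy this from the GATK release page
  cd "${SOFTWARE_DIR:-$HOME/software}"
  wget "https://github.com/broadinstitute/gatk/releases/download/${version}/gatk-${version}.zip"
  unzip "gatk-${version}.zip"
  # Running the local (non-Spark) jar without a tool name prints the tool list;
  # it may exit non-zero in that case, so the status is ignored here.
  java -jar "gatk-${version}/gatk-package-${version}-local.jar" || true
)
```

Calling `install_gatk` with the release version as its only argument downloads only the packaged jars, not the multi-gigabyte git history.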
So I'm going to say unzip ga, press Tab so it autocompletes for me, press Enter, and it will unzip GATK. To test this, we can do java -jar, then gatk, and then it is gatk again, and if you press Tab a couple of times it actually gives you the suggestions. You can see that there are a couple of different ones. In this case we are going to use the local version of GATK. If you're on a cluster, not a multi-core machine but a multi-node cluster, you can use the Spark version, which allows you to distribute jobs across a cluster of computers. But we're going to take the gatk-package jar and test the local version. We press Enter, and then we see something very similar to what we saw with Picard. It shows you that all of these different tools are available. There's a tool called MethylationTypeCaller, there's GatherTranches, ReblockGVCF. There are even some of the Picard tools in here. I don't know if GATK contains all of the Picard tools, but a lot of them are also available via GATK. All right, so that's the GATK installed, very good. Almost there, right? We almost have everything ready, because if you think back to the list that we had of all of the software, GATK was one of the last tools on the list. I think that the next code is also not going to be in the script, but the final thing that we need is the Sequence Read Archive toolkit. The SRA toolkit is this toolkit which allows us to automatically download sequences from the Sequence Read Archive. If you think about sequencing, right, if you have a paired-end run you get two files: a right side read and a left side read. And these files are humongous.
They are literally gigabytes big. You can download them using Firefox or Chrome or these kinds of things, but that's very slow, and NCBI made this SRA toolkit to allow you to download these large files very quickly: it downloads these files in parallel. The big issue is that the SRA toolkit again is very big, and it is not available for Debian. Fortunately, it is available for CentOS. So I'm just going to install the CentOS version in Debian, and this will work, because one Linux and another Linux are just slightly different flavors of each other. So let me get the address. I have to switch back, and I'm just going to very quickly get it out of the PowerPoint and also throw it in chat, so that the people following along can get it directly as well, because it's this massive link again. So the link for the SRA toolkit is here, and I'm going to put it in chat first. This is the one that you need. If you are doing this under Ubuntu, there is an Ubuntu version available as well, so let me actually look at that. Let me go to the Ubuntu version here. If we go to the SRA toolkit page, we go here, and then we have the SDK, fasterq-dump, errors during download... let me see, where is it? SRA toolkit, here. It recently changed, so you can go to the download page. For Ubuntu there is a non-sudo tar archive, the same for CentOS by the way. There is also the apt-get install script; we're not going to use the script, we are going to install it ourselves. So make sure that if you are doing this on Ubuntu, you take the Ubuntu version. If you are doing this under Debian, the operating system that we are using, we use the CentOS version, just to make sure. All right, so I have the link, I think, still on my clipboard.
So I'm just going to say wget and paste, to get this SRA toolkit. The next step is to unpack it. We first make a directory, because this is a tar file, and it's normally meant to be extracted over your whole Linux system, which I don't like. I like things to be installed locally and not globally. So what I'm going to do is make a directory. Oh, I'm typing somewhere else. Where am I typing? I'm going to click here to make sure that I have my mouse here. So I make a directory called sratoolkit, right? I'm going to extract everything in this tar.gz file into the sratoolkit directory. How am I going to do this? This is a little, like, we call it a donkey bridge; it's a mnemonic in English. You say tar, then a minus, and then "extract ze file": x, z, f. There's this famous cartoon about typing in a valid tar command, right? They're defusing a nuclear bomb, and it says, type in a valid tar command. The only thing that you have to remember is tar -xzf, extract ze file. So we extract the file called sratoolkit.3.0 and so on, this whole big file name, and then we say -C with a capital C, for where we want to extract it to, and we want to extract it to the sratoolkit folder that we just made. Then we press Enter, and it extracts everything from this tar archive into the sratoolkit folder. Good. And this next part is really interesting, I like this a lot. To use the SRA toolkit, you have to use this kind of archaic configuration tool. It's the vdb-config interactive tool; you have to run it, and you can do a little bit of setup, like where do I want my stuff stored?
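To see what the "extract ze file" command plus -C does without downloading anything, here is a toy demonstration. It builds a tiny throwaway archive in a temporary directory and extracts it into a local folder; the file names are placeholders, not the real SRA toolkit contents.

```shell
# Toy demo of: tar -xzf ARCHIVE -C TARGET_DIR
workdir="$(mktemp -d)"

# Build a small archive whose paths start at usr/, like a system-wide package.
mkdir -p "$workdir/payload/usr/local/bin"
echo "placeholder" > "$workdir/payload/usr/local/bin/tool"
tar -czf "$workdir/demo.tar.gz" -C "$workdir/payload" usr

# Extract locally instead of over the whole system:
# x = extract, z = gunzip, f = archive file, -C = directory to extract into.
mkdir "$workdir/sratoolkit"
tar -xzf "$workdir/demo.tar.gz" -C "$workdir/sratoolkit"

ls "$workdir/sratoolkit/usr/local/bin"
```

The target directory has to exist before tar runs, which is why the mkdir comes first. Leaving out one of the main operation letters such as x is what triggers the "You must specify one of the..." message from tar.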
So I wanted to show you that as well, because the SRA toolkit is a very important tool to use. I need to have this command, so I'm going to go back, close this, and copy-paste it in. I'm going to paste it into the chat as well, so that everyone following along has it. What we're going to do is ./, which means execute, then from the sratoolkit folder: usr, local, ncbi, bin, vdb-config. So we are going to execute this vdb-config. Let me paste it in here as well. Oh, something went wrong there, because it's sratoolkit, vdb-config, -interactive, and that's it. I don't know why there was something at the back and at the front. Oh, I'm deleting stuff from the PowerPoint again, so bad. All right, so let's run this interactive configuration. This is the problem, it needs to be a little bit different. "Unknown argument -n". How do you mean, -n? -interactive... All right, I'm just going to go into the folder. I go into sratoolkit, into usr, local, ncbi, sra-tools, bin, and then I just say vdb-config. "Tar gave me this warning: you must specify one of the..." Ah, right, yeah, it's probably the minus sign. So it is tar, minus... "I downloaded the zip file for Ubuntu from the website." All right. "You must specify", yeah, but you specified -x, extract ze file, right, and then the name of the tar, so like this. Because it says that you did not specify one of A, c, d, t, r, u, x, but you should have put an x into the command. So it's tar -xzf, then a space, and then the name of the big tar file. I think that's the issue there. Let me actually fix my own error. -interactive, why does that not work then? Let me see, because vdb-config just executed.
Oh, that is interesting. vdb-validate, vdb... where's the vdb-config? Ha, interesting. Because in the end I only want one tool, fasterq-dump. Yeah, so: "Please run vdb-config." Ah, it's --interactive, with two minus signs. That is the issue. Yeah, so I was just forgetting a minus. Let me update that in the PowerPoint as well, right? So it's vdb-config --interactive. If we execute it, we get to see this screen. This screen looks very archaic, and it is very archaic; this is kind of how we set up tools in, well, 2001 or something. The only thing that we have to specify here is the cache folder. We can press Tab, Tab, Tab, Tab, Tab, and you see this little red thingy jumping from one item to the next, right? I stop when it's in front of Cache and press Enter. I press Tab again and choose the location of our user repository, so I press Enter. And then, where do I want my files stored? This is where I Tab again, all the way to Create Directory, and I call it cache_ncbi, so I make a new folder. Then I go to that folder and press Enter, and then I Tab to OK. "Do you want to change the location to /home/danny/cache_ncbi?" and I say yes. Right, so I'm setting up the SRA toolkit, and the way I'm doing that is just by specifying the cache directory, because it needs to know where to store temporary files. All right, so now it's all set.
So we can now just do Tab, Tab, Tab again, all the way until we hit Exit. We press Enter, and it asks us whether we want to save the changes. Yes, we want to save the changes. Everything was saved successfully, and now we should be able to execute the tool that we want, because the tool that we want is this fasterq-dump, right? This allows us to dump FASTQ files from the Sequence Read Archive directly to our hard drive. So ./fasterq-dump, and now it works. Everything is set up perfectly fine. This is what we want, right? In the end we just want to execute the program, and it has to tell us that it's fasterq-dump, the usage, a path, options, these kinds of things. Good. "A little bit of confusion: what is the command line we used to open this thing?" Yeah, so let me do that once more. We go all the way in, right? In software we go into sratoolkit, into usr, into local, into ncbi, into sra-tools, into the bin folder. There we have ./ to execute, and the program that we want to execute is vdb-config, and then we say --interactive. Let me just throw that in the chat for you. No, not copy as HTML, just copy, and I'm going to throw that in chat. Sorry, there has to be a dot in front, so ./vdb-config --interactive, right? This allows you to run this thing interactively. Let me execute it for you again. It's very archaic, I agree. It looks like something from the 1990s, right? And it's just for setting up your cache folder. "I don't have a user directory in sra-tools." Okay, so you could try to... because depending on how you extracted the file, it might have extracted over the other stuff. I would try getting the... ah, it's not "user". Yeah, no, it's usr, for user. So you could just press Tab, right? So let me actually go out: Tab, Exit. Right.
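For reference, the two SRA toolkit commands from this part can be wrapped as below. The directory name `sratoolkit` under `~/software` and the helper names `configure_sra` and `check_sra` are assumptions for illustration; the nested `usr/local/ncbi/sra-tools/bin` path is the one read out on stream for the CentOS archive and may differ for other downloads.

```shell
# Assumed layout: the archive was extracted into ~/software/sratoolkit.
SRA_BIN="${SRA_BIN:-$HOME/software/sratoolkit/usr/local/ncbi/sra-tools/bin}"

configure_sra() {
  # Opens the text-mode configuration screen; note the two dashes.
  "$SRA_BIN/vdb-config" --interactive
}

check_sra() {
  # Confirms fasterq-dump is where we expect it and prints its usage text.
  if [ -x "$SRA_BIN/fasterq-dump" ]; then
    "$SRA_BIN/fasterq-dump" --help
  else
    echo "fasterq-dump not found in $SRA_BIN; check that usr/ was extracted" >&2
    return 1
  fi
}
```

Running `configure_sra` once to set the cache folder and then `check_sra` mirrors the order used here. If `check_sra` reports that the binary is missing, the archive was most likely extracted somewhere else.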
So this is the path the way it looks for me: software is the folder that we made, then we have sratoolkit, usr, local, ncbi, sra-tools, and then bin. So yeah, that might be the issue, that it's just usr. Like I said, if you go cd software, cd sratoolkit, and then press Tab, it shows you what there is. There should be an etc folder and a usr folder. So usr, and then you can just keep pressing Tab, because there's only one folder in there at each level. And then... okay, so you didn't extract it properly. Yeah, just google for it, because it might be a little bit different between Ubuntu and Debian, and then you could actually rewind the stream to that point. Do you mind if I continue with the others? Because we've been streaming for almost two hours, and I was planning on only doing two hours, but we're almost there, fortunately. The SRA toolkit, actually, we're only going to set it up today. I will actually update the... "Yep, that is fine, don't worry, go ahead." All right, tough that school and daily berry, I actually approved your comment, because for some reason YouTube blocked it, just because it's a tar command. Valid tar commands are difficult. All right, so after we've set up the SRA toolkit, we now have everything. So let's go back to the PowerPoint. We have all of the tools that we need for our pipeline. But now I want to make sure that I don't have to type all of these paths all of the time, right? So I'm going to make a little bin folder, and inside this bin folder I'm going to symlink all of our tools. That way, for example, when we update STAR, because a new version comes out, we can just do a git pull, it pulls down the newest version, we type make, and we have the latest STAR. So what I'm going to do is go back, right? I'm going to show you my screen again.
So I go cd .., and I make a directory called bin, and this is where all of the binaries are going to be linked. Then I go into the bin folder and make symbolic links, meaning that I make a STAR executable here which is not a real file, but a file which points to the STAR path, right? So to this path, /home/danny/software/STAR/source/STAR, because that's where the executable is. For bgzip, the executable is at /home/danny/software/htslib/bgzip. By making symbolic links, I can update the code in the future, right? I can just put a new version there, and the symbolic link will then point to the newest version instead of the old one. All right, so let's do this. And I think, again, I did not put this part in the online repository, so let me copy it from here and put it in the online repository for you guys. Really sorry about that; I have to update that to make sure. All right, so let me get that and put it in here. Let's get the ln -s lines, all of them. Oh yeah, you can't see what I'm doing, but I'm just making sure that I update the script; I've been streaming for two hours or more. I can actually show you. So this is Notepad++, right? I just put all of the commands in here, and I'm going to update the scripts online so that the online script also contains them and you can just get them from there. This is in install software. I say edit; the edit button is all the way at the top. I go all the way down in install software, and I'm also going to add the other commands in there. So I say hashtag, make symbolic links, right? I can show you that as well while I'm doing it. So I'm just adding this.
So I say mkdir bin, cd bin, and then this is the command to do it. And then for some reason the script stopped at bcftools. After bcftools we had the GATK, so let me put that in as well. I say wget for the GATK, and then it was just unzip the GATK. We put that in as well. And then we had the SRA, the Short Read Archive, and that's a couple more commands: the wget, make the directory for it, just so that you can follow along, and then tar, extract ze file. That seems to be okay. The last one is the interactive configuration. You can execute this directly from here, but it's not a single minus, it's --interactive. All right, so now we have everything again. So: make the symbolic links, and then I say update, and then you should have it as well. If you go to the gist link now, you should be able to see the SRA commands and the symbolic links as well. Good. All right, so let's make all of these symbolic links, so that we are sure that we can execute all of our programs. I'm just going to paste them all in one go, right? So it's linking, and if I now look at the bin folder, I have bcftools, bgzip, fasterq-dump, samtools, and STAR, all in this bin folder. Now the next step is updating the .bashrc file, right? The .bashrc file in Linux determines, when I open up a terminal, which things are available on my path. I'll show you the PowerPoint. The .bashrc file is this file that gets read when you log into a system or open up a terminal: the shell looks at this file and sees, I need to do all of these things before the user gets control. In this case, we want to add this home bin folder to the path, to make sure that every time I open up a terminal I can just type STAR and it will execute the STAR command for me, right? I'm actually wondering if it did the other ones. Yeah, so it also linked the other ones.
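The point of linking instead of copying can be shown with a small self-contained demo. Everything below lives in a temporary directory, and the "STAR" file is a stand-in script, not the real aligner; only the ln -s pattern matches what is done on stream.

```shell
# Toy demo: a symbolic link in bin/ keeps pointing at whatever build sits at the target path.
demo="$(mktemp -d)"
mkdir -p "$demo/software/STAR/source" "$demo/bin"

# Stand-in for a freshly compiled executable.
printf '#!/bin/sh\necho "STAR placeholder v1"\n' > "$demo/software/STAR/source/STAR"
chmod +x "$demo/software/STAR/source/STAR"

# ln -s TARGET LINK_NAME: the link stores the path, not a copy of the file.
ln -s "$demo/software/STAR/source/STAR" "$demo/bin/STAR"
"$demo/bin/STAR"

# Simulate "git pull && make" replacing the build at the same path.
printf '#!/bin/sh\necho "STAR placeholder v2"\n' > "$demo/software/STAR/source/STAR"
"$demo/bin/STAR"
```

The second call goes through the same link but runs the updated file, which is why the links in bin never need to be recreated after a rebuild.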
So what we're going to do is open our .bashrc file, put this line in all the way at the end, and then save the file. Again, I remember that these commands are not in the online script, so I'm going to edit those in to make sure that you can do that as well. Let's go back here, and then I can update the path. I'm going to show you what I'm doing in Firefox. I'm just adding this to install software: after linking, we update the .bashrc file, and that's just nano, and then this is what you have to put at the end. So I say hashtag and this. All right, and I update, so that you have the updated version. So let's do this in our terminal. Let's close this one, go here, go to our home folder. Then I update my .bashrc file. I go all the way down to the end; it's a big .bashrc file. I press Enter a couple of times and add this to my path. The only thing this does is say that every time you start your terminal, the home bin folder is added to the PATH variable, and the PATH variable determines where the shell looks for executables. I press Ctrl+O to write it out and Ctrl+X to exit. All right, now I close my terminal and open up a new one. Now when I type STAR, it executes the STAR aligner. When I type samtools, it executes samtools. When I type bcftools, it executes bcftools. When I type bgzip, it executes bgzip. So now we have our system set up. We have all of the tools that we need, and we are ready to start with RNA sequencing. Very good. This is where I wanted to be, at least, at the end. There are still a couple of slides, because there are a couple more things that we can do. But let me show you. All right, so the next thing is a new terminal, right?
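The exact line added to .bashrc is not readable from the transcript, so the sketch below assumes the conventional form `export PATH="$HOME/bin:$PATH"`. To keep the example harmless it writes to a temporary stand-in file rather than the real `~/.bashrc`.

```shell
# Sketch: append a PATH line to an rc file only if it is not there yet.
rcfile="$(mktemp)"                      # stand-in for ~/.bashrc in this demo
line='export PATH="$HOME/bin:$PATH"'    # assumed form of the line added on stream

# grep -qxF: quiet, match the whole line, treat the pattern as a fixed string.
grep -qxF "$line" "$rcfile" || echo "$line" >> "$rcfile"
grep -qxF "$line" "$rcfile" || echo "$line" >> "$rcfile"   # second run changes nothing

# A new terminal reads the rc file automatically; sourcing it has the same effect here.
. "$rcfile"
echo "$PATH"
```

Putting `$HOME/bin` in front of the existing PATH means the linked tools win over any system-wide copies with the same name.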
So we can execute samtools, bgzip, STAR, bcftools, and fasterq-dump by just typing the commands. We have also installed Trimmomatic for read and adapter trimming. We have the GATK. We have the Picard tools for various BAM and SAM tasks. So we have all of the building blocks ready to start building our own RNA sequencing pipeline. And we did this all from scratch, right? It took us two hours, but we now have everything ready. So, we need a reference genome, right? I thought about this a lot. At first I thought, should we do this for mouse? But the mouse reference is pretty big, and since I'm using a virtual machine, I only have two CPU cores available. Doing all of this for mouse would take hours and hours, because we need to get a reference genome and then index it: the aligner has to build these indexes so that it can very quickly jump to specific parts of the mouse genome. That's of course not going to work. So I decided to do something with yeast. I'm not going to say the name, I can't pronounce it. But let me actually show you. If we go to Ensembl to get a genome, the way that I always get there is to just type Ensembl FTP, right? You take the first result. Then on the Ensembl FTP page, if you scroll down a little bit, you see that it has all of the animals available here: human, mouse, zebrafish, and all of the others. Normally, when you click on this, for example you look at the human FASTA files, you see that there are FASTA files for each of the chromosomes. So this is just the sequence, chromosome by chromosome. At the end we see that there's a nonchromosomal file as well, which holds the parts of the genome that haven't been assigned to a chromosome. But the thing that people always wonder about is these two, right? We have the human primary assembly, and we have the top level.
And I always get the question from people when I do alignment: should I use the top level or the primary assembly? The answer is: use the primary assembly, because the primary assembly contains the real chromosomes, chromosome 1 through 22, X, Y, and the mitochondria, right? It doesn't contain all of the alternate sequences and patches and the other more or less junk that hasn't been properly placed in the human genome yet. So that's one of these things: when you have different flavors of your reference genome, always take the primary assembly, never the top level. Then, if we go back very quickly, we also see that we have dna, or dna_rm, and if we scroll down a little bit we also have dna_sm. So what does this mean? Why do we have three different versions of the top level, for example? Why do we have rm or sm? This has to do with repeat masking. In a genome there are a lot of repeats. We have dna, which is the raw DNA sequence with all of the repeats still in there. We have sm, which is soft masked, and we have rm, which is the hard repeat-masked version. In this case, since we are going to align against the genome, we always want to use the primary assembly and then the dna version that is available. So that was just an intermezzo about which version you should use when you do RNA or DNA sequencing. We also need the transcriptome. Why do we need the transcriptome? Well, we need it for the exon boundaries, right? Because if we are sequencing mature RNA, the introns are spliced out. So our aligner needs to be told that if it finds a read which overlaps exon 1 and exon 2, then exon 1 and exon 2, although physically far apart on the genome, are actually right next to each other in the mature messenger RNA. That is why we also need the transcriptome for alignment.
Okay, so this is more or less where I wanted to stop today, because next we are going to set up our own genome for Saccharomyces cerevisiae. It is a 12 megabase genome, so it's really short, but it is a eukaryote, which means that it still has introns and exons. It has 16 chromosomes, and it is the first eukaryote which was sequenced. The reference strain, and this is very important, is S288C. So this is the reference strain that we have, and if we are going to align RNA sequencing reads, we should align reads from the same strain to this reference. Because if we take other strains, which have mutations or insertions or big deletions, we might run into issues when we do the RNA-seq alignment. But this is the one that we're going to use. So we are going to set up our genome. And the big issue is that when we look at the Ensembl database, Saccharomyces cerevisiae only has the top level available. So it only has the full genome sequence with all of the junk in there; it does not have a primary assembly. That means that not only are all of the chromosomes in there, but all of the other scaffolds and patches and patch versions as well. So we need to create our own primary assembly using R. And that is where I'm going to stop today, because I've been streaming for two hours and five minutes, which is long enough on a Sunday; I still want to do some other stuff with my Sunday. Next time we are going to start here. You already have the code, right? If you look at the code online, you can see that I already had some more code there. But we are going to write our own little R script next time, and we start there, to make our own primary assembly. Then we're going to do our own transcriptome, or rather download the transcriptome and make it fit our own primary assembly. And then we are going to start aligning reads. All right, Leo, thank you for following, and thank you for the questions as well.
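The stream plans to build the primary assembly with an R script next time. Purely as an illustration of the idea, and not as that script, the sketch below uses awk to keep only the FASTA records whose name is on a wanted list. The toy file, its header names, and the `keep` list are made up for the demo.

```shell
# Toy FASTA with two "chromosome" records and one extra scaffold (hypothetical names).
fasta="$(mktemp)"
cat > "$fasta" <<'EOF'
>I dna:chromosome
ACGT
>Mito dna:chromosome
GGCC
>scaffold_1 dna:scaffold
TTTT
EOF

keep="I Mito"   # names of the records to retain, space separated

# For each header line, take the first word without the leading ">" and
# decide whether to print this record; sequence lines inherit that decision.
awk -v keep="$keep" '
  BEGIN { n = split(keep, names, " "); for (i = 1; i <= n; i++) wanted[names[i]] = 1 }
  /^>/  { id = substr($1, 2); printing = (id in wanted) }
  printing
' "$fasta" > "$fasta.primary"

cat "$fasta.primary"
```

On a real top-level file the `keep` list would hold the chromosome names found in that file's headers, so those headers should be inspected first (for example with `grep '^>'`).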
Like, it's always nice when people are actively following in chat. Good. So I'm just going to scroll very quickly to the end. We're all set, right? We have all of the tools. Next time we'll go to the first step. So we're going to do our first RNA alignment. We are going to extract our RPKM values. We are going to test differential expression. We are going to build a flexible pipeline with our scripts. And we're going to add a little bit of automated QC.
How do we get the updates for the next time? How do you mean, the updates? You mean when I'm going to stream? Just subscribe to my channel. And I updated the GitHub, Leo, so it should be there now. So if you just press F5, let me get the code. I think I updated it live, so it should be there. So, yeah, just subscribe to my channel and the streams will appear. I will probably plan some dates, probably in two weeks, probably Sunday, so not next week, but the week after next week. I'm a little bit busy, so I will have to do some other stuff. But I think all of the code that I added should be at the bottom. So for anyone watching, I'm going to put the link in chat again, so that you guys have it. It's also down in the description. So all of the code is there.
So like I said, what are we going to do next time? We're going to do our first alignment. We're going to do RPKM values, a little bit of differential expression testing, and then we're going to start building up this flexible pipeline where we can swap out different tools. So if we want to do alignment of DNA, we might want to use BWA. If we want to do RNA alignment, we might want to use STAR, or you might want to use a completely different aligner altogether. All right. And then we're going to do the automated QC, because, of course, we have no real tools yet to look at the quality of the reads. All right. Been updated. Couldn't get past the extraction of the gzip file.
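Since RPKM values are on the list for next time, here is a small worked example of the standard definition: reads per kilobase of transcript per million mapped reads. It is written in Python as a sketch, the function name is hypothetical, and the numbers are made up.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene length per million mapped reads."""
    # 1e9 combines the two scaling factors: 1e3 (bp to kb) and 1e6 (reads to millions).
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up numbers: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000)
print(value)  # 25.0
```

Dividing by gene length makes long and short genes comparable within one sample, and dividing by the total number of mapped reads corrects for sequencing depth between samples.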
Yeah. Yeah. No, it should be fine. All right. So that's it. Thank you guys so much for watching today. For part two, I told you guys what we will do. And I wish you guys a very, very happy Sunday. Thank you so much for being here. And like, subscribe and favorite, of course. It really helps out with discoverability. And if you think that other people might be interested, feel free to promote it and tell people about it, because that really helps. It's really difficult for these kinds of things to be recommended automatically on YouTube, because YouTube is very bad at figuring out who's a bioinformatician for some reason. So all right. Thank you so much. Enjoy the rest of your Sunday. And I will see you guys probably in two weeks for the second part. Then we will really start doing some RNA-seq alignment. So thank you for being here and see you next time.