Hello and welcome, everybody, to another talk in the Bytesize Talk series offered by the nf-core community. We should mention that it's receiving support from the Chan Zuckerberg Initiative, so we're thankful for that. Today, Phil Ewels is back and he will tell us more about MultiQC and how to customize MultiQC reports, for example for your own pipeline. So thanks for joining us today, Phil.

Thank you for having me. Sorry that I've ended up doing two Bytesize talks in two weeks; there's been a bit of a reschedule shuffle. I hope you won't be too tired of my voice already. So today's talk is a bit of a break from what I've spoken about previously in Bytesize, in that it doesn't really talk about nf-core at all. This talk is purely about MultiQC, which is one of my other pet projects that I've been working on for a few years now. But MultiQC is used very heavily within the majority of nf-core pipelines, so we figured it's a relevant topic for most nf-core developers, certainly, but also for people using nf-core pipelines as well.

Today I'm going to start off with a quick introduction, just for those people who might be watching who have no idea what MultiQC is. Then I'll talk about a few tips for people developing pipelines, recommendations to get the most out of MultiQC, and then a few recommendations for people who are running nf-core pipelines. Usually this is most relevant for people working in facilities or large-scale routine processing places, but of course it can be used by anyone.

So what is MultiQC? Basically, MultiQC is there to help this little guy who's sat wading through hundreds and hundreds of text files at the end of his or her analysis, all these log files in the terminal, trying to work out whether the analysis worked or not, and also trying to work out if there are any bad samples in the project.
What it does is take all of those text files and visualize them within a report. So you get a nice, shiny graphical thing that is more human-readable, and you can hopefully see at a glance roughly how everything's gone and whether there are any samples which might need a closer look. It supports multiple different bioinformatics tools in a single report, 115 I think, or something like that, at the moment. So the vast majority of commonly used bioinformatics tools are represented out of the box. It also handles multiple samples. So if you have five samples in your project or 500 samples in your project, MultiQC will suck up all those different log outputs and summarize all of that into one single report for you.

As well as the HTML report that it generates, MultiQC also spits out a bunch of other files, which give you a nice standardized output. Bioinformatics tools are famous for lacking standards in file formats, so MultiQC does some legwork for you. It gives you tab-separated files by default, but you can choose to have YAML or JSON as well. And all of the 115 different bioinformatics tools will produce output which is in roughly the same flavor, so it's useful for downstream processing as well.

MultiQC is written in Python, so it's pretty easy to install if you've got Python set up on your system, using the Python Package Index: pip install multiqc. It's also in conda, or you can use it with Docker or Singularity. There's a Galaxy app for it. In most places where you're already running software you'll find MultiQC; there's a Debian installation and all sorts. Then to run MultiQC, you call the multiqc command, which is the tool name, and it needs a minimum of just one argument, which is a file path. In this case, I've given it a dot, which just means the current working directory in the terminal.
MultiQC will then recursively look through every file and folder in that path and see what it can find. Anything that's not a log file, or that it doesn't recognize, it will just ignore. So it's been designed from the ground up to work with analysis pipelines, where you have all of your results in a folder and then you just run MultiQC on that folder and it will find what's relevant. If you want, it can also take explicit file names, and as many different paths as you want, if that's better for your setup. And that's it. Once you've run MultiQC, it will tell you it's generated an HTML report, and then it's up to you, the human, to do the difficult bit, which is to look at that report, understand what it's telling you and continue with your analysis.

I started MultiQC back in 2016, I think, and it has wildly exceeded expectations. I was looking this up yesterday. If any of you follow me on Twitter, you might have seen that we've just passed 2,000 paper citations for the MultiQC paper, which is just utterly mind-blowing. I certainly didn't set out with any expectations of this. It was just an internal tool that we needed at SciLifeLab for our own internal QC. So it's very humbling that MultiQC has reached so many people and helped so many people. There you go, 114 different bioinformatics tools are supported, with more coming in all the time.

Those citations I find quite terrifying, if I'm honest. You can see the graph going up and up, and it always makes me very scared to push a new release, because I always think about all the people using it. What happens if I've broken something? Or worse, what happens if I find out that something has been broken for the past three years and all these citations are wrong? But anyway, that's maintainers' nightmares. And if ever I'm slow to respond to you when you've opened an issue or a pull request on MultiQC, this is my defence.
We've had just over a thousand issues created on GitHub now for MultiQC. Nearly 150 of them are still open and need closing, and there have been over 500 pull requests, so people contributing code, and people's contributions account for the majority of tools supported now. So it's really a collaborative effort. I'm the gatekeeper and I hold all the keys, so it has to get past me to get into MultiQC, but most of the code is not written by me anymore. And again, there's always a long list of pull requests open, because it takes me quite a long time to go through them. That sounds like a lot. It is. I worked out how many days it's been since the first commit to the MultiQC repo, and it works out at about one issue every couple of days. So it's a lot to go through. Please be patient; I do my best.

Right, that's the introduction. So you're happy with what MultiQC is. You've written an nf-core pipeline. MultiQC is working, but what tips and tricks can you use to really squeeze the most out of your MultiQC reports? Everything I'm going to describe is in the documentation, by the way. So go to multiqc.info and you'll find all of this and a lot more. I'm mostly just going to pick out a few things for you to go and look up if they sound interesting.

An easy one to start with is optimizing how fast MultiQC is to run. Generally, MultiQC runs within a few seconds for most things, but if you are running a lot of modules and you've got large numbers of samples, it can start to take a few minutes or, in extreme situations, up to an hour. So it can be nice to tune that as much as possible, and there are a couple of things you can do very easily to do that.
Firstly, I would recommend running MultiQC yourself with this extra flag, --profile-runtime. That will add an extra section to your report where MultiQC has an introspective look at itself and works out what it's been doing. In the log, it will tell you how long it took to run and how long it spent doing different things. In this example here, you can see the vast majority of the time was spent looking through the different files it was given and trying to find which ones are relevant. Once it had that file list, running the modules and generating the report was actually quite quick.

Then within the MultiQC report, you get plots like this, which tell you how fast or how slow different search patterns were. MultiQC has a bunch of different ways to find relevant input files. The simplest is by a file name pattern. If a tool always gives the same suffix for its output files, they're dead easy to find: you can just search through the file list and find them that way. But many, if not most, tools don't do this. It might just be standard output logged from the terminal, or you can call your summary file whatever you want. So then MultiQC has to look within the file contents to find those files, and that can be a bit slow. Picard, you can see here, is often one of the worst culprits. It's got lots of different outputs it can find, so there are lots of different search patterns, and each one of these has to look through each one of your files to see if there are any matching strings. So here you can see what was run and what the main culprits are in terms of slow searching, and then you know what to focus on. And once you've figured out what's actually taking time, what do you do about it?
Firstly, especially within the context of writing a pipeline, it's very easy to tell MultiQC: okay, you're only going to get output from these tools. Don't bother looking for Picard output, because I'm not running Picard. That speeds things up quite a bit. Then you can optimise those search patterns I mentioned. Lots of modules have sub-modules: Picard is one tool that has about, I don't know, 15 different sub-tools, so you can disable search patterns for the stuff that you're not running. You can also use file name patterns. Maybe the tool doesn't have a constant suffix, but within the pipeline you do always have a predictable file name. So you can tell MultiQC to use that file name to find files instead, overwriting the default search pattern, and that can speed things up a bit. There's a section of the documentation I've linked to here which talks through all the same stuff, so go and have a look if that sounds interesting.

Okay, that's the boring stuff; that's just optimisation. I had a quick look through a couple of nf-core pipelines to see what was frequently set within the MultiQC configs, and I've picked out a few common things which make sense. Then on the next slide I've got some stuff which I haven't seen so much of, which might be nice.

So let's start off with the common stuff. One of the most frequent things that people want to do is change the default order of the different sections within a report, and that's quite easy to do. You have a config file in YAML, you define this key, top_modules, and you say: okay, these are the modules I'm most interested in, in this order. MultiQC will put those modules in the order you specify, and it will still run everything else after that. So if you just want FastQC at the top, you just set top_modules to fastqc and that will float to the top.
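To make the earlier point about search patterns a bit more concrete, here is a minimal Python sketch of the two strategies: matching on file name versus looking inside file contents. It is not MultiQC's actual implementation; the function names, the file names and the 50-line read limit are all assumptions made up for illustration.

```python
import fnmatch
import tempfile
from pathlib import Path


def find_by_filename(files, pattern):
    """Cheap search: only the file name is inspected, nothing is opened."""
    return [f for f in files if fnmatch.fnmatch(f.name, pattern)]


def find_by_contents(files, needle, num_lines=50):
    """Expensive search: every candidate file is opened and partially read."""
    matches = []
    for f in files:
        with open(f, errors="ignore") as handle:
            # Read at most num_lines lines from the top of the file.
            head = [line for _, line in zip(range(num_lines), handle)]
        if any(needle in line for line in head):
            matches.append(f)
    return matches


# Tiny demo with two throwaway files (names and contents are made up).
workdir = Path(tempfile.mkdtemp())
(workdir / "sampleA.dup_metrics.txt").write_text("## METRICS CLASS picard.sam.DuplicationMetrics\n")
(workdir / "sampleA.log").write_text("nothing of interest here\n")
files = sorted(workdir.iterdir())

by_name = find_by_filename(files, "*.dup_metrics.txt")
by_contents = find_by_contents(files, "DuplicationMetrics")
print([f.name for f in by_name])
print([f.name for f in by_contents])
```

Both searches find the same file here, but the second one has to open every file it is given, once per search string. That is roughly why a predictable file name in your pipeline, combined with a file name pattern, can cut the search time.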
If you want some more nitty-gritty detail, you can specify the module_order config, which has a whole bunch of different subkeys. Again, you can use this to order the modules. You can also use it to run a single module multiple times with a file name filter. This is most commonly used for FastQC, for example. If you're running FastQC twice, before and after trimming, you can tell MultiQC to run the same module twice, but each time only on a different subset of files. You can also overwrite things like the title of the module and a bunch of other things in here.

One of the most difficult things that MultiQC has to do is work out the name of each sample. There's no idealized situation where we just magically know what your sample identifiers are. We have to make our best guess. Usually that's by looking at either the file name of a log, or by trying to find the input file name and basing it on that. But of course if you have .fastq or .bam or whatever, you always have different extensions, and then they look like different identifiers. So MultiQC tries to get rid of those standardized extensions so that you end up with that core identifier, and then everything lines up nicely across the different modules, especially in that top table called General Statistics. But it's generalized, so we have to do our best, and sometimes different pipelines add on different extensions. So if you see that happening, especially in General Statistics, where rows aren't lining up or you see duplicate samples which should be just one, you can tell MultiQC what your custom extensions are in this config and clean them up. Then you get really nice, clean, short sample identifiers with no additional cruft.

Here's one that some people get really annoyed by. MultiQC has to deal with massive numbers of samples, like I say, everything from one or two samples up to thousands, and tables get really unhelpful when they're super, super long.
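As a rough illustration of the sample name cleaning described above, the Python sketch below truncates each name at the first occurrence of a known extension. This is a simplified model written for this transcript, not MultiQC's own code; the extension lists and file names are hypothetical.

```python
def clean_sample_name(filename, extensions):
    """Truncate the name at the first occurrence of each known extension."""
    name = filename
    for ext in extensions:
        name = name.split(ext, 1)[0]
    return name


# Illustrative defaults only, not MultiQC's full built-in list.
default_exts = [".fastq.gz", ".bam", "_fastqc"]
# Hypothetical suffixes that one particular pipeline adds to its outputs.
pipeline_exts = default_exts + ["_trimmed", ".sorted"]

logs = ["sample1_trimmed_fastqc", "sample1.sorted.bam", "sample1.fastq.gz"]

names_default = {clean_sample_name(f, default_exts) for f in logs}
names_custom = {clean_sample_name(f, pipeline_exts) for f in logs}
print(len(names_default), "distinct names with the default extensions")
print(len(names_custom), "distinct name once the pipeline suffixes are added")
```

With only the default list, the three logs end up as three different sample names, which is exactly the situation where rows stop lining up in General Statistics. Adding the pipeline-specific suffixes collapses them to a single identifier.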
You can no longer summarize and take an overview, which is the whole point of MultiQC. So by default, when a table gets to, I think 500 rows is the default, something like that, MultiQC will, instead of doing a table, generate what's called a beeswarm plot, which is basically like a dot plot. If you find that really annoying, you can push up the threshold at which that switch happens to effectively disable beeswarm plots, and a few people have done that within nf-core pipelines.

Right, here's some stuff I didn't find, which I thought might be nice to have. So take note, developers, even if you think you already know everything there is to know about MultiQC. One of the things that MultiQC does by default, at the top of every report, is say when you ran it and show the input paths that you gave it, so the directory where you told it to search for files. Now for Nextflow, because analysis always runs within temporary work directories, the place it runs is usually not very interesting at all. It's just going to be work and then some long hash identifiers. So it might be nice just to turn that off: you can set show_analysis_paths to false and MultiQC will not print that at the top of the report.

By default, in the nf-core template, we have a report_comment at the top saying this report was generated by this pipeline. But you can also go further than that. You can add comments to specific modules within your report, and you can add as much or as little detail as you like here. This is a great way of documenting the results of your custom pipeline. We have the documentation on the nf-core website, sure, but you can embed stuff within the report here, so that when people are reading through, you can say: in this pipeline we're running this tool in this way, and this is what you should look for. More documentation is always better, so yeah, let's see some section_comments in there. That'd be great.
We don't really ever seem to customize the report logo. I was thinking that would be something that's easy to do: stick the nf-core pipeline logo up at the top of the report, if we wanted to.

And then, customizing the plots themselves. MultiQC is designed to be very extensible and very customizable, and that extends to every single plot. If you know the identifier for the plot that you're interested in, you can tell MultiQC: actually, I want this to be the title; actually, I want these to be the axis labels. You can customize pretty much every aspect of the plots, even when they're coming from a built-in module. So you might be able to tweak certain things here and there to make them more understandable and better suited to your outputs. On a similar line, you can also customize the tables. Maybe you have percent duplicates reported twice from two different tools and you only want it once, or something is not useful because of this or that. You can tell MultiQC to ignore or hide certain columns within your tables, which might be good.

Something else which is used quite a lot within nf-core, and has actually been a wildly successful feature of MultiQC, is the ability to inject custom report sections without needing to write a module, so without needing to write any Python code. This is called custom content, and it would typically be output from pipeline scripts. Maybe you've written a custom R script or Python script within your workflow, so it's not a general tool outside of the pipeline. If it were, it'd be better to write that as a MultiQC module so that everyone can benefit from it. But if it's just a really specific niche thing, then you generate the output and you have control of it, so you can insert that into the MultiQC report using custom content.
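A pipeline script that feeds custom content could look something like the Python sketch below. The counts, identifiers and titles are hypothetical, and the key names (id, section_name, description, plot_type, pconfig, data) are written from memory of the custom content documentation, so treat this as a starting point and check it against the docs and the example files in the test data repository.

```python
import json
from pathlib import Path

# Hypothetical per-sample counts produced by a pipeline-specific script.
counts = {
    "sample1": {"kept": 950, "discarded": 50},
    "sample2": {"kept": 700, "discarded": 300},
}

section = {
    "id": "my_filter_stats",  # hypothetical section identifier
    "section_name": "Custom read filtering",
    "description": "Reads kept and discarded by the pipeline's own filtering script.",
    "plot_type": "bargraph",
    "pconfig": {
        "id": "my_filter_stats_plot",
        "title": "Custom filtering: reads kept",
        "ylab": "Number of reads",
    },
    "data": counts,
}

# The _mqc suffix in the file name is what marks this file as custom content.
out_file = Path("my_filter_stats_mqc.json")
out_file.write_text(json.dumps(section, indent=2))
```

Because the data is keyed by sample and the file stays small, this avoids the one-image-per-sample problem: the report gets a single interactive bar graph rather than a pile of embedded pictures.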
It can be a config file, it can be JSON, it can be custom HTML, it could be images if you want, though I generally dislike having images in MultiQC reports because they really bloat the HTML file size. So if you do use images, please make sure you don't have one per sample, because that will quickly crash the browser that tries to open the report. Basically, most of the time all you have to do is append _mqc.json, or .yaml, or whatever the file format is, to your file name. Then as long as your file content looks roughly right, MultiQC will try and figure out what to do with it. You can also configure lots of stuff, so again you can tweak things and make all the plot axes and titles exactly as you want. There are different ways to do that with different file formats. Check the documentation, and especially check this repo, which has the test data that MultiQC uses. Custom content is difficult to document because you can do anything, so you can't document everything. But what I do have in this repository is lots of different examples that I've made over time, so you can dig around, find different ways of doing things and model your custom content on that.

Right, that was all for people developing pipelines. What about if you're running an nf-core pipeline? What can you do to tweak your own personal MultiQC reports, separate from the rest of the nf-core pipeline community? Basically, all the nf-core pipelines, because it comes in the template, have a pipeline parameter called multiqc_config, and using that you can give a custom YAML file. It's important to say that this is additional to the config which ships with the pipeline. So the pipeline might be doing its own configuration stuff, and then you can add your own config on top of that. They work together.

So you can do stuff like conditional formatting, for example. That's something we use at NGI. So in-house, if you're running the same pipeline for the same data type, you might say
samples fail if they have under 80% alignment, and I want to flag those so that they stand out nicely in red, and maybe warn on stuff which is between 80% and 90% alignment. That's easy to do. For any table in a MultiQC report you can have these conditional formatting rules: you just get the identifier for that column and set up the different rules.

You can add project-level information. So if you are generating MultiQC reports from a LIMS, for example, or you have your own custom analysis, you might want to say: okay, this project was called this. You might want to add some comments about what exactly it was you did, or even put in custom sample names which are different to the identifiers that MultiQC finds. I'll show an example of this in a second.

Then you can also style the report. You can put in a custom logo, as I mentioned earlier. So if you want to have your institute logo in the corner of the MultiQC report, no problem. As of the last release you can actually now just have a custom CSS file, so if you know a little bit of web development, you can style stuff completely differently, have different background colors, and just hack on the default MultiQC template quite easily with a little bit of additional CSS. And if you want to take it another step further, you can actually develop your own entire template and supply that to MultiQC, so a different Jinja template, and really change what goes into the report and how it's rendered.

So, a quick example of some customization. This is an example report which you can actually see on the MultiQC website. If you go to the top menu, under examples, it's the NGI one. It's pretty close to the reports we generate at SciLifeLab, at the NGI where I work, and these are some of the things that we've done in our config to add additional information into the report which is useful for our users. Again, this happens on top of the nf-core pipelines.
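Going back to the conditional formatting thresholds mentioned above (fail under 80% alignment, warn between 80% and 90%), the Python sketch below shows one way such rules can be evaluated. It is only loosely modelled on MultiQC's table conditional formatting: the rule structure, the "last matching label wins" behaviour and the sample values are assumptions for illustration, so check the documentation for the real config syntax.

```python
def classify(value, rules):
    """Return the last matching label, so later entries in `rules` take priority."""
    label = None
    for name, conditions in rules:
        for cond in conditions:
            if "lt" in cond and value < cond["lt"]:
                label = name
            if "gt" in cond and value > cond["gt"]:
                label = name
    return label


# Thresholds from the talk: fail under 80% alignment, warn between 80% and 90%.
alignment_rules = [
    ("pass", [{"gt": 90}]),
    ("warn", [{"lt": 90}]),
    ("fail", [{"lt": 80}]),
]

# Made-up alignment percentages for three samples.
for sample, pct in {"sample1": 95.2, "sample2": 85.0, "sample3": 62.4}.items():
    print(sample, pct, classify(pct, alignment_rules))
```

Listing the rules from least to most severe means a sample at 62.4% matches both the warn and the fail rule, and ends up flagged as fail.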
So the most obvious one is that we add a title. In this case we have a project identifier and a nice title, and that's done with the config attribute title. We have a subtitle under there with a little bit more information. I've removed identifying information here, but in our case this would normally be a project title where the PI has said this is what the project is about. And here we have a report comment, which is similar but just a longer format with slightly different styling. Here it actually comes from the pipeline (this example is pretty old, it pre-dates nf-core), so the Nextflow pipeline has added this, but you could customize it to be whatever you want with report_comment. We've put in a logo, and with that logo there's also a URL and a title. So if you hover over it, it shows the title, and if you click it, it will take you to the custom URL, which in this case is the homepage. And we've got this little panel here of custom information, which is called report_header_info, and this can be any key-value pairs you want. This ties in really well with a LIMS or something, if you have custom report-level information that you want to show, just some summary information.

You might also notice there are a couple of extra buttons up at the top here. That has been done with something called --sample-names, where you give MultiQC just a tab-separated file with all your expected sample identifiers and then alternative sample names. The column headers then form buttons at the top. If I click, in this case, user-supplied names, which is something custom I've labeled it, then you see all the sample names down there switch. So by default we have NGI identifiers, which is what's useful for us, but our end users might not really know what those are. They can click that button and see all the sample names that they supplied to us, really quickly and really easily. And all that does is just pre-populate the MultiQC
toolbox really quickly with lots of different sample name replacements. That's easy to do and can be very, very helpful. And then of course, if you really want to, this is an example of going to town with customizing your report output, just to give you a flavor of what's possible if you really, really go for it. This is a little Easter egg in MultiQC, so see if you can figure that out with --template.

Okay, I won't be too much longer; I'm running over a bit, sorry. Looking to the future, there are a couple of things to look forward to with MultiQC. Those of you who have heard me talk before might recognize some of these slides. Most of this stuff has been planned for MultiQC since about 2018 or 2019, which by coincidence is around the time that another one of my projects started taking off, and that one's called nf-core, and it sucked up some of my time. Anyway, this is stuff which is being actively worked on and will happen, and this is stuff I'm excited about.

To kick us off: basically refactoring the code base so that it works more as a Python library rather than purely a command line tool. Now, if you're using Jupyter notebooks or custom Python scripts, you can import multiqc and run it like this, in a programmatic way, on a folder, and it will generate a report. What you can't do yet, and what I want to be able to do, is generate a MultiQC report object and then pull out specific stats and specific plots on demand, and make use of the internal functionality that's there. At the moment that's a bit tricky, but I'm hoping to get there soon, so it'll be a really useful interactive or script-based analysis tool as well as a command line tool.

And then the other big one is MegaQC, which is my poor forgotten child that has been a bit abandoned, but despite my best efforts to ignore it, it is being picked up by others in the community and is being actively
developed by a small but slowly growing core of end users across the world. Michael Minton in the States is probably one of the key contributors, and also core, um, silly norvernan in Norway.

Anyway, MegaQC, what does it do? When you run MultiQC, you get one report, and that's frozen in time. You've got the samples you ran it on in your project, and that's it. But many people are running in a facility, doing clinical work or whatever. You're running MultiQC the whole time, hundreds of times a day, and you're generating this longitudinal data, and you want to track things across projects. You can't do that in MultiQC alone, but there is this companion tool, MegaQC, which is a continuously running web server tool. When you run MultiQC, it can send the results to this tool as a JSON file over an API, and all of that is then stored in a database for you to interactively query, view and plot.

This is quite an old demo I did for a talk a while ago, but it shows pulling up plots which I've set up in MegaQC and saved as favorites. It has an interactive tool for generating dashboards, so this is really cool. Say you want to have a TV up in your lab or something, showing statuses, so you can keep track of whether the trend lines are working properly or whatever. You can really quickly drag and drop a quick dashboard together with your favorite plots and whip it up. That saves, and then you have a static, well, it's an HTML web page, which you can then load and play around with. You can see the different types of plots here: we've got single values plotted against one another, bar graphs, distributions and so on. So you can really get the most out of all the MultiQC data which is being found in your samples, visualize it and interrogate it. That is sort of ready to go now, but it's still being actively worked on in a big way.

Right, with that I'll wrap up, and I'm happy to take any
questions. Check out the MultiQC website; like I say, all of this is documented, so have a read through there and see if you can find anything new. All the code bases are open on GitHub, and there's a Gitter chat for MultiQC, which is a good way to get my attention for quick questions. I'm happy to listen on there. Thanks very much for listening.

Thanks a lot, Phil, for this introduction to MultiQC, and for also showing advanced tools and characteristics of MultiQC. I'm sure we all learned something today. We do have one question in the chat. They were wondering more about this example that you showed on quickly changing sample names. What kind of configs or files would we need to generate to actually change the sample names?

Right, so you can do it a couple of ways, though this is off the top of my head. I think you can do it in a MultiQC config, but the way that I would recommend doing it is with this option, --sample-names, when you run MultiQC. Like I said, it's a tab-separated file where the first column should be the identifiers which MultiQC itself is finding. In this case, we run with the LIMS, so we know when we run MultiQC that these are the samples we expect in this project, so we know those identifiers. Then in the next column along you have the equivalent names on the same row, and each column will get its own button along the top, which you will then be able to switch through.

Thinking about it now, this might be slightly difficult to do within nf-core pipelines, because this is an additional file and flag to provide to the MultiQC module. So you might need to look into doing that within the YAML config file which you can give to MultiQC. I'm pretty sure you can do it, but I would have to check to be certain. If you can't, then maybe let us know and we can look at either putting that into the nf-core module
or I can look into whether it's possible to do with a MultiQC config file.

Thanks a lot. We also have another question, from Moritz: any recommendations for large Nextflow pipelines and MultiQC? Usually we use .collect() to mix everything and pass it to MultiQC, however this can sometimes crash with many samples.

Yeah, the way that Nextflow works has always been a bit ugly, really, for MultiQC, because Nextflow is very explicit about your files and you need to stage them as inputs and everything. MultiQC works really nicely when you're running it interactively: you just have a folder and you just run MultiQC. But with Nextflow you need to be really explicit about staging those inputs. So the short answer is no, I don't have anything better than that, I'm afraid, because you need to stage them. I've talked to Paolo about this various times over the years and we've discussed ways to make it easier, but never really come up with anything better.

With MultiQC itself, if you want to give explicit file names and there are very many of them, people have run into problems with bcbio and with Galaxy and so on, where the command line gets so ridiculously long that it crashes bash or whatever environment it's running in. In that case you can put all the file names into a single text file and then use --file-list with that text file, and it will go through all of those. But that still doesn't really help with Nextflow, because you still need to stage those files, so you have to declare them as inputs. Yeah, sorry, that's the best I've got.

Right, so in that case it's not possible to just pass the whole folder?

You probably can do that, yeah. No, it's a good point. I'm mostly thinking about lots of different processes, because you need to stage each one of those process outputs in. But you're right, if you have lots of
different files then you can certainly just stick them in a directory and pass that one directory. As long as it gets staged as an input, the MultiQC command is dead easy: just do multiqc dot, because you're working within that isolated work directory. So that should work fine.

Yeah, so we should explore this for nf-core pipelines then. Okay, if you have any other questions you can also go ahead and unmute yourselves; I've just given rights for that. In the chat so far we don't have any other questions.

I'll put the slides up on... Sorry.

How long did it take you to make the 90s mode of MultiQC?

I did that back in the early days when I still had lots of free time. Actually less time than you would expect, because the default template is rendered with the Bootstrap CSS framework, and someone else had already made a Bootstrap theme, using all the right class names and everything, called Geo, for GeoCities, if you're old enough to know what GeoCities is. So I hijacked that and then just added on a bit of extra flair on top, so it actually wasn't too bad. And it's kind of nice; I do like sticking Easter eggs into software tools. A bit of MC Hammer, and everything is a mess.

Okay, yeah, what I was going to say is I'll put the slides up as a PDF onto the Bytesize Slack channel.

Yeah, perfect. It seems like there are no other questions, so thanks again. And as you mentioned, the slides will be uploaded and the talk will be available also, and we can continue any further questions on Slack. Thanks a lot, everybody.