Great! You obtained high-throughput sequencing data for the virus you are interested in! Now, how will you deal with these rather intimidating FASTQ files? I will not go over the basics of high-throughput sequencing (HTS)/NGS here; there are plenty of great tutorials on the web.

If you are here, it is probably because I (or someone else!) told you that you might find a useful PRACTICAL guide here on how to deal with HTS viral data. In this short tutorial, I will present how I usually handle HTS data for (i) resequencing/mapping/variant calling or (ii) de novo assembly. These strategies have been used in published papers of mine. You’ll find some references in the relevant sections.


I will present here the general philosophy and the tools I’m using. I cannot, however, give you a fully functioning pipeline: these things depend on the infrastructure you have at your disposal. For example, I work with a Dell Precision Tower 7810 with a total of 48 CPU threads and 64 GB of RAM, running Ubuntu 16.04.3 LTS. The code I will present here works perfectly fine on this computer. It may not, however, run as well on your setup.

This guide will thus not provide you with a turnkey solution: you will probably need to learn how to code (at least in shell) at some point… at the very least to adapt my own code to your own setup.

On the other hand, I’m neither a bioinformatician nor a coder: my code is probably not perfect. Also, you may prefer tools other than the ones I’m using, certainly for very good reasons. I chose the programs I work with based on my very own opinion (ease of use, robustness, etc.).

With that in mind, let’s proceed…

This tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.