Faster Sorting of Aligned DNA-Read Files
- Subject: Efficient Evolutionary Bioinformatics
- Type: Bachelor's thesis
- Supervisor: Lukas Hübner, Alexandros Stamatakis
- Student: Dominik Siebelt
When DNA sequencing data is analyzed to find disease-causing mutations, to understand evolutionary relationships between species, or to find variants, DNA reads are compared to a reference genome. A reference genome is a representative example of the set of genes of a species. Sorting these aligned DNA reads by their position within the reference sequence is a crucial step in many of these downstream analyses. SAMtools sort, a widely used tool, performs external memory sorting of aligned DNA reads stored in the BAM format (Binary Alignment Map). This format allows for compressed storage of alignment data. SAMtools sort provides the most comprehensive set of features while exhibiting demonstrably faster execution times than its open-source alternatives. In this work, we analyze SAMtools sort for sorting BAM files and propose methods to reduce its runtime. We divide the analysis into three parts: management of temporary files, compression, and input/output (IO). For the management of temporary files, we find that the maximum number of temporary files SAMtools sort can open concurrently is lower than the maximum number of open files permitted by the operating system. This results in an unnecessarily high number of merges of temporary files into larger temporary files, introducing overhead because SAMtools sort performs extra write and compression operations. To overcome this, we propose a dynamic limit for the number of temporary files that adapts to the operating system’s soft limit for open files. For compression, we test seven different libraries for compatible compression and a range of compression levels, identifying options that offer faster compression and result in a speedup of up to five times in single-threaded execution of SAMtools sort. For IO, we demonstrate that a minimal level of compression avoids IO overhead, thereby reducing the runtime of SAMtools sort compared to uncompressed output.
However, we also show that uncompressed output can be used when pipelining SAMtools commands to reduce the runtime of the subsequent commands. Together, our proposed modifications to SAMtools sort and to user behavior have the potential to achieve speedups of up to a factor of 6. This represents an important contribution to the field of bioinformatics, considering the widespread adoption of SAMtools sort, evidenced by its over 5,000 citations and over 5.1 million downloads through Bioconda.
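The idea of a dynamic temporary-file limit described in the abstract can be illustrated with a minimal Python sketch. It is not the SAMtools implementation (SAMtools is written in C), and it assumes a Unix-like system. The function names `max_temp_files` and `consolidation_merges`, the `reserved` descriptor count, and the `fallback` value are hypothetical choices made for illustration. The merge model is deliberately simplified: whenever the limit of open temporary files is reached, all of them are merged into one larger temporary file.

```python
import resource


def max_temp_files(reserved: int = 16, fallback: int = 64) -> int:
    """Derive a temporary-file limit from the soft limit for open files.

    `reserved` (assumed value) leaves descriptors for input, output, and
    standard streams; `fallback` (assumed value) is used if the soft limit
    is reported as unlimited.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return fallback
    # Merging needs at least two files open at once.
    return max(2, soft - reserved)


def consolidation_merges(n_runs: int, max_files: int) -> int:
    """Count intermediate merges in a simplified model of external sorting.

    Each sorted run is written to its own temporary file. When `max_files`
    temporary files already exist, they are merged into a single larger
    temporary file before the next run is written.
    """
    if max_files < 2:
        raise ValueError("max_files must be at least 2")
    open_files = 0
    merges = 0
    for _ in range(n_runs):
        if open_files == max_files:
            open_files = 1  # all existing files collapse into one
            merges += 1
        open_files += 1
    return merges


if __name__ == "__main__":
    limit = max_temp_files()
    print("temporary-file limit derived from soft limit:", limit)
    print("intermediate merges for 1000 runs:", consolidation_merges(1000, limit))
```

In this model, a higher limit directly reduces the number of intermediate merges, each of which rewrites and recompresses data that has already been written once.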