PTHP: Index for Optimizing Genome Assembly Overlapping and Read Alignment

Authors

  • Sherif Magdy Mohamed Abdelaziz Barakat, School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Malaysia
  • Roselina Sallehuddin, School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Malaysia
  • Siti Sophiayati Yuhaniz, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
  • Raja Farhana Raja Khairuddin, Department of Biology, Universiti Pendidikan Sultan Idris, Tanjung Malim, Malaysia
  • Yusliza Yusoff, School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Malaysia

DOI:

https://doi.org/10.15379/ijmst.v10i1.2690

Keywords:

Data structures; Genome assembly; Genome indexing; Genome assembly overlapping; Genome assembly read alignment; Genome assembly performance; Genome prefix tree index; Genome hash index

Abstract

Sequencing technology can only read a genome as a massive number of short strings called reads. Genome assembly reconstructs the complete genome from these reads, either from the overlaps between the reads (the de novo approach) or by aligning the reads to their positions in an available reference genome (the reference-guided approach). In both cases, millions of reads must be searched for overlaps or alignments, a well-known data structure problem called all-against-all. Many studies have proposed indexing schemes, such as hash indexes and prefix tree indexes, together with parallelization techniques, to optimize overlapping or read alignment individually. However, because of the massive volume of data and the presence of repeats, limitations still affect index efficiency, and further enhancement is needed. This article introduces a new hybrid index, the Prefix Tree Hash Partitioned (PTHP) index, which combines a prefix tree index, a hash index, the pigeonhole concept, and parallelization. On simulated and real datasets, the PTHP index shows significant improvements: it reduces the computational time complexity of overlapping and read alignment, and thus the assembly time, outperforming both the prefix tree index and the hash index. Improving both overlapping and read alignment with the PTHP index also yields strong results in optimizing hybrid genome assembly, which combines the two approaches.
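
To make two of the ingredients named above concrete, the following Python sketch pairs a plain hash index with the pigeonhole concept for read alignment. It is not the PTHP index and does not reproduce the authors' design. It only illustrates the general idea that a read with at most k mismatches must contain at least one exact match among k + 1 disjoint seeds. The function names (`build_hash_index`, `align_read`), the fixed seed length, and the substitution-only (Hamming) mismatch model are all assumptions made for illustration.

```python
from collections import defaultdict


def build_hash_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, index, seed_len, max_mismatches):
    """Return sorted reference positions where the read matches with at
    most max_mismatches substitutions (hypothetical helper)."""
    # Pigeonhole: k mismatches can damage at most k of k + 1 disjoint
    # seeds, so at least one seed must occur exactly in the reference.
    n_seeds = max_mismatches + 1
    if n_seeds * seed_len > len(read):
        raise ValueError("read too short for the requested seeds")
    hits = set()
    for i in range(n_seeds):
        offset = i * seed_len
        seed = read[offset:offset + seed_len]
        # Exact lookup of the seed in the hash index.
        for pos in index.get(seed, ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Verify the full read at the candidate position.
            window = reference[start:start + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_hash_index(reference, seed_len=4)
# Read taken from position 4 of the reference, with one substitution.
read = "ATGTTAGCCGAT"
positions = align_read(read, reference, index, seed_len=4, max_mismatches=1)
```

Only candidate positions suggested by an exact seed hit are verified, which is what lets an index avoid comparing every read against every position. How PTHP partitions the data and combines this with a prefix tree and parallel execution is described in the full article, not here.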

Published

2023-07-13

How to Cite

[1]
S. M. M. A. Barakat, R. Sallehuddin, S. S. Yuhaniz, R. F. R. Khairuddin, and Y. Yusoff, “PTHP: Index for Optimizing Genome Assembly Overlapping and Read Alignment”, ijmst, vol. 10, no. 1, pp. 958-972, Jul. 2023.