Taxonomic classification of metagenomic sequences from Relative Abundance Index profiles using deep learning


KARAGÖZ M. A. , NALBANTOĞLU Ö. U.

Biomedical Signal Processing and Control, vol.67, 2021 (Journal Indexed in SCI)

  • Publication Type: Article
  • Volume: 67
  • Publication Date: 2021
  • DOI Number: 10.1016/j.bspc.2021.102539
  • Title of Journal: Biomedical Signal Processing and Control
  • Keywords: Convolutional neural networks, Metagenomics, Sequence analysis, Taxonomic classification, PHYLOGENETIC CLASSIFICATION, RIBOSOMAL-RNA, GUT MICROBIOME, ALIGNMENT, IDENTIFICATION, ASSIGNMENT, TOOL

Abstract

© 2021 Elsevier Ltd. We propose a Convolutional Neural Network (CNN) approach based on k-mer representation for the metagenomic fragment classification problem. The proposed model consists of two steps: the first represents DNA by k-mer frequency using the Relative Abundance Index (RAI), and the second classifies metagenomic fragments with a CNN. RAI scores, serving as DNA fragment representations, are fed to CNN classifiers (CNN-RAI). RAI consists of the over- and under-abundance statistics gathered from the taxon for each k-mer. To compare the performance of CNN-RAI with that of RAIphy, which classifies metagenomic fragments from the same input attributes using an expectation-maximization based approach, databases covering different metagenomic scenarios were tested. Metagenomic data generated (or simulated) by different Next-Generation Sequencing platforms, namely Illumina technology and Oxford Nanopore MinION, were compiled into shotgun metagenomics or 16S rRNA datasets. The RAI-based method and the CNN models were trained at the genus level on the represented data, with read lengths ranging from 200 to 10,000 bp and k-mer sizes of 3≤k≤7. The RAI score was used for the first time in a deep learning algorithm as a spectral representation, with improved performance, attributable to the capability of deep learning, on each dataset over a range of parameters. The proposed representation was compared with current spectral methods and shown to be competitive on all datasets used in this study.
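
The abstract does not give the RAI formula, so the following Python sketch only illustrates the general idea of a k-mer based over- and under-abundance profile. It assumes a simple stand-in score, the log2 ratio of a k-mer's observed frequency in a taxon to the frequency expected if bases were independent; this is not the published RAI definition. The function names (`kmer_counts`, `log_odds_profile`, `fragment_vector`) and the pseudocount are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping windows that contain a non-ACGT symbol."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in ALPHABET for c in kmer):
            counts[kmer] += 1
    return counts


def log_odds_profile(taxon_seq, k, pseudocount=1.0):
    """Per-k-mer log2(observed / expected) for one taxon.

    Assumption: the expected frequency comes from independent base
    frequencies (a zero-order model). This is a stand-in score, not
    the RAI formula used in the paper.
    """
    counts = kmer_counts(taxon_seq, k)
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts.values()) + pseudocount * len(all_kmers)

    # Smoothed single-base frequencies of the taxon sequence.
    base_counts = Counter(c for c in taxon_seq.upper() if c in ALPHABET)
    base_total = sum(base_counts.values())
    base_freq = {b: (base_counts[b] + 1) / (base_total + 4) for b in ALPHABET}

    profile = {}
    for kmer in all_kmers:
        observed = (counts[kmer] + pseudocount) / total
        expected = math.prod(base_freq[b] for b in kmer)
        profile[kmer] = math.log2(observed / expected)
    return profile


def fragment_vector(fragment, profile, k):
    """Fixed-length (4**k) vector: fragment k-mer count times the taxon score,
    in lexicographic k-mer order. A hypothetical input layout for a classifier."""
    counts = kmer_counts(fragment, k)
    return [counts[kmer] * profile[kmer] for kmer in sorted(profile)]


if __name__ == "__main__":
    taxon = "ACGT" * 50
    profile = log_odds_profile(taxon, k=2)
    vec = fragment_vector("ACACGT", profile, k=2)
    print(len(profile), len(vec))
```

In this sketch a positive score marks a k-mer that is over-abundant in the taxon relative to the independence baseline, and a negative score marks an under-abundant one. A vector of length 4^k per fragment is one plausible way to present such scores to a classifier; the actual CNN input layout used in the paper is not described in the abstract.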