10th International Conference on Utility and Cloud Computing (UCC) / 4th International Conference on Big Data Computing, Applications and Technologies (BDCAT), Texas, United States Of America, 5 - 08 December 2017, pp.189-197
Genomics data is unstructured and mostly stored on hard disks. It is both technically and culturally residing in big data domain due to the challenges of volume, velocity and variety. Huge volumes of data are generated from diverse sources in different formats and at a high frequency. Appropriate data models are required to accommodate these data formats for analysing and producing required results with a quick response time. Genomics data can be analysed for a variety of purposes. Existing genomics data analysis pipelines are disk I/O intensive and focus on optimizing data processing for individual analysis tasks. Intensive disk I/O operations and focus on optimizing individual analysis tasks are the biggest bottleneck of existing genomics analysis pipelines. Making any updates in genomics data require reading the whole data set again.