Comparing the diagnostic performance of methods used in a full-factorial design multi-reader multi-case studies


Computational Statistics, vol.38, no.3, pp.1537-1553, 2023 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 38 Issue: 3
  • Publication Date: 2023
  • Doi Number: 10.1007/s00180-022-01309-1
  • Journal Name: Computational Statistics
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, ABI/INFORM, zbMATH, Civil Engineering Abstracts
  • Page Numbers: pp.1537-1553
  • Keywords: Multi-reader multi-case, Dorfman-Berbaum-Metz method, Obuchowski-Rockette method, BCa bootstrap, Diagnostic test
  • Erciyes University Affiliated: Yes


© 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

In radiology, patients are frequently diagnosed according to radiologists' subjective interpretations of an image. Such diagnoses may be biased and may differ significantly among evaluators (i.e., readers) because of differences in education and experience. One way to address this problem is a multi-reader multi-case study design, in which multiple readers each evaluate the same images. Several methods, both model-based and bootstrap-based, are available for analyzing multi-reader multi-case studies. In this study, we aimed to compare the performance of the available methods on a mammogram dataset. We also conducted a comprehensive simulation study to extend the results to a broader range of scenarios. In the simulation set-up, we considered the effect of the number of samples and readers, the data structure (i.e., correlation structures and variance components), and the overall accuracy of the diagnostic tests (AUC). The type-I error rates of the model-based methods approached the nominal level as the number of samples and readers increased. Bootstrap-based methods, on the other hand, were generally conservative; however, they performed best when the sample size was small and the AUC level was high. In conclusion, the performance of the compared methods was not the same under all conditions and was affected by the factors considered in the simulation study. Relying on a single method under all scenarios is therefore not advisable, because it may lead to biased conclusions.
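To make the quantities in the abstract concrete, the following is a minimal sketch of a reader-averaged empirical AUC and a bootstrap confidence interval for the AUC difference between two diagnostic tests. It rests on several assumptions: it uses a plain percentile bootstrap that resamples cases only, not the BCa interval and not the Dorfman-Berbaum-Metz or Obuchowski-Rockette models examined in the paper, and all function names (`empirical_auc`, `reader_averaged_auc`, `percentile_bootstrap_auc_diff`) and the `(readers, cases)` array layout are hypothetical choices made for illustration.

```python
import numpy as np


def empirical_auc(neg, pos):
    """Mann-Whitney estimate of AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    neg = np.asarray(neg, dtype=float)
    pos = np.asarray(pos, dtype=float)
    diff = pos[:, None] - neg[None, :]  # every diseased/non-diseased pair
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def reader_averaged_auc(scores, truth):
    """Average the empirical AUC over readers.

    scores: array of shape (readers, cases); truth: 0/1 label per case.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth).astype(bool)
    return float(np.mean([empirical_auc(r[~truth], r[truth]) for r in scores]))


def percentile_bootstrap_auc_diff(scores_a, scores_b, truth,
                                  n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC(test A) - AUC(test B).

    Assumption: only cases are resampled (readers are treated as fixed),
    which is a simplification of the designs studied in the paper.
    """
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    truth = np.asarray(truth).astype(bool)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(truth)
    neg_idx = np.flatnonzero(~truth)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample within each truth class so both classes stay non-empty.
        idx = np.concatenate([
            rng.choice(pos_idx, size=pos_idx.size),
            rng.choice(neg_idx, size=neg_idx.size),
        ])
        t = truth[idx]
        diffs[b] = (reader_averaged_auc(scores_a[:, idx], t)
                    - reader_averaged_auc(scores_b[:, idx], t))
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Under this sketch, an interval that excludes zero would suggest a difference in reader-averaged AUC between the two tests. Repeating the procedure over many datasets simulated with equal true AUCs, and counting how often zero is excluded, gives an empirical type-I error rate of the kind the abstract refers to.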