For those interested in a deeper understanding of computer-aided diagnosis models and how their performance is evaluated, here is a quick list of papers that have helped me immensely. I strongly recommend reading them as well.
This list is not about great machine learning algorithms or techniques, but about evaluating their generalization performance rigorously and comparing models properly. It is by no means exhaustive, nor is it meant to be a living document that tracks the latest and greatest in machine learning. Neither is it deep :), as there are no papers on deep learning here (a great list on DL literature is here). However, I am pretty sure proper performance evaluation applies to the DL literature as well.
I am sure I missed some great papers. If you have suggestions for great papers, or links to other great lists, let me know so I can try to add them here.
If others find this useful, I can move the list to a GitHub repository that everyone can edit.
References in alphabetical order
- Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2009, March). A comparison of AUC estimators in small-sample studies. In Machine Learning in Systems Biology (pp. 3-13).
- Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2011). An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Computational Statistics & Data Analysis, 55(4), 1828-1844.
- Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.
- Banerjee, B. (2006). Method of manufactured solutions. Mechanics, 20, 427-438.
- Benavoli, A., Corani, G., & Mangili, F. (2016). Should We Really Use Post-Hoc Tests Based on Mean-Ranks? Journal of Machine Learning Research, 17(5), 1–10.
- Borra, S., & Di Ciaccio, A. (2010). Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics & Data Analysis, 54(12), 2976–2989. http://doi.org/10.1016/j.csda.2010.03.004
- Boyd, J. C. (1997). Mathematical tools for demonstrating the clinical usefulness of biochemical markers. Scandinavian Journal of Clinical and Laboratory Investigation, 57(sup227), 46-63.
- Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159. http://doi.org/10.1016/S0031-3203(96)00142-2
- Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3), 503–514. http://doi.org/10.1093/biomet/76.3.503
- Burman, P. (1990). Estimation of optimal transformations using v-fold cross validation and repeated learning-testing methods. Sankhyā: The Indian Journal of Statistics, Series A, 314-345.
- Cawley, G. C., & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11(Jul), 2079–2107.
- Cook, N. R. (2007). Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation, 115(7), 928–935. http://doi.org/10.1161/CIRCULATIONAHA.106.672402
- DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3), 837. http://doi.org/10.2307/2531595
- Demler, O. V., Pencina, M. J., & D’Agostino, R. B. (2011). Equivalence of improvement in area under ROC curve and linear discriminant analysis coefficient under assumption of normality. Statistics in Medicine.
- Demler, O. V., Pencina, M. J., & D’Agostino, R. B., Sr. (2012). Misuse of DeLong test to compare AUCs for nested models. Statistics in Medicine, 31(23), 2577–2587. http://doi.org/10.1002/sim.5328
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan), 1-30.
- Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1923.
- Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87.
- Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467), 619-632.
- Efron, B., & Tibshirani, R. (1997). Improvements on Cross-Validation: The 632+ Bootstrap Method. Journal of the American Statistical Association, 92(438), 548–560. http://doi.org/10.1080/01621459.1997.10474007
- Evgeniou, T., Pontil, M., & Elisseeff, A. (2004). Leave one out error, stability, and generalization of voting combinations of classifiers. Machine Learning, 55(1), 71-97.
- Forman, G. (2002). A method for discovering the insignificance of one’s best classifier and the unlearnability of a classification task. In Proceedings of the First International Workshop on Data Mining Lessons Learned (DMLL-2002). Available: http://www.hpl.hp.com/personal/Tom_Fawcett/DMLL-2002/Forman.pdf
- Forman, G., & Scholz, M. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter, 12(1), 49-57.
- Hanczar, B., Hua, J., Sima, C., Weinstein, J., Bittner, M., & Dougherty, E. R. (2010). Small-sample precision of ROC-related estimates. Bioinformatics, 26(6), 822-830.
- Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171-186.
- Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
- Kearns, M. (1997). A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split. Neural Computation, 9(5), 1143–1161. http://doi.org/10.1162/neco.1997.9.5.1143
- Kearns, M., & Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6), 1427-1453.
- Kohavi, R. (1995, August). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI (Vol. 14, No. 2, pp. 1137-1145).
- Markatou, M., Tian, H., Biswas, S., & Hripcsak, G. (2006). Analysis of variance of cross-validation estimators of the generalization error. The Journal of Machine Learning Research, 6(2), 1127.
- McClish, D. K. (1989). Analyzing a portion of the ROC curve. Medical Decision Making, 9(3), 190-195.
- Metz, C. E. (2008). ROC analysis in medical imaging: a tutorial review of the literature. Radiological Physics and Technology, 1(1), 2–12. http://doi.org/10.1007/s12194-007-0002-1
- Molinaro, A. M., Simon, R., & Pfeiffer, R. M. (2005). Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21(15), 3301–3307. http://doi.org/10.1093/bioinformatics/bti499
- Park, S. H., Goo, J. M., & Jo, C.-H. (2004). Receiver Operating Characteristic (ROC) Curve: Practical Review for Radiologists. Korean Journal of Radiology, 5(1), 11–18. http://doi.org/10.3348/kjr.2004.5.1.11
- Poggio, T., & Smale, S. (2003). The mathematics of learning: Dealing with data. Notices of the AMS, 50(5), 537-544.
- Provost, F. J., Fawcett, T., & Kohavi, R. (1998, July). The case against accuracy estimation for comparing induction algorithms. In ICML (Vol. 98, pp. 445-453).
- Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486–494.
- Steyerberg, E. W., Van Calster, B., & Pencina, M. J. (2011). Performance measures for prediction models and markers: Evaluation of predictions and classifications. Revista Española de Cardiología (English Edition), 64(9), 788–794. https://doi.org/10.1016/j.rec.2011.05.004
- Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J., & Kattan, M. W. (2010). Assessing the performance of prediction models: A framework for traditional and novel measures. Epidemiology, 21(1), 128–138. https://doi.org/10.1097/EDE.0b013e3181c30fb2
- Stone, M. (1977). Asymptotics For and Against Cross-Validation. Biometrika, 64(1), 29. http://doi.org/10.2307/2335766
- Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
- Wainer, J., & Cawley, G. (2017). Empirical Evaluation of Resampling Procedures for Optimising SVM Hyperparameters. Journal of Machine Learning Research, 18(15), 1–35.
- Zinkevich, M. (2017). Rules of Machine Learning: Best Practices for ML Engineering. Link: http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
You can blame my reference management app (papersapp.com) for any incomplete, incorrect, or ill-formatted citations.
keywords: performance evaluation, model selection, kaggle, comparison, cross validation