Must-read machine learning papers

For those interested in a deeper understanding of computer-aided diagnosis models and how their performance is evaluated, here is a quick list of papers that have helped me immensely, and I strongly recommend you read them as well.

This list is not about great machine learning algorithms or techniques, but about rigorously evaluating their generalization performance and conducting proper model comparisons. Hence it is by no means exhaustive, nor is it meant to be a living, breathing document tracking the latest and greatest in machine learning. Neither is it deep :), as there are no papers on deep learning here (a great list on DL literature is here). However, I am pretty sure proper performance evaluation applies to the DL literature as well.
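The selection bias that several of the papers below warn about (Cawley & Talbot; Varma & Simon) is easy to reproduce yourself. Here is a minimal sketch in plain Python (fixed seed; all the names are mine, not from any paper): it scores 1,000 "classifiers" that guess labels at random and picks the winner on the very labels it is then evaluated on.

```python
import random

rng = random.Random(0)
n = 100
labels = [rng.randint(0, 1) for _ in range(n)]

def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# 1000 "classifiers" that ignore the data and guess labels uniformly at
# random -- each one's true accuracy is exactly 50%.
models = [[rng.randint(0, 1) for _ in range(n)] for _ in range(1000)]

# Pick the winner using the SAME labels we then report accuracy on ...
best = max(models, key=lambda m: accuracy(m, labels))
print(f"selection-set accuracy: {accuracy(best, labels):.2f}")

# ... while an untouched set of labels exposes the optimism: back to chance.
fresh = [rng.randint(0, 1) for _ in range(n)]
print(f"fresh-set accuracy:     {accuracy(best, fresh):.2f}")
```

Because the winner was chosen on the evaluation data, its reported accuracy lands well above 50% even though every model is guessing; data used for model selection cannot double as the final test set.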

I am sure I missed some great papers. If you have suggestions for papers, or links to other great lists, let me know so I can try to add them here.

If others find this list useful, I can move it to a GitHub repository that anyone can edit.

References in alphabetical order

  • Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2009, March). A comparison of AUC estimators in small-sample studies. In Machine Learning in Systems Biology (pp. 3-13).
  • Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2011). An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Computational Statistics & Data Analysis, 55(4), 1828-1844.
  • Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.
  • Banerjee, B. (2006). Method of manufactured solutions. Mechanics, 20, 427-438.
  • Benavoli, A., Corani, G., & Mangili, F. (2016). Should We Really Use Post-Hoc Tests Based on Mean-Ranks? Journal of Machine Learning Research, 17(5), 1–10.
  • Borra, S., & Di Ciaccio, A. (2010). Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics & Data Analysis, 54(12), 2976–2989.
  • Boyd, J. C. (1997). Mathematical tools for demonstrating the clinical usefulness of biochemical markers. Scandinavian Journal of Clinical and Laboratory Investigation, 57(sup227), 46-63.
  • Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159.
  • Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3), 503–514.
  • Burman, P. (1990). Estimation of optimal transformations using v-fold cross validation and repeated learning-testing methods. Sankhyā: The Indian Journal of Statistics, Series A, 314–345.
  • Cawley, G. C., & Talbot, N. L. C. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul), 2079–2107.
  • Cook, N. R. (2007). Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation, 115(7), 928–935.
  • DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3), 837–845.
  • Demler, O. V., Pencina, M. J., & D’Agostino, R. B. (2011). Equivalence of improvement in area under ROC curve and linear discriminant analysis coefficient under assumption of normality. Statistics in Medicine.
  • Demler, O. V., Pencina, M. J., & D’Agostino, R. B., Sr. (2012). Misuse of DeLong test to compare AUCs for nested models. Statistics in Medicine, 31(23), 2577–2587.
  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan), 1–30.
  • Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1923.
  • Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87.
  • Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467), 619–632.
  • Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: the 632+ bootstrap method. Journal of the American Statistical Association, 92(438), 548–560.
  • Evgeniou, T., Pontil, M., & Elisseeff, A. (2004). Leave one out error, stability, and generalization of voting combinations of classifiers. Machine Learning, 55(1), 71-97.
  • Forman, G. (2002). A method for discovering the insignificance of one’s best classifier and the unlearnability of a classification task. In Proceedings of the First International Workshop on Data Mining Lessons Learned (DMLL-2002). Available: http://www.hpl.hp.com/personal/Tom_Fawcett/DMLL-2002/Forman.pdf
  • Forman, G., & Scholz, M. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter, 12(1), 49-57.
  • Hanczar, B., Hua, J., Sima, C., Weinstein, J., Bittner, M., & Dougherty, E. R. (2010). Small-sample precision of ROC-related estimates. Bioinformatics, 26(6), 822-830.
  • Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine learning, 45(2), 171-186.
  • Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
  • Kearns, M. (1997). A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split. Neural Computation, 9(5), 1143–1161.
  • Kearns, M., & Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural computation, 11(6), 1427-1453.
  • Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI) (pp. 1137–1145).
  • Markatou, M., Tian, H., Biswas, S., & Hripcsak, G. (2005). Analysis of variance of cross-validation estimators of the generalization error. Journal of Machine Learning Research, 6, 1127–1168.
  • McClish, D. K. (1989). Analyzing a portion of the ROC curve. Medical Decision Making, 9(3), 190-195.
  • Metz, C. E. (2008). ROC analysis in medical imaging: a tutorial review of the literature. Radiological Physics and Technology, 1(1), 2–12.
  • Molinaro, A. M., Simon, R., & Pfeiffer, R. M. (2005). Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21(15), 3301–3307.
  • Park, S. H., Goo, J. M., & Jo, C.-H. (2004). Receiver operating characteristic (ROC) curve: practical review for radiologists. Korean Journal of Radiology, 5(1), 11–18.
  • Poggio, T., & Smale, S. (2003). The mathematics of learning: dealing with data. Notices of the AMS, 50(5), 537–544.
  • Provost, F. J., Fawcett, T., & Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. In ICML (Vol. 98, pp. 445–453).
  • Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486–494.
  • Steyerberg, E. W., Calster, B. V., & Pencina, M. J. (2011). Performance measures for prediction models and markers: evaluation of predictions and classifications. Revista Española de Cardiología (English Edition), 64(9), 788–794.
  • Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J., & Kattan, M. W. (2010). Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 21(1), 128–138.
  • Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64(1), 29–35.
  • Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
  • Wainer, J., & Cawley, G. (2017). Empirical Evaluation of Resampling Procedures for Optimising SVM Hyperparameters. Journal of Machine Learning Research, 18(15), 1–35.
  • Zinkevich, M. (2017). Rules of Machine Learning: Best Practices for ML Engineering.
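Several of the ROC papers above (e.g. Hanley & McNeil 1982; Bradley 1997; DeLong et al. 1988) build on the equivalence between the AUC and the Mann–Whitney U statistic: the AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that equivalence (the naming is mine; ties count as one half):

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs ranked correctly, ties counting 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking gives AUC 1.0; a fully reversed one gives 0.0.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0]))  # 0.0
```

Note that, unlike accuracy, this quantity depends only on the ranking of the scores, not on any decision threshold, which is one reason the papers above prefer it for comparing classifiers.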

You can blame my reference management app for incomplete, incorrect, or ill-formatted citations.

keywords: performance evaluation, model selection, kaggle, comparison, cross validation
