Must read machine learning papers

For those interested in a deeper understanding of computer-aided diagnosis models and how their performance is evaluated, the following is a quick list of papers that have helped me immensely, and I strongly recommend that you read them as well.

This list is not about great machine learning algorithms or techniques, but about rigorously evaluating their generalization performance and comparing models properly. Hence it is by no means exhaustive, nor is it meant to be a living document that tracks the latest and greatest in machine learning. Neither is it deep :), as there are no papers on deep learning here (a great list on DL literature is here). However, I am pretty sure proper performance evaluation applies to the DL literature as well.

I am sure I missed some great papers. If you have any suggestions for great papers, or links to other great lists, let me know so I can try to add them here.

If this list proves useful to others, I can move it to a GitHub repository that everyone can edit.

References in alphabetical order

Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2009, March). A comparison of AUC estimators in small-sample studies. In Machine Learning in Systems Biology (pp. 3–13).
Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2011). An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Computational Statistics & Data Analysis, 55(4), 1828–1844.
Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.
Banerjee, B. (2006). Method of manufactured solutions. Mechanics, 20, 427–438.
Benavoli, A., Corani, G., & Mangili, F. (2016). Should We Really Use Post-Hoc Tests Based on Mean-Ranks? Journal of Machine Learning Research, 17(5), 1–10.
Borra, S., & Di Ciaccio, A. (2010). Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics & Data Analysis, 54(12), 2976–2989. http://doi.org/10.1016/j.csda.2010.03.004
Boyd, J. C. (1997). Mathematical tools for demonstrating the clinical usefulness of biochemical markers. Scandinavian Journal of Clinical and Laboratory Investigation, 57(sup227), 46–63.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159. http://doi.org/10.1016/S0031-3203(96)00142-2
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3), 503–514. http://doi.org/10.1093/biomet/76.3.503
Burman, P. (1990). Estimation of optimal transformations using v-fold cross validation and repeated learning-testing methods. Sankhyā: The Indian Journal of Statistics, Series A, 314–345.
Cawley, G. C., & Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11(Jul), 2079–2107.
Cook, N. R. (2007). Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation, 115(7), 928–935. http://doi.org/10.1161/CIRCULATIONAHA.106.672402
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3), 837. http://doi.org/10.2307/2531595
Demler, O. V., Pencina, M. J., & D’Agostino, R. B. (2011). Equivalence of improvement in area under ROC curve and linear discriminant analysis coefficient under assumption of normality. Statistics in Medicine.
Demler, O. V., Pencina, M. J., & D’Agostino, R. B., Sr. (2012). Misuse of DeLong test to compare AUCs for nested models. Statistics in Medicine, 31(23), 2577–2587. http://doi.org/10.1002/sim.5328
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan), 1–30.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1923.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78–87.
Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467), 619–632.
Efron, B., & Tibshirani, R. (1997). Improvements on Cross-Validation: The 632+ Bootstrap Method. Journal of the American Statistical Association, 92(438), 548–560. http://doi.org/10.1080/01621459.1997.10474007
Evgeniou, T., Pontil, M., & Elisseeff, A. (2004). Leave one out error, stability, and generalization of voting combinations of classifiers. Machine Learning, 55(1), 71–97.
Forman, G. (2002). A method for discovering the insignificance of one’s best classifier and the unlearnability of a classification task. In Proceedings of the First International Workshop on Data Mining Lessons Learned (DMLL-2002). Available: http://www.hpl.hp.com/personal/Tom_Fawcett/DMLL-2002/Forman.pdf
Forman, G., & Scholz, M. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter, 12(1), 49–57.
Hanczar, B., Hua, J., Sima, C., Weinstein, J., Bittner, M., & Dougherty, E. R. (2010). Small-sample precision of ROC-related estimates. Bioinformatics, 26(6), 822–830.
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171–186.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
Kearns, M. (1997). A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split. Neural Computation, 9(5), 1143–1161. http://doi.org/10.1162/neco.1997.9.5.1143
Kearns, M., & Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6), 1427–1453.
Kohavi, R. (1995, August). A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence (IJCAI) (Vol. 14, No. 2, pp. 1137–1145).
Markatou, M., Tian, H., Biswas, S., & Hripcsak, G. (2006). Analysis of variance of cross-validation estimators of the generalization error. The Journal of Machine Learning Research, 6(2), 1127.
McClish, D. K. (1989). Analyzing a portion of the ROC curve. Medical Decision Making, 9(3), 190–195.
Metz, C. E. (2008). ROC analysis in medical imaging: a tutorial review of the literature. Radiological Physics and Technology, 1(1), 2–12. http://doi.org/10.1007/s12194-007-0002-1
Molinaro, A. M., Simon, R., & Pfeiffer, R. M. (2005). Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21(15), 3301–3307. http://doi.org/10.1093/bioinformatics/bti499
Park, S. H., Goo, J. M., & Jo, C.-H. (2004). Receiver Operating Characteristic (ROC) Curve: Practical Review for Radiologists. Korean Journal of Radiology, 5(1), 11–18. http://doi.org/10.3348/kjr.2004.5.1.11
Poggio, T., & Smale, S. (2003). The mathematics of learning: Dealing with data. Notices of the AMS, 50(5), 537–544.
Provost, F. J., Fawcett, T., & Kohavi, R. (1998, July). The case against accuracy estimation for comparing induction algorithms. In ICML (Vol. 98, pp. 445–453).
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486–494.
Steyerberg, E. W., Van Calster, B., & Pencina, M. J. (2011). Performance measures for prediction models and markers: Evaluation of predictions and classifications. Revista Española de Cardiología (English Edition), 64(9), 788–794. https://doi.org/10.1016/j.rec.2011.05.004
Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J., & Kattan, M. W. (2010). Assessing the performance of prediction models: A framework for traditional and novel measures. Epidemiology, 21(1), 128–138. https://doi.org/10.1097/EDE.0b013e3181c30fb2
Stone, M. (1977). Asymptotics For and Against Cross-Validation. Biometrika, 64(1), 29. http://doi.org/10.2307/2335766
Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
Wainer, J., & Cawley, G. (2017). Empirical Evaluation of Resampling Procedures for Optimising SVM Hyperparameters. Journal of Machine Learning Research, 18(15), 1–35.
Zinkevich, M. (2017). Rules of Machine Learning: Best Practices for ML Engineering. Link: http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf

You can blame my reference management app (papersapp.com) for any incomplete, incorrect, or ill-formatted citations.

keywords: performance evaluation, model selection, kaggle, comparison, cross validation
