Quantitative comparison of scikit-learn’s predictive models on a large number of machine learning datasets: A good start

Open review for: Olson, R. S., La Cava, W., Mustahsan, Z., Varik, A., & Moore, J. H. (2017). Data-driven Advice for Applying Machine Learning to Bioinformatics Problems. arXiv preprint. Link: https://arxiv.org/abs/1708.05070

A shameless plug: check out my ML projects, share them, and/or contribute.

Background

Given the need for accurate classifiers in various biomedical domains, the choice overload in the machine learning (ML) community, and the lack of clear guidelines for making a choice, quantitative comparisons are useful, especially to non-ML experts. This paper is one such comparison, and it stands out for many reasons: not only for its sheer effort and potential utility, but also for sharing the code. I’d like to commend the authors on these aspects.

Strengths

  • Enormous effort in producing these results 👏🏽
  • Usage of scikit-learn implementation 👍
  • Well-designed experiments and well-written paper
  • Organizing and sharing Penn Machine Learning Benchmark (PMLB) datasets
  • Sharing the results and code in an accessible manner 🙌
  • Attempted focus on bioinformatics datasets

Areas for improvement

  • The cross-validation procedure is not repeated, so the results are likely unstable and cannot be used for a proper statistical comparison.
  • Tied to the CV choice above, the definition of outperformance (a difference in accuracy of over 1%) is rather arbitrary. A non-parametric Friedman test (on repeated-CV results) would be a more prudent choice.
  • A citation to the important prior work below is missing, along with a discussion of how the current work compares to it (strengths and weaknesses):
    • Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15, 3133–3181.
  • The focus on bioinformatics datasets needs to be refined.

Comments

  • What’s so “bioinformatics” about this analysis? The authors motivate the paper with a focus on bioinformatics, but that focus is not maintained throughout. Discuss how many datasets in PMLB are bioinformatics-oriented, and why bioinformaticians should care about these results. I’d strongly recommend a deeper analysis of fewer bioinformatics datasets over a shallow analysis of a larger, broader set of generic ML datasets.
  • The reasons for selecting this particular subset of 13 algorithms need to be elaborated:
    • if the reason was primarily availability in scikit-learn, the title and paper must say so and limit the scope appropriately;
    • if the results are meant to apply to a particular field (such as bioinformatics), it is important to consider other packages used in that field, or justify their omission.
  • The Methods section could be improved by elaborating on the reasoning behind the many choices made for this analysis. For example, the choice of grid search for model selection needs to be justified (as it may affect the estimated performance non-uniformly across algorithms), and its impact should be discussed in relation to other possible choices.
  • The impact of choosing a fixed cross-validation (CV) scheme must be briefly discussed. Why not 5 or 20 folds? How do the algorithm ranks change?
    • Given that the overarching goal of this paper is to make recommendations, it is important to find algorithms that are robust to these evaluation choices.
    • If it is computationally infeasible to repeat the process for all combinations, select the top 40% of algorithms and half of the most important/largest datasets to try different CV schemes.
  • The authors have chosen to normalize their features to zero mean and unit variance. It should be noted that this choice makes a difference, as other options are available, such as min-max scaling.
    • More importantly, it appears that standardization has been performed on the entire dataset prior to any CV; if not, this needs to be clarified explicitly. Standardizing before splitting leaks information from the test set into training (even though the target labels are not used) and should be avoided at all costs. This reviewer strongly recommends estimating the transformation from the training set only and applying it to the test set, to improve confidence in the CV results and the algorithm rankings (see the first sketch after this list).
  • Why wasn’t the k-fold CV repeated? Performance changes depending on the choice of splits, so it is strongly recommended to repeat the CV a large number of times (100–300, depending on sample size) and to use the resulting performance distributions to compare the algorithms (see the first sketch after this list). For an example, see the research presented in [2] and a clear Python implementation in neuropredict.
  • If possible, sticking to the same partitions across algorithms in the CV scheme, as in [1, 2], would be better. This yields paired estimates, increasing the power of the significance tests later on.
  • The chosen “outperformance” metric is weak.
    • Outperformance is defined as: “One algorithm outperforms another if it has at least a 1% higher 10-fold CV balanced accuracy.” Not only is the 1% threshold rather arbitrary, but the comparison rests on single, unrepeated point estimates.
    • A proper comparison would obtain cross-validated distributions of balanced accuracy (as in [2]) under repeated k-fold or repeated-split CV, and then apply a non-parametric Friedman test, which the authors already use in their post-hoc analysis (see the second sketch after this list).
  • Support the following statements with citations or explanations:
    • “However, the value of focused, reproducible ML experiments is still paramount” I agree with the statement, but this needs to be explained and supported. Hint: machine learning results from one domain often do not generalize to another! Any others?
    • “algorithms with more parameters have an advantage in the sense that they have more training attempts on each dataset”
  • I’m surprised by the statement: “These strong statistical results are surprising given the large set of problems and algorithms compared here”. A large set of datasets does not cover all classes of problems, and hence the No Free Lunch theorem does not necessarily apply, unless the authors can show that their collection of datasets is actually representative of all classes of problems under the sun.
  • I will reserve my interpretation of the results until new results are available that address the recommendations above, as well as those from other reviewers. This unfortunately prevents me from reading into the meat of this paper: the beautiful visualizations and the recommended set of algorithms.
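
To make the two CV-related points above concrete, here is a minimal sketch (not the authors' code) of how scikit-learn's Pipeline keeps standardization inside the training folds while RepeatedStratifiedKFold yields a distribution of scores rather than a single estimate. The dataset, classifier, number of repeats, and the 'balanced_accuracy' scorer (available in recent scikit-learn versions) are assumptions for illustration only.

```python
# Minimal sketch, not the authors' pipeline: leakage-free scaling + repeated CV.
# Dataset, classifier, repeat count and scorer are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so it is re-fit on the training folds
# only and merely applied to the held-out fold: no test-set leakage.
clf = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=0))

# 10-fold CV repeated 30 times gives a distribution of balanced accuracies
# rather than a single split-dependent number.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=30, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring='balanced_accuracy')
print('balanced accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```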
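
A second sketch illustrates the suggested statistical comparison: a Friedman test across algorithms using per-dataset balanced accuracies. The score matrix below is random placeholder data standing in for the real (repeated-)CV results, with one row per dataset and one column per algorithm.

```python
# Minimal sketch of the suggested non-parametric comparison; the score matrix
# below is a random placeholder standing in for the real per-dataset results.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.RandomState(0)
scores = rng.rand(165, 13)  # placeholder: 165 datasets x 13 algorithms

# Friedman test: one sample of scores per algorithm, paired across datasets.
stat, p_value = friedmanchisquare(*scores.T)

# Average rank of each algorithm across datasets (rank 1 = best).
ranks = np.apply_along_axis(rankdata, 1, -scores)
print('Friedman chi-square = %.2f, p = %.3g' % (stat, p_value))
print('average ranks:', np.round(ranks.mean(axis=0), 2))
```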

Literature survey

  • Missing citations to important prior work such as [1].
  • For sake of completeness in the efforts towards general guidelines on building ML models, I’d suggest citing [3, 4].
  • Citations in the Introduction section are old; cover more recent literature.
  • Add more references in paragraph 2 of the Introduction, e.g. blog posts linking to popular ML packages other than scikit-learn.

Presentation

  • Sentences could be shortened, with restrained use of adjectives
  • Figure and table captions should be much longer and describe all the details, so they can stand alone and walk the reader through the results.
  • I couldn’t quite see this statement reflected in Figure 2: “there are 9 datasets for which Multinomial NB performs as well as or better than Gradient Tree Boosting”. Either a mistake, or an insufficient explanation?
  • More data could be shown in Table 4 so the reader can see the reasoning and data behind the decisions, e.g. the range of performance for each classifier and the median performance of all classifiers on that dataset.

Desired future work

  • I highly recommend employing better choices for classifier performance evaluation and comparison, as detailed in my comments above.
  • Focusing this work on bioinformatics datasets (fewer datasets, deeper analysis) would have more utility and greater impact.
  • Addition of feature selection: given that the authors utilize scikit-learn’s pipeline mechanism, the omission of feature selection/transformation is an interesting, if not odd, choice. Feature selection is an important step in the ML workflow and is often employed in practice, especially in biomedical domains, given the curse of dimensionality. For this study to have maximum impact, multiplying the combinations compared by at least a few popular feature selection methods seems necessary (a sketch follows this list). Even if the results show that feature selection does not improve performance over the current best classifier without it, that would still be a useful finding.
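
As a rough illustration of the feature-selection suggestion above, univariate selection can be dropped into the same kind of scikit-learn pipeline the authors already use, so that the selector is re-fit within each training fold and its k is tuned alongside the classifier's hyperparameters. The particular selector, k values, and classifier below are assumptions chosen for illustration, not the authors' setup.

```python
# Rough illustration, not the authors' setup: adding univariate feature
# selection as a pipeline step so it is fit on training folds only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif)),
    ('clf', RandomForestClassifier(random_state=0)),
])

# The number of selected features is tuned together with the classifier's
# hyperparameters inside the existing grid search.
param_grid = {
    'select__k': [10, 20, 'all'],
    'clf__n_estimators': [100, 500],
}
search = GridSearchCV(pipe, param_grid, scoring='balanced_accuracy', cv=10)
# search.fit(X_train, y_train)  # X_train, y_train: any dataset of interest (hypothetical names)
```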

Again, this is an enormous effort with great potential to help bioinformaticians in their research, and I would like to commend Olson and colleagues for their work towards data-driven, useful recommendations for ML users and developers.


References

  1. Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15, 3133–3181.
  2. Raamana, P. R., & Strother, S. C. (2017). Impact of spatial scale and edge weight on predictive power of cortical thickness networks. bioRxiv 170381. http://www.biorxiv.org/content/early/2017/07/31/170381. doi: https://doi.org/10.1101/170381
  3. Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87.
  4. Zinkevich, M. (2017). Rules of Machine Learning: Best Practices for ML Engineering.

Disclaimer

Lacking information on how the authors expect the paper to be reviewed, I am reviewing it as a typical submission to a biomedical journal (which it does not appear to be, given its relatively short length). However, given the potential of this article to be useful to a subset of researchers, I’ve shared my comments in the hope that they help the authors and other researchers in this area.
