Conquering confounds and covariates in machine learning

While studying the impact of different deconfounding methods, and adding covariate-regression support to neuropredict, I realized that the tools and methods I needed to implement would be useful to the broader machine learning and neuroscience communities. So I set aside a couple of weeks to review the relevant literature (especially in the context of biomarkers and predictive modeling), which convinced me that there are still many open questions. For example, there is no consensus on 1) what really constitutes a confound, 2) when we should try to deconfound it, and 3) how we should properly assess its impact. That convinced me even further that we need a common, open and dependable library to conquer confounds in machine learning. So, I’d like to announce the initial release of a new Python library called confounds.

Vision / Goals

The high-level, long-term goal of this package is to develop a high-quality library to conquer confounds and covariates in machine learning, biomarker and neuroscience applications.

By conquering, we mean methods and tools to

  • visualize and establish the presence of confounds (e.g. quantifying confound-to-target relationships),
  • offer solutions to handle them appropriately, via correction, removal, etc., and
  • analyze the effect of the deconfounding methods on the processed data (e.g. the ability to check whether they worked at all, or whether they introduced new or unwanted biases).
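To illustrate the "removal" route above, here is a minimal sketch of confound regression (residualization): regress each feature on the confound and keep only the residuals. The toy data, variable names, and the use of plain linear regression are my own assumptions for illustration, not the library's API:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: 100 samples, 5 features, one confound (e.g. age)
confound = rng.normal(size=(100, 1))
X = 0.8 * confound + rng.normal(size=(100, 5))  # features contaminated by the confound

# Residualization: fit a linear model predicting the features from the
# confound, then subtract its predictions to keep only the residuals
deconf = LinearRegression().fit(confound, X)
X_clean = X - deconf.predict(confound)

# By construction, OLS residuals are (linearly) uncorrelated with the confound
corr = np.corrcoef(confound.ravel(), X_clean[:, 0])[0, 1]
print(f"correlation with confound after cleaning: {corr:.6f}")  # near zero
```

Note that this only removes *linear* confound-feature relationships; nonlinear dependence can survive residualization, which is one reason checking whether deconfounding actually worked (the third goal above) matters.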

More docs with usage examples are available in the repo.

Fuller documentation, with an API reference and an example of usage within nested cross-validation (CV), is here.
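To give a flavour of why handling confounds inside CV matters: the deconfounder must be fit on the training fold only and then applied to the test fold, otherwise confound information leaks across the split and inflates performance estimates. A sketch of that pattern using generic scikit-learn pieces (the data and estimator choices are assumptions for illustration, not the library's API):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n = 200
confound = rng.normal(size=(n, 1))
y = (rng.random(n) < 0.5).astype(int)
# Features driven by both the target and the confound
X = rng.normal(size=(n, 10)) + 0.5 * confound + 0.3 * y[:, None]

scores = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit the deconfounder on the training fold ONLY,
    # then apply the same fitted model to both folds
    deconf = LinearRegression().fit(confound[train], X[train])
    X_tr = X[train] - deconf.predict(confound[train])
    X_te = X[test] - deconf.predict(confound[test])

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y[train])
    scores.append(accuracy_score(y[test], clf.predict(X_te)))

print(f"mean CV accuracy after fold-wise deconfounding: {np.mean(scores):.3f}")
```

Fitting the deconfounder on the whole dataset before splitting (a common mistake) is a form of leakage, exactly analogous to fitting a scaler outside the CV loop.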

Contributors are most welcome! All contributors will be credited in the repo and, for non-trivial contributions, will become authors on the paper to be published when the library is ready.

Relevant references

NOTE: the list below will be maintained in the confounds repo here.

This list will be updated regularly. If anything important, relevant or recent is missing from the list below, please send me a pull request to the above file; check this guide to learn how. Thanks.

  • VanderWeele, T. J., & Shpitser, I. (2013). On the definition of a confounder. Annals of Statistics, 41(1), 196.
  • Goh, W. W. B., Wang, W., & Wong, L. (2017). Why Batch Effects Matter in Omics Data, and How to Avoid Them. Trends in Biotechnology, 35(6), 498–507.
  • Görgen, K., Hebart, M. N., Allefeld, C., & Haynes, J.-D. (2018). The same analysis approach: Practical protection against the pitfalls of novel neuroimaging analysis methods. NeuroImage, 180, 19–30.
  • Hyatt, C. S., Owens, M. M., Crowe, M. L., Carter, N. T., Lynam, D. R., & Miller, J. D. (2019). The quandary of covarying: A brief review and empirical examination of covariate use in structural neuroimaging studies on psychological variables. NeuroImage, 116225.
  • Kaltenpoth, D., & Vreeken, J. (2019). We Are Not Your Real Parents: Telling Causal from Confounded using MDL. In Proceedings of the 2019 SIAM International Conference on Data Mining (pp. 199–207). Society for Industrial and Applied Mathematics.
  • Linn, K. A., Gaonkar, B., Doshi, J., Davatzikos, C., & Shinohara, R. T. (2015). Addressing Confounding in Predictive Models with an Application to Neuroimaging. The International Journal of Biostatistics, 12(1), 31–44.
  • Nguyen, H., Morris, R. W., Harris, A. W., Korgoankar, M. S., & Ramos, F. (2018). Correcting differences in multi-site neuroimaging data using Generative Adversarial Networks. arXiv:1803.09375 [cs].
  • Parr, T., & Wilson, J. D. (2019). A Stratification Approach to Partial Dependence for Codependent Variables. arXiv:1907.06698 [cs, stat].
  • Rao, A., Monteiro, J. M., Mourão-Miranda, J., & Alzheimer’s Disease Initiative. (2017). Predictive Modelling using Neuroimaging Data in the Presence of Confounds. NeuroImage.
  • Snoek, L., Miletić, S., & Scholte, H. S. (2019). How to control for confounds in decoding analyses of neuroimaging data. NeuroImage, 184, 741–760.

