A modification of the Lasso method by using the Bahadur representation for the genome-wide association study

Lev V. Utkin, Yulia A. Zhuk


This paper proposes a modification of the Lasso method, a powerful machine learning tool, for genome-wide association studies. From the machine learning point of view, the paper solves a feature selection problem in which the features are single nucleotide polymorphisms (SNPs), or DNA markers, whose association with a quantitative trait is to be established. The main idea underlying the modification is to take into account correlations between DNA markers and peculiarities of the phenotype values by using the Bahadur representation of the joint probabilities of binary random variables. Interactions between DNA markers, known as epistasis, are also considered within the framework of the proposed modification. Numerical experiments with real datasets illustrate the proposed modification.
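To make the central ingredient concrete, the following is a minimal sketch of the Bahadur representation truncated at second order. It is not the authors' implementation, and it does not show how the representation enters the Lasso penalty, which the abstract does not specify. The joint probability of binary variables is written as the product of the marginals (the independence case) multiplied by a correction built from the pairwise correlations of the standardized variables. The function name `bahadur_second_order`, its arguments, and the toy numbers are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def bahadur_second_order(x, p, r):
    """Second-order Bahadur approximation of P(X = x) for binary x.

    x: tuple of 0/1 values; p: marginals P(X_i = 1), each strictly in (0, 1);
    r: dict mapping index pairs (i, j), i < j, to correlations of X_i and X_j.
    """
    # Probability of x if all variables were independent.
    independent = 1.0
    for xi, pi in zip(x, p):
        independent *= pi if xi == 1 else 1.0 - pi
    # Standardized values z_i = (x_i - p_i) / sqrt(p_i (1 - p_i)).
    z = [(xi - pi) / sqrt(pi * (1.0 - pi)) for xi, pi in zip(x, p)]
    # Pairwise correction; higher-order terms are dropped in this sketch.
    correction = 1.0 + sum(
        r.get((i, j), 0.0) * z[i] * z[j]
        for i, j in combinations(range(len(x)), 2)
    )
    return independent * correction


# Toy joint distribution of two binary markers (made-up numbers).
joint = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.3}
p = [0.5, 0.6]  # marginals implied by `joint`
cov = joint[(1, 1)] - p[0] * p[1]
r = {(0, 1): cov / sqrt(p[0] * (1 - p[0]) * p[1] * (1 - p[1]))}
approx = {x: bahadur_second_order(x, p, r) for x in joint}
```

For two variables the second-order form is exact, so `approx` reproduces `joint` up to rounding. With more variables the truncation is only an approximation and, for strong correlations, may even yield values outside [0, 1].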




This work is licensed under a Creative Commons Attribution 3.0 License.