TY - THES
T1 - Model selection by cross-validation in multi-environment trials
A1 - Hadasch, Steffen
Y1 - 2018/10/23
N2 - In plant breeding, estimating the performance of genotypes across a set of tested environments (genotype means) and estimating the environment-specific performance of the genotypes (genotype-environment means) are important tasks. For this purpose, breeders conduct multi-environment trials (MET) in which a set of genotypes is tested in a set of environments. The data from such experiments are typically analysed with mixed models because such models allow, for example, the genotypes to be modelled by random effects that may be correlated according to their genetic information. The data from MET are often high-dimensional, and the covariance matrix of the data may contain many parameters that need to be estimated. To reduce the computational burden, the data can be analysed in a stage-wise fashion. In a stage-wise analysis, the covariance matrix of the data needs to be taken into account in the estimation of each stage. In the analysis of MET data, there is usually a set of candidate models from which the one that best fits the objective of the breeder needs to be determined. Such model selection can be done by cross-validation (CV). When applying CV schemes, different objectives of the breeder can be evaluated by using an appropriate sampling strategy. Both the sampling strategy and the evaluation of the model need to take the correlation of the data into account so that model performance is evaluated adequately.
This work focused on two types of models used for the analysis of MET. In Chapter 2, models that use genetic marker information to estimate the genotype means were considered. Chapters 3 and 4 focused on the estimation of genotype-environment means using models that include multiplicative terms to describe the genotype-environment interaction, namely the additive main effects and multiplicative interaction (AMMI) model and the genotype and genotype-environment interaction (GGE) model. In all chapters, the models were estimated in a stage-wise fashion. Furthermore, CV was used in Chapters 2 and 3 to determine the most appropriate model from a set of candidate models.
In Chapter 2, two traits of a biparental lettuce (Lactuca sativa L.) population were analysed by models for (i) phenotypic selection, (ii) marker-assisted selection using QTL-linked markers, (iii) genomic prediction using all available molecular markers, and (iv) a combination of genomic prediction and QTL-linked markers. Using different sampling strategies in a CV, the predictive performances of these models were compared in terms of different objectives of a breeder, namely predicting unobserved genotypes, predicting genotypes in an unobserved environment, and predicting unobserved genotypes in an unobserved environment. Generally, the genomic prediction model outperformed marker-assisted and phenotypic selection when there were only a few markers with large effects, while marker-assisted selection outperformed genomic prediction when the number of markers with large effects increased. Furthermore, the results obtained for the different objectives indicate that the predictive performance of the models in terms of predicting (unobserved) genotypes in an unobserved environment is reduced by the presence of genotype-environment interaction.
In AMMI/GGE models, the number of multiplicative terms can be determined by CV. In Chapter 3, different CV schemes were compared in a simulation study in terms of recovering the true (simulated) number of multiplicative terms, and in terms of the mean squared error of the estimated genotype-environment means. The data were simulated using the estimated variance components of a randomized complete block design and a resolvable incomplete block design. The effects of the experimental design (replicates and blocks) need to be taken into account in the application of a CV in order to evaluate the predictive performance of the model adequately. In Chapter 3, the experimental design was accounted for by an adjustment of the data for the design effects estimated from all data before applying a CV scheme. The results of the simulation study show that an adjustment of the data is required to determine the number of multiplicative terms in AMMI/GGE models. Furthermore, the results indicate that different CV schemes can be used with similar efficiencies provided that the data were adjusted adequately.
AMMI/GGE models are typically estimated in a two-stage analysis in which the first stage consists of estimating the genotype-environment means, while the second stage consists of estimating the main effects of genotypes and environments and the multiplicative interaction. The genotype-environment means estimated in the first stage are not independent when effects of the experimental design are modelled as random effects. In such a case, the second stage should be estimated by a weighted (generalized least squares) estimation in which a weighting matrix is used to take the covariance matrix of the estimated genotype-environment means into account. In Chapter 4, three different algorithms that can take the full covariance matrix of the genotype-environment means into account are introduced to estimate the AMMI/GGE model in a weighted fashion. To investigate the effectiveness of the weighted estimation, the algorithms were implemented using different weighting matrices, including (i) an identity matrix (unweighted estimation), (ii) a diagonal approximation of the inverse covariance matrix of the genotype-environment means, and (iii) the full inverse covariance matrix. The different weighting strategies were compared in a simulation study in terms of the mean squared error of the estimated genotype-environment means, multiplicative interaction effects, and biplot coordinates. The results of the simulation study show that weighted estimation of the AMMI/GGE model generally outperformed unweighted estimation. Furthermore, the effectiveness of a weighted estimation increased when the heterogeneity in the covariance matrix of the estimated genotype-environment means increased.
The analysis of MET in a stage-wise fashion is an efficient procedure to estimate a model for MET data, but the covariance structure of the data needs to be taken into account in each stage in order to estimate the model appropriately. When correlated data are used in a CV, the correlation can be taken into account by an appropriate choice of training and validation data, by an adjustment of the data before applying a CV scheme, and by the success criterion used in a CV scheme.
KW - cross-validation
KW - genome-wide selection
KW - marker-assisted selection
KW - multiplicative models
CY - Hohenheim
PB - Kommunikations-, Informations- und Medienzentrum der Universität Hohenheim
AD - Garbenstr. 15, 70593 Stuttgart
UR - http://opus.uni-hohenheim.de/volltexte/2018/1512
ER -