Genome wide selection in Citrus breeding
Genome wide selection (GWS) is essential for the genetic improvement of perennial species such as Citrus because of its ability to increase gain per unit time and to enable the efficient selection of characteristics with low heritability. This study assessed GWS efficiency in a population of Citrus and compared it with selection based on phenotypic data. A total of 180 individual trees from a cross between Pera sweet orange (Citrus sinensis Osbeck) and Murcott tangor (Citrus sinensis Osbeck x Citrus reticulata Blanco) were evaluated for 10 characteristics related to fruit quality. The hybrids were genotyped using 5287 DArT_seqTM (diversity arrays technology) molecular markers and their effects on phenotypes were predicted using the random regression - best linear unbiased predictor (rr-BLUP) method. The predictive ability, prediction bias, and accuracy of GWS were estimated to verify its effectiveness for phenotype prediction. The proportion of genetic variance explained by the markers was also computed. The heritability of the traits, as determined by markers, was 16-28%. The predictive ability of these markers ranged from 0.53 to 0.64, and the regression coefficients between predicted and observed phenotypes were close to unity. Over 35% of the genetic variance was accounted for by the markers. Accuracy estimates with GWS were lower than those obtained by phenotypic analysis; however, GWS was superior in terms of genetic gain per unit time. Thus, GWS may be useful for Citrus breeding as it can predict phenotypes early and accurately, and reduce the length of the selection cycle. This study demonstrates the feasibility of genomic selection in Citrus.
The Citrus genus is a major food and economic crop worldwide, and many strategies have been adopted to increase its productivity and quality. Breeding is the preferred strategy to address these issues. When referring to Citrus breeding, it is important to note that the citrus plant itself consists of two parts, the canopy and the rootstock, which are often from two different species and require targeted studies to improve their individual quality and interaction. Phenotypic selection is generally applied, but this is costly and time consuming, especially when the desired traits are expressed at later stages (Machado et al., 2011).
Marker-assisted selection (MAS) was proposed by Lande and Thompson (1990) to obtain faster gains and increase selection efficiency with respect to selection based on phenotypic data only. MAS utilizes phenotypic data and molecular markers in genetic linkage with certain loci controlling desired quantitative traits (quantitative trait loci, QTL). QTL markers are selected following appropriate statistical modeling and are subjected to testing for type II error, i.e., the probability of accepting a false hypothesis (Resende et al., 2014).
In Citrus breeding, the MAS strategy has been adopted to select several traits of economic importance such as resistance to biotic and abiotic stress (Ito et al., 2014) and physicochemical characteristics of fruit. However, the identification of QTL associated with such characteristics only explains a small fraction of the observed genetic variation in each character (Siviero et al., 2002; Siviero et al. 2006; Gussen et al., 2011; Asins et al., 2012); therefore, selection based on these markers is not feasible. The rationale for not using MAS is that traits under selection are governed by many loci with small effects that do not reach statistical significance when used in traditional QTL analysis, and this in turn leads to the identification of only a small number of QTL with major effects (Kemper and Goddard 2012; Resende et al., 2014).
Genome-wide selection (GWS) was first proposed by Meuwissen et al. (2001) and is a form of molecular-marker assisted selection that simultaneously predicts the genetic effects of hundreds or thousands of DNA markers on phenotypes (Resende et al., 2008). These markers, which are in linkage disequilibrium (LD) with the QTL, can have both large and small effects and can account for nearly all of the genetic variation within a quantitative trait (Resende et al., 2008). Additionally, whole genome prediction tends to be more accurate because it better accommodates the variation created by Mendelian segregation during gamete formation (Resende et al., 2008; Zapata-Velenzuela et al., 2013).
The increasing availability and decreasing costs of molecular markers spanning the whole genome has led to the implementation of large-scale genomic selection in breeding programs for different species (Misztal et al., 2009; Zhong et al., 2009; Daetwyler et al., 2010; Grattapaglia and Resende, 2011; Resende et al., 2012; Resende Júnior et al., 2012). GWS has promise for use in breeding programs because of the efficient selection of traits with low heritability, permitting the best use of available genetic resources by selecting appropriate genetic crosses, and increasing genetic gain per unit time as it enables early identification of the best individuals for selection (Legarra et al., 2008; Daetwyler et al., 2013; Resende et al., 2014).
In Citrus species, the juvenile period is long, ranging from 1 to 20 years; although flowering and fruiting may occur within 3-7 years depending on the species. Therefore, there is a delay in the hybridization process and in the selection of desired characteristics in breeding programs. In addition, large areas are required for the development of hybrids, which not only increases the cost of plant maintenance in the field but also limits the number of families and individuals per family that can be evaluated (Talon and Gmitter Junior, 2008; Gmitter Junior et al., 2012). GWS can minimize these undesirable consequences as selection can be performed in juvenile plants, which reduces the interval between generations, increases the intensity of selection and, therefore, the gain per unit time and cost (Resende et al., 2008; Wong and Bernardo, 2008; Heffner et al., 2009; Grattapaglia and Resende, 2011; Kemper and Goddard, 2012).
GWS has rarely been applied to fruit trees, being restricted to apples (Kumar et al., 2012), pears (Iwata et al., 2013), and grapes (Viana et al., 2016). To our knowledge, the present study represents the first attempt to investigate the feasibility of genomic selection in Citrus. This research was carried out to apply GWS to a biparental hybrid population of Citrus using DArT_seqTM markers and phenotypes of 10 traits, to assess its selection efficiency, and compare it to selection based on phenotypic data.
MATERIAL AND METHODS
Characterization of the population and phenotypic traits
The genetic effects of the markers were predicted in a hybrid population of Citrus, which is used as canopy. The population composed by 180 individuals was generated by crossing Pera sweet orange (Citrus sinensis Osbeck) and Murcott tangor (Citrus sinensis Osbeck x Citrus reticulata Blanco).
The experiment was established in 2004 at Cordeirópolis, SP, in a completely randomized design with three clonal replicates. Phenotypic evaluation was performed in 2012 for the following characteristics: fruit mass (g), longitudinal fruit diameter (cm), transverse fruit diameter (cm), ratio of longitudinal fruit diameter (cm)/transverse fruit diameter (cm), number of segments per fruit, number of seeds per fruit, juice yield (mL), soluble solids (ºBrix), ratio of soluble solids/acidity, and number of fruits per box.
A sample from each replicate, consisting of five fruits each, was used to obtain phenotypic data. Fruit mass was measured using a digital balance with a capacity of 15 kg and accuracy of 5 g; the longitudinal and transverse fruit diameters were measured using a graduated ruler; the number of segments per fruit and seed number were directly counted; the number of fruits per box was obtained by dividing capacity of the box (40.8 kg) by the average weight of the fruit; juice yield was measured after crush extraction in an OIC model OTTO 1800 and the juice/fruit ratio was calculated. The soluble solids content (ºBrix) was determined by direct reading on a refractometer (B & S model RFM 330) and acidity was determined by titrating 25 mL juice against normal sodium hydroxide solution with phenolphthalein as an indicator. The soluble solids/acidity ratio, which indicates the degree of fruit ripening, was also calculated. All analyses were performed in the Quality and Postharvest Laboratory of the Center APTA Citrus Sylvio Moreira/IAC, Cordeirópolis, SP.
Statistical model for the analysis of phenotypic data
Phenotypic data were analyzed to estimate genetic variance and heritability using the REML/BLUP method, also called the mixed model methodology (Henderson, 1973), using the Selegen REML/BLUP software (Resende, 2002). The following model was fit: y = Xu + Zg + e; where, y is the vector of observations; u is the scalar related to overall mean (fixed effect); g is the vector of genotypic effects (assumed to be random); and e is the residual vector (random). The capital letters (X, Z) represent incidence matrices for these effects. The structures of means and variances associated with the model are: y | u, V~N(Xu, V); g |
is a penalization parameter; and
is the individual broad-sense heritability.
Variance components were estimated by restricted maximum likelihood (REML) (Patterson and Thompson, 1971) iterating the mixed model equations. The prediction accuracy of genotypic values for clone selection was calculated from the inverse of the left-hand side of the mixed model equations (Resende, 2002).
Genotyping and quality control of the markers
Molecular assessment of 180 individuals in the hybrid Citrus population was performed using the diversity arrays technology (DArT) method (Jaccoud et al., 2001), which reduces genome complexity by using a combination of restriction enzymes and a next-generation sequencing technique called DArT_seqTM. Total DNA was extracted, quantified on a Nanodrop-8000 spectrophotometer, normalized to 100 ng/μL, distributed onto plates, and submitted to genotyping on the Diversity Arrays Technology platform (DArT P/L, Australia).
The method used to obtain the DArT_seqTM markers involved reducing genome complexity using the restriction enzymes PstI and TaqI, and subsequent sequencing. PstI and adapters specific for 96 different bar codes were linked to the restriction fragments. The resulting products were amplified and their quality was checked. Samples were sequenced on an Illumina Hiseq2000 machine. The PstI adapters included a sequencing primer such that the sequences were always read from the PstI restriction sites. The resulting sequences (FASTQ files) were filtered for quality, with a confidence cut-off of 90%. Sequences were aligned with the Clementine tangerine reference genome available at
Quality control of the markers was performed by eliminating low polymorphic loci [minor allele frequency (MAF) ≤ 5%] and/or loci with low call rate in the genotyped individuals (call rate ≤ 95%). MAF was calculated using the equation: MAF = min[p, (1 - p)], where p is the allele frequency in a locus; and call rate was calculated using the expression: Call Rate = 1 - M / P, where M refers to individuals with missing genotypes and P refers to individuals with present genotypes (not missing).
Predicting genetic effects of markers using random regression (RR)-BLUP
The RR-BLUP method (Meuwissen et al., 2001) was used to predict the genetic effects of the markers. This method fits the random effects of markers as a covariate with effects on phenotypes (Resende et al., 2012).
The estimators associated with this random regression procedure promote shrinkage on the marker effects as a function of the penalty parameter λ, which makes it possible to estimate a higher number of parameters than the number of data points. The penalty parameter is given by:
The following general linear mixed model was designed to predict the effects of the markers: y = Xu + Wma + Smd + e; where y is the vector of phenotypic observations; u is the vector of fixed effects (overall means); ma is the vector of additive genetic marker effects; md is the vector of dominant genetic marker effects; e is the vector of random errors; X, W, and S are the incidence matrices for u, ma, and md, respectively.
Parameterizations of additive and dominant marker effects with dominant markers such DArT were performed according to the method described by Viana and Resende (2014). In an allogamous population in Hardy-Weinberg equilibrium (HWE), appropriate parameterization in the W matrix, associated with additive effects, is ‘0’ for the absence of the marker (mm type) and 1.78 for the presence of the marker (MM or Mm types). In the S matrix along with the effects of dominance, parameterization is as follows: ‘0’ for the absence of the marker (mm) and 0.89 for the presence of marker (MM or Mm). Assuming HWE, the additive genetic variation of the character in the population is given by:
The means and variance structure is defined as: ma ~ N(0, WW’
The mixed model equations to predict ma and md using the RR-BLUP method are:
The genomic values (GV) of individuals were estimated using the equation: GV = Wma + Smd. All analyses were performed on the R software, version 3.1.2 (R Development Core Team, 2012) using the RR-BLUP package (Endelman, 2011).
Predictive ability, prediction bias, and accuracy of GWS
The predictive ability (ryg) was obtained by correlating the corrected phenotypic values with predicted genomic values and was determined by cross-validation through a Jackknife procedure. Prediction bias (b) was calculated as the regression coefficient of phenotypic values on the predicted genomic values, wherein ‘b’ values close to 1 indicate that predictions are not biased and are effective in predicting the actual magnitude of difference between the individuals being assessed (Resende et al., 2012).
The experimental accuracy of GWS was obtained by the estimator: rgg = ryg/h, in which rgg corresponds to the accuracy of GWS; ryg the predictive ability; and h the square root of the heritability of the character being studied (Legarra et al., 2008).
The expected accuracy was calculated using the formula proposed by Resende et al. (2008):
The estimated number of QTL controlling each character was calculated by the expression
Efficiency of GWS
The superiority of GWS over selection based on phenotypic data only in a perennial species, is mainly related to the gain per unit time (Resende et al., 2012). To confirm this, the gain in selective efficiency of GWS, compared to selection based only on phenotypes, was calculated using the expression:
RESULTS AND DISCUSSION
After quality control, 5287 markers were found to be useful and their effects were estimated. The estimated genome size of the sweet orange is, according to Jarrell et al. (1992), 1700 cM. Therefore, the density of markers used in the analyses was ~3 markers per cM, which corresponds to a distance of 0.32 cM between markers. This value was used to calculate the proportion of variance that could be explained by the markers (
Expected proportion of genetic variance explained by markers (
Estimates of genomic heritability (
Estimates of genomic heritability (h2G
, predictive ability (ryg), prediction bias (b), accuracy of GWS (rgg), quantitative trait loci (QTL) number estimates (nQTL), and proportion of genetic variance explained by the markers (r2mq when selected based on the magnitudes of their effects in a breeding Citrus population.
h2: Heritability based on phenotypes;
The fraction of trait heritability attributable to the markers (ω) varied from 20% (ratio of soluble solids/acidity) to 53% (ratio of longitudinal/transverse fruit diameter). The number of markers necessary to provide these fractions ranged from 100 for soluble solid content to 2000 for the number of seeds (Table 1). As the number of markers increased, the accuracy values described by these markers tended to remain constant once a maximum value was reached.
The observed missing heritability, defined as the fraction of heritability not accounted for by markers, is probably due to incomplete LD among the alleles causing genetic variation and the markers used for heritability estimation. LD may be lost when the alleles causing genetic variation have lower MAF values than the marker (Kemper and Goddard, 2012).
The predictive ability of GWS, which reflects the ability of molecular information to consistently predict a phenotype (Cavalcanti et al., 2012), ranged from 0.53 for the ratio of soluble solids/acidity and soluble solids traits, to 0.64, for the longitudinal diameter of fruit, fruit mass, and number of fruits per box (Table 1). These magnitudes are similar to those reported for forest trees species (Resende et al., 2012, Resende Junior et al., 2012) and dairy cattle breeding (Hayes et al., 2009).
The regression coefficient (b), which is a measure of prediction bias, showed values close to unity, implying there was no prediction bias. However, the characteristic soluble solids had a ‘b’ value of 0.74, indicating that genetic variance was overestimated (Resende et al., 2012).
Estimates of GWS accuracy ranged from 0.59 to 0.91, implying moderate-to-high accuracy. According to the classification proposed by Resende and Duarte (2007), accuracy estimates <50% are defined as low, while those between 50 and 70% are defined as moderate, and those between 70 and 90% are defined as high. Conversely, estimates of accuracy based on phenotypic selection ranged from 0.70 to 0.90, implying high accuracy. Comparison of these values shows that accuracy estimates based on phenotypic selection were superior to those obtained by GWS for all characteristics, except for the ratio of the longitudinal/transverse fruit diameter trait.
The experimental accuracy values were close to the expected accuracy values (Table 1). The formula used to obtain expected accuracy values is important in the design of GWS studies as it can be used to calculate the number of individuals and markers required to obtain the desired accuracy. Using the formula for expected accuracy, it was observed that for a characteristic with the lowest heritability (ratio of longitudinal/transverse diameter; 0.49), at least 250 individuals and 4000 markers would be necessary to obtain an accuracy value greater than 0.70. Conversely, for traits with high heritability, such as fruit mass and ratio of soluble solids/acidity, the use of 180 individuals and 1000 markers yielded a much higher accuracy of 0.81. According to Grattapaglia and Resende (2011), accuracy calculated using the deterministic approach is directly proportional to the product of heritability and the ratio between the number of individuals used and the number of QTL governing the trait. In a simulation study, these authors observed that decreased accuracy with decreasing heritability could be compensated for by using a larger number of individuals in the training population.
The proportion of variance explained by the markers ranged from 36% (ratio of soluble solids/acidity) to 84% (ratio of longitudinal fruit diameter/transverse fruit diameter). The genetic variance not explained by these markers is, therefore, probably due to incomplete LD between these markers and the alleles causing genetic variation in this character. GWS could explain a greater proportion of the observed variance in traits compared with methods based on the identification of individual QTL. This was observed by Viana et al. (2016) who compared MAS and genomic selection in a grape breeding population consisting of 143 individuals obtained from a cross between (Vitis rupestris x Vitis arizonica/girdiana) x Vitis vinifera. The efficiency of genomic selection was 1.7 to 11.6-fold higher than that obtained by traditional QTL analysis.
In Citrus culture, the use of MAS also permitted the identification of a few QTL with major effects on traits of interest. Siviero et al. (2002) found only one QTL associated with the quantitative inheritance of characteristics (number of fruits per plant and number of seeds per fruit) in a F1 progeny derived from a cross between Citrus sunki and Poncirus trifoliata 'Rubidoux'. A study on root-rot resistance caused by Phytophthora in progeny derived from a cross between C. sunki and P. trifoliata (Siviero et al., 2006) provided two QTL maps for the species P. trifoliate, which could explain 16-24% of the genetic variation. In addition, only one QTL was mapped for the species C. sunki, which could only account for 14% of the variation. Asins et al. (2012) performed a QTL analysis in a segregating population of C. grandis and C. clementine to evaluate resistance to Citrus tristeza virus and identified a QTL that contributed to about 24% of the total genetic variance.
Considering the five markers with large effects for each trait, only 21.5-31.36% of the variance was explained. Those values only account for part of the genetic variation and further confirm the quantitative nature of these characteristics, which are governed by many genes of small effect and, probably, by some genes with moderately large effects (Kemper and Goddard, 2012).
The estimated number of QTL controlling a trait varied from 11 (ratio of longitudinal fruit diameter/transverse fruit diameter) to 219 (ratio of soluble solids/acidity) (Table 1). There have been no reports in the literature on the number of loci that control these traits, as the studies conducted so far with MAS permitted the detection of only one or more loci with major effects. These results confirm the quantitative nature of the traits studied and reiterate that the accuracy of GWS is inversely proportional to the number of QTL identified (Wong and Bernardo, 2008). In a study conducted by Resende et al. (2012) in two populations of Eucalyptus, 200 markers with large effects were sufficient to account for more than 80% of the heritability of the traits analyzed. An increase in the number of markers did not result in a linear increase in the accuracy of GWS (Figure 2).
Accuracy of genome wide selection (GWS) and expected accuracy with an increase in the number of markers for traits evaluated in a population of Citrus.
The accuracy of GWS depends on the proportion of the variance that can be explained by the markers, and the accuracy of predictions of marker effects that are in LD with the QTL. Thus, the main factors that modify the accuracy of GWS are the heritability of the trait, the number of loci that control this trait and the distribution of its effects, the number of individuals in the training population, and the extent of LD between the markers and the QTL, which, in turn, depends on the effective population size and the number of markers used for the analyses (Grattapaglia and Resende, 2011).
The importance of the number of individuals used to estimate the effect of the loci is shown by the accuracy formula. Therefore, the higher the number of individuals used, the more accurate the genomic value estimates for those individuals (Hayes et al., 2009). Thus, the use of only 180 individuals to predict the effects of the markers probably contributed to the moderate accuracy. Of note, the number of individuals used in this study was acceptable for the selection of superior clones in full-sib families since this number represents 99.45% of the genetic variability of the family. This fraction is given by Nef / 2, where Nef = 2n / (n + 1) is the effective population size of a full-sib family and n is the number of individuals per family.
According to Resende et al. (2012), when more individuals per family and more families are analyzed, the power of detection increases asymptotically and more QTL are revealed. Zhao et al. (2013) analyzed accuracy values based on the predicted effects of genetic markers in the entire maize population consisting of 11 families or those within each family. Those authors reported that the estimates of accuracy were substantially reduced when individual GV were predicted within a family compared with those predicted in the entire population. However, when the genetic effects were estimated and the GV was predicted within each family (for large families), the accuracy was satisfactory, as seen by the results presented here. Cavalcanti et al. (2012) confirmed that within families, GWS is an interesting alternative that can be used to increase the breeding efficiency of cashew. In that study, the authors obtained a high accuracy value (0.86) for a characteristic with high heritability with only 74 individuals and a limited number of markers (238).
One of the assumptions of GWS is that the phases of LD between markers and QTL are identical in the populations used for training and selection (Hayes et al., 2009). As genotyping and phenotyping were performed using individuals of one family, it is likely that they did not cover the extent and pattern of the LD present in all Citrus populations used in the breeding program; therefore, the use these results in other families or populations is not recommended.
The small effective size of the Citrus population used in this study (Ne = 2) and the large number of markers used have contributed favorably to the accuracy values obtained. The impact of the effective size and number of markers used to predict GWS accuracy has been demonstrated by Grattapaglia and Resende (2011) using simulated data. Resende et al. (2012) applied GWS to two Eucalyptus breeding populations and found that higher accuracy estimates were observed in the population that had the lowest Ne. This is because an increase in the effective size reduces the accuracy of genomic predictions due to an increase in the number of chromosome segments and because LD between the marker and the causative mutation only remains consistent if the two occur very close to each other on the chromosome (Kemper and Goddard, 2012).
GWS requires a high density of markers in order to maximize the number of QTL that are in LD with at least one marker (Heffner et al., 2009; Resende et al., 2014). Resende et al. (2008) demonstrated the importance of using a high density of markers to account for markers of both large and small effects, and thereby to increase the probability of explaining all of the genetic variation for the trait of interest. The results of this study are consistent with those claims since the proportion of variance explained by the markers was higher than that in studies that used a smaller number of markers. Table 2 compares the efficiency of GWS with selection based solely on phenotypic data.
Gains in efficiency of GWS for selection based solely on phenotypic data in a breeding population of Citrus.
|Trait||Reduction in the selection cycle|
DL: longitudinal fruit diameter (cm); DT: transverse fruit diameter (cm); DL/DT: ratio of longitudinal fruit diameter/transverse fruit diameter; FM: fruit mass (g); F/B: number of fruits per box; NG: Number of segments per fruit; NS: Number of seeds per fruit; JY: juice yield (mL); SS: soluble solids content; Rt: ratio of soluble solids/acidity.
When reducing the duration of the selection cycle by 25%, genetic gain with GWS was higher than that obtained by phenotypic selection for all traits, except for soluble solids and the ratio of soluble solids/acidity (Table 2). When time was reduced by 50%, GWS was superior (ranging from 31 to 160%) for all traits. When time was reduced by 75%, GWS improved (ranging from 162 to 420%). The higher gains in efficiency obtained for the ratio of longitudinal fruit diameter/transverse fruit diameter with GWS further confirm the importance of GWS in the selection of characteristics that exhibit low heritability.
Heffner et al. (2009) stated that even when only moderate accuracy is achieved using GWS, it is possible to obtain a genetic gain that is superior to that obtained by phenotypic selection, as GWS reduces the duration of the selection cycle. This was clearly observed in the present study, in which the duration of the selection cycle was reduced. Wong and Bernardo (2008) reported that GWS implementation shortened the selection cycle of oil palm from 19 to 6 years, which authors claimed could also be extrapolated to other perennial species. That study used a population of 50 individuals, which is the typical size for an oil palm breeding population. GWS yielded higher gains per unit time and reduced the costs even with this small number of individuals. Importantly, GWS has been shown to be a promising tool for genetic improvement when applied to traits of economic interest that are difficult to measure and/or have a high cost of evaluation (tolerance to cold, drought, disease resistance) (Grattapaglia and Resende, 2011). These kinds of traits should be studied using genomic selection in Citrus breeding.
The results presented here suggest that GWS can be useful for Citrus breeding as it can predict the phenotypes accurately and timeline, and reduce the length of the selection cycle. Thus, GWS is a promising tool for improving Citrus culture. This may be the first study to demonstrate the feasibility of genomic selection in Citrus.