Sequencing of Gag/Env association with HIV genotyping resolution and HIV-related epidemiologic studies of HIV in China
HIV genotyping has led to conflicting results between laboratories. Therefore, identifying the most accurate gene combinations to sequence remains a priority. Datasets of Chinese HIV subtypes based on several markers and deposited in PubMed, Metstr, CNKI, and VIP databases between 2000 and 2015 were studied. In total, 9177 cases of amplification-positive samples from 26 provinces of China were collected and used to classify HIV subtypes based on eight individual genes or a combination thereof. CRF01_AE, CRF07_BC, CRF08_BC and B were the prevalent HIV subtypes in China, accounting for 84.07% of all genotypes. Gag/Env sequencing classified a greater number of HIV subtypes compared to other genes or combination of gene fragments. The geographical distribution of Gag and Gag/Env genotypes was similar to that observed with all genetic markers. Further principal component analysis showed a significantly different geographical distribution pattern of HIV in China for HIV genotypes detected with Gag/Env, which was in line with the distribution of all HIV genotypes in China. Gag/Env sequences had the highest diversity of the eight markers studied, followed by Gag and Gag/Pol/Env; Pol/Env polymorphisms were the least divergent. Gag/Env can serve as a high-resolution marker for HIV genotyping.
Infection with human immunodeficiency virus (HIV) typically leads to development of acquired immunodeficiency syndrome (AIDS). HIV is an enveloped RNA virus (Zhang et al., 2015) that destroys the immune system by attacking and subsequently depleting CD4+ T lymphocytes. Highly-active anti-retroviral therapy is an effective therapeutic strategy for reducing the viral load in AIDS patients; however, it cannot completely remove the HIV reservoir from these individuals (Bao et al., 2014; Dai et al., 2015). The high mutation rate of HIV is one of the major factors that has hindered development of an HIV vaccine; this problem is compounded by the enhanced diversity of the virus resulting from its high recombination rate (Su et al., 2000). Thus, understanding the molecular epidemiology of HIV/AIDS is increasingly important as it will shed light on the origin and global distribution of AIDS, and will facilitate strategies designed to prevent the spread of HIV.
The HIV genome consists of two identical positive-stranded RNA molecules, which can be reverse-transcribed into double-stranded DNA; each of the RNA strands contains approximately 9200-9800 nucleotides. HIV contains two long terminal repeats (LTRs), one at each end of its genome, that include cis-regulatory sequences important for the expression of the provirus. In addition, at least nine protein-coding genes are distributed between the LTRs, including three structural genes (Gag, Pol, and Env), two regulatory genes (Tat and Rev), and four accessory genes (Vif, Vpr, Vpu, and Nef) (Sides et al., 2005; Cunha et al., 2012). Whole-genome sequencing of HIV facilitates the identification of HIV subtypes, and this strategy is considered the gold standard for HIV classification (Robertson et al., 2000). However, the high cost of whole-genome sequencing has limited its usage, and thus sequencing of focused genomic regions, such as the Gag, Pol, Env, and Vpr genes or a combination of different gene fragments, is viewed as a relatively cost-effective alternative (Preston et al., 1988; Mansky and Temin, 1995; Lan et al., 2008; Chen et al., 2011; Chen et al., 2012). Three structural genes (Chen et al., 2011) [Gag (Yao et al., 2012; Ye et al., 2012), Env (Pang et al., 2012; Ye et al., 2012; Li et al., 2014a; Holguín et al., 2008; Chen et al., 2014; Yan et al., 2015)] have been widely used for genotyping HIV. Previous studies led to the classification of HIV into two main types, HIV-1 and HIV-2 (Zhang et al., 2015), with HIV-1 being the predominant branch. HIV-1 includes groups M (major), O (outlier), N (non-M, non-O), and P (Vallari et al., 2011; Li et al., 2014b); among these groups, group M accounts for most global HIV cases. In total, 11 genetic clusters (A-K) and more than 72 circulating recombinant forms (CRFs) have been identified based on variations in the Env gene (http://www.hiv.lanl.gov.html).
Significantly, although most HIV strains have been assigned to specific genotypes based on their Gag, Env, and Pol sequences, conflicting genotypes for the same individual have been reported, most likely because of the high mutation rates of the HIV genome (Robertson et al., 1995; Perelson et al., 1996). For example, one patient was reported to be infected with subtype C based on sequencing of the Pol gene and the p17 fragment of the Gag gene; however, the same individual was classified as CRF07_BC with the p24 fragment of the Gag gene, as well as subtype B with the Vpu gene (Robertson et al., 2000; Chen et al., 2013). To avoid such discrepancies, simultaneous analysis of different gene fragments, such as Gag/Env (He et al., 2012; Chen et al., 2013; Yao et al., 2012), Pol/Env (Chen et al., 2012; Pang et al., 2012), and Gag/Pol/Env (Lan et al., 2008; Ye et al., 2012; Zeng et al., 2012; Li et al., 2014a; Dai et al., 2015
To solve this problem, we extracted and analyzed datasets of HIV genotypes found in Chinese AIDS patients since 2000 that had been classified based on sequencing of single genes or combinations of multiple gene fragments. The findings were used to describe the general distribution pattern of HIV genotypes in different Chinese populations. Furthermore, we evaluated the spatial frequency distribution of HIV genotypes based on different markers and estimated the genetic diversity associated with each marker. Our study thus sheds light on the molecular epidemiology of HIV and can be used to guide future studies in this clinically relevant field.
MATERIAL AND METHODS
The datasets for molecular epidemiology of HIV in China between 2000 and 2015 were extracted from the PubMed, Metstr, CNKI, and VIP databases; the last two databases cover Chinese journals and were searched using “HIV”, “HIV-1”, “AIDS”, “molecular epidemiology”, and “China” as keywords. In total, 339 potentially relevant articles were screened, but only 71 articles were subsequently selected based on our strict criteria. Data from the literature include some uncertain subtypes and unique recombinant forms (Tan et al., 2010; Chen et al., 2011; Zhou et al., 2011). We considered 12,613 samples distributed among 43 cities, which covered 26 provinces of China; among them, 9177 cases were assigned to definite haplotypes based on Env, Gag, Pol, Vpr, Gag/Env, Gag/Pol, Pol/Env, and Gag/Pol/Env markers. The original data for these 9177 individuals were retrieved and analyzed in this study (
The datasets were retrieved from the reports in these 71 articles. The frequencies of each genotype with different gene markers were calculated and compared using Microsoft Office (Version 2007). Contour maps of the spatial frequencies were constructed to illustrate the geographical distribution patterns of haplotypes for each of the markers using the Kriging algorithm of Surfer 8.0 (Golden Software Inc., Golden, CO, USA) (Cavalli-Sforza et al., 1994). To evaluate the genetic diversity represented by each fragment or combination of different fragments (Table 1), we used the Arlequin 3.11 software and considered all of the datasets belonging to the same molecular marker as one group (Excoffier et al., 2007). Principal component analysis (PCA) was then conducted based on the haplotype frequencies as described previously (Yao et al., 2002).
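The frequency calculation described above can be illustrated with a minimal Python sketch. This is not the procedure used in the study, which relied on Microsoft Office; the function name `genotype_frequencies` and the sample records are hypothetical and serve only to show how per-marker genotype percentages are derived from individual assignments.

```python
from collections import Counter, defaultdict


def genotype_frequencies(records):
    """Return {marker: {genotype: percent}} from (marker, genotype) pairs."""
    counts = defaultdict(Counter)
    for marker, genotype in records:
        counts[marker][genotype] += 1

    frequencies = {}
    for marker, counter in counts.items():
        total = sum(counter.values())
        # Express each genotype as a percentage of the samples typed with this marker.
        frequencies[marker] = {
            genotype: 100.0 * count / total for genotype, count in counter.items()
        }
    return frequencies


# Hypothetical records: each tuple is one sample typed with one marker.
records = (
    [("Gag/Env", "CRF01_AE")] * 3
    + [("Gag/Env", "B")]
    + [("Pol", "CRF07_BC")] * 2
    + [("Pol", "CRF08_BC")] * 2
)

freqs = genotype_frequencies(records)
for marker, table in freqs.items():
    print(marker, table)
```

Grouping by marker before normalizing mirrors the choice, stated above, to treat all datasets obtained with the same marker as one group.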
Table 1. Haplotype diversity of HIV in China identified with different genes or combinations of gene fragments.
| Gene fragment | Samples | Genotype number | Gene diversity (mean ± SD) |
|---|---|---|---|
| Env | 825 | 16 | 0.7811 ± 0.0087 |
| Pol | 1684 | 21 | 0.6855 ± 0.0097 |
| Vpr | 300 | 7 | 0.6804 ± 0.0171 |
| Gag/Pol | 706 | 11 | 0.6752 ± 0.0154 |
| Pol/Env | 723 | 17 | 0.4661 ± 0.0225 |
The genetic diversity of each gene or combination of two or three gene fragments was calculated from the haplotypes identified with each marker. The Gag/Env combination is associated with the highest genetic diversity (flagged in bold italic numbers), followed by Gag and then the Gag/Pol/Env combination (flagged in bold numbers).
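For readers who wish to reproduce a gene diversity value of the kind reported in Table 1, the following Python sketch may help. It assumes that the statistic is Nei's unbiased estimator, H = n/(n − 1) × (1 − Σ p_i²), which to our understanding is the gene diversity index reported by Arlequin; the standard deviation is not computed here. The function name and the haplotype counts are hypothetical and are not taken from the study data.

```python
def gene_diversity(haplotype_counts):
    """Nei's unbiased gene diversity from a list of per-haplotype sample counts."""
    n = sum(haplotype_counts)
    if n < 2:
        raise ValueError("at least two samples are required")
    # Sum of squared haplotype frequencies (expected homozygosity).
    sum_squared = sum((count / n) ** 2 for count in haplotype_counts)
    # The n / (n - 1) factor corrects for finite sample size.
    return n / (n - 1) * (1.0 - sum_squared)


# Hypothetical marker with three haplotypes observed 50, 30, and 20 times.
diversity = gene_diversity([50, 30, 20])
print(round(diversity, 4))
```

A marker whose samples all share one haplotype yields a diversity of 0, whereas a marker with many haplotypes at even frequencies approaches 1, which is why the index is used here to compare the resolution of markers.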
General profile of HIV molecular epidemiology in China
We classified 9177 samples into discrete haplotypes based on the subtype information from eight markers, comprising individual genes and combinations of two or three gene fragments. These markers were Env, Gag, Pol, Vpr, Gag/Env, Gag/Pol, Pol/Env, and Gag/Pol/Env, which have been widely used to assign HIV haplotypes in China. With the exception of some individuals who could not be confidently assigned, most cases were assigned to 40 discrete haplotypes (
Histogram of the predominant HIV haplotypes in China as identified with different markers.
To evaluate the resolution of the aforementioned 8 genes or combination of two or three gene fragments for HIV genotyping, the genotype for each marker was determined, as shown in
Population structure based on HIV haplotypes in China with different markers
To determine which marker was optimal in terms of describing the landscape of HIV haplotypes in China, spatial distribution patterns were constructed by considering all of the patients screened with different markers as a group, and by considering the haplotype frequency as the input factor (Figure 2a). The distribution patterns of other markers were also constructed by considering all of the patients detected with the same marker as one group (Figure 2b-d and
Spatial frequency distributions of HIV haplotypes identified with different genes or combinations thereof. The spatial frequency distributions were created using the Kriging algorithm of the Surfer 8.0 package. The original absolute frequencies are listed in
Spatial distribution pattern of HIV in China
We next examined whether or not the molecular epidemic spectrum of HIV in China as indicated by the Gag/Env marker was consistent with the general spectrum of HIV in China. To this end, samples from 9177 patients, whose viral infections had been classified using different markers, were considered as a group, and PCA was performed to derive a clustering pattern for HIV-infected groups from different regions of China (Figure 3a). As shown in Figure 3a, a general principal component (PC) map of HIV in China derived from the first two PCs accounted for 68.65% of the total variation. An obvious geographical distribution pattern of HIV haplotypes was observed; the first PC separated groups in eastern China (such as Shanghai, Zhejiang, and Guangdong) from those of other regions, such as populations from northern (Beijing, Shaanxi, Hebei, and Henan) and southern (Yunnan and Guangxi) China. The second PC contributed to the south-to-north cline; groups from Yunnan and Guangxi and those from Beijing, Shaanxi, Henan, Hebei, and Xinjiang were located between the groups from southern and northern China. We also derived a PCA plot by screening groups with other genes or combinations of gene fragments, as shown in Figure 3b, c and d. The PC maps based on the haplotypes identified with Gag/Env (Figure 3b), Gag/Pol/Env (Figure 3c), and Gag (Figure 3d) had distribution patterns similar to those of the general molecular epidemic spectrum of HIV in China, and were different from those identified by analysis of other single genes or combinations, including Env, Pol, Vpr, Gag/Pol, and Pol/Env (
PCA plot of HIV-infected populations in China.
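The proportion of variation captured by the first two PCs can be illustrated with a small pure-Python sketch. It is not the analysis pipeline used in this study: it assumes PCA on the covariance matrix of regional haplotype frequencies and obtains the leading eigenvalues by power iteration with deflation, a technique chosen here only to keep the example dependency-free. The frequency table and all function names are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance of the columns (haplotypes) across rows (regions)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector by power iteration."""
    size = len(matrix)
    # Non-uniform start: frequency rows sum to 1, so the all-ones vector
    # lies in the null space of the covariance matrix.
    vec = [float(i + 1) for i in range(size)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(size))
                   for i in range(size)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vec = [x / norm for x in product]
    # Rayleigh quotient for the (unit-length) converged vector.
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(size))
                for i in range(size))
    return value, vec


def explained_variance_ratios(rows, components=2):
    """Share of total variance carried by each of the first PCs."""
    cov = covariance_matrix(rows)
    size = len(cov)
    total = sum(cov[i][i] for i in range(size))
    ratios = []
    for _ in range(components):
        value, vec = leading_eigenpair(cov)
        ratios.append(value / total)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(size)]
               for i in range(size)]
    return ratios


# Hypothetical haplotype frequencies (columns) for four regions (rows).
frequencies = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.1, 0.2],
    [0.3, 0.4, 0.3],
]
ratios = explained_variance_ratios(frequencies)
print([round(r, 4) for r in ratios])
```

In this toy table with three haplotypes the first two PCs necessarily carry all of the variation; with the roughly 40 haplotypes considered here, the corresponding figure for the general map was 68.65%.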
Statistical index of different markers on identifying HIV
As shown in Table 1 and
HIV infection is the causative step in the development of AIDS. Thus, genotyping of HIV and further delineating its geographical distribution pattern will help to limit diffusion of HIV strains, as well as enhance the understanding of the mechanisms underlying resistance to anti-HIV drugs and provide data that can be used to inform vaccine development (Preston et al., 1988; Perelson et al., 1996; Robertson et al., 1995; Robertson et al., 2000). Whole-genome sequencing is considered the gold standard for HIV classification (Robertson et al., 1995; Robertson et al., 2000), but the high cost of this method has limited its widespread adoption. Instead, many groups have opted to sequence single genes or combinations of two or three genes to reduce costs. In this regard, the eight markers Env, Gag, Pol, Vpr, Gag/Env, Gag/Pol, Pol/Env, and Gag/Pol/Env have often been used to subtype HIV in China (Lan et al., 2008; Chen et al., 2012; Pang et al., 2012; Yao et al., 2012; Ye et al., 2012; Zeng et al., 2012; Li et al., 2014a; Dai et al., 2015). However, the high mutation and recombination rates associated with the HIV genome (Robertson et al., 1995; Robertson et al., 2000) have led to ambiguous and sometimes contradictory genotyping results, even in the same individual (Robertson et al., 2000; Qiu et al., 2005; Chen et al., 2013). This has limited further understanding of the molecular epidemiology of HIV in China, one of the most affected countries in the world.
By extensively dissecting the genotypes of 9177 HIV-infected Chinese patients using the eight markers described above, we detected approximately 40 haplotypes of HIV. CRF01_AE (37.97%), CRF07_BC (16.02%), CRF08_BC (15.03%), and B (15.05%) comprised the majority of haplotypes (84.07%). Genotyping based on Gag/Env identified 22 haplotypes, which accounted for 55% of the total haplotypes detected throughout China; this marker was much more robust than any of the other seven genes or combinations thereof. Furthermore, by considering the datasets obtained with different markers as a whole, we were able to analyze the spatial and regional distribution of HIV strains across China. There was a distinct south-to-north cline pattern, indicating that HIV strains developed independently in China. The intermediate position of Xinjiang Province in the principal component map implies that admixture of HIV subtypes from northern and southern China occurred at this location. This pattern was supported by the datasets derived from sequencing of Env and Gag/Env, which revealed that Gag/Env accounted for the highest gene diversity among all markers tested. This may be explained by the relative mutation rates of Gag and Env, which encode two of the three core structural proteins of HIV (the core protein and the envelope protein, respectively) (Sides et al., 2005; Cunha et al., 2012). However, these two genes have distinct mutation rates. The former is relatively conserved and has a low mutation rate of 6.0% (Su et al., 2000), which is helpful for identifying basal mutations among the HIV genomes. In contrast, the Env gene is associated with a higher mutation rate than the other genes studied, and indeed has the highest evolutionary rate (30%) (Su et al., 2000). This underlies the diversity of Env sequences, and in part explains why Env is not a strong marker for subtype classification.
However, combined analysis of Gag and Env compensates for the drawbacks of each individual gene and significantly improves the subtype marker classification power. Moreover, Pol had a lower mutation rate (3%) than the other genes (Su et al., 2000), suggesting that either Gag or Pol can serve as a key marker for identifying HIV subtypes; data from sequencing of these genes can then be merged with Env data to further increase the robustness of HIV classification. Our PCA of datasets derived from screening of Gag/Env and Gag/Pol/Env revealed a similar geographic distribution pattern of HIV in China relative to the dataset including all the HIV groups. This further confirmed that the combined analysis of Gag/Env provides the highest-resolution marker with respect to HIV genotyping. In addition, the separation of southern, northern, and eastern HIV groups in China, as well as the south-to-north cline of HIV groups, implies that the different strains of HIV were initially introduced independently of one another and subsequently dispersed throughout China. These data will be valuable in developing strategies to prevent the spread of HIV.
In this study, the polymorphism analysis of Gag/Env of HIV in China, which showed relatively higher genetic diversity, suggested that this fragment may serve as an effective biomarker for genotyping of HIV in this region. However, we only reanalyzed previously published data to gain more detailed information regarding the epidemiology of HIV in China. More sequences of focused genomic regions and of combination fragments need to be analyzed and compared with whole-genome sequencing data to develop a database system for genotyping analysis of HIV (Araújo et al., 2006).