gwas

Differences

This shows you the differences between two versions of the page.

gwas [2019/07/11 17:28]
sylvia
gwas [2021/05/27 17:21] (current)
trynke
Line 1: Line 1:
-''''====== GWAS ======
-IN PROGRESS...
+====== GWAS ======
+
+GWAS is one of Lifelines' [[additional assessments]] and contains the genome-wide [[https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism|SNP]] data of 15,400 adult Lifelines participants, derived from the [[https://www.illumina.com/products/by-type/clinical-research-products/human-cytosnp-12.html|Illumina CytoSNP-12v2 array]].\\
+The name GWAS might be confusing: the assessment is not a [[https://en.wikipedia.org/wiki/Genome-wide_association_study|"genome-wide association study"]] in itself, but rather a database of SNP data that can be used, together with Lifelines phenotype data, to perform such studies.\\
+Note that a second set of participants is genotyped in the [[UGLI]] project.
  
 ===== Subcohort =====
  
-Genome-wide genotype data based on the Illumina CytoSNP-12v2 array are currently available for 15.400 participants. All GWAS participants are independent (no biological family relations), Caucasian-ancestry samples of adults which were collected at [[1a_visit_2|Baseline assessment second visit]].
+The GWAS subcohort consists of 15,400 independent (no biological family relations), Caucasian-ancestry adult Lifelines participants. DNA samples were collected at [[1a visit 2]].
  
 Table 1: General information on the GWAS subcohort. *Age at Baseline 2nd visit.
Line 16: Line 19:
 | Age category 65+                     | N=1510         |
  
+{{:gwas_age_distribution.jpg?400|}}
  
-===== SNP genotyping =====
-
-SNP genotyping is the measurement of genetic variations of single nucleotide polymorphisms (SNPs) between members of a species.\\ NEEDS WORK!!\\
-The Illumina human CytoSNP12 Beadchip has been chosen to study genetic variations in the LifeLines cohort study.
-
-==== SNP array ====
-
-...EXPLAIN WHAT a SNP array is...\\
+===== SNP array =====
+The Illumina human CytoSNP12 Beadchip was chosen to study genetic variations in the LifeLines GWAS cohort.
 The 12-sample HumanCytoSNP-12 BeadChip is a powerful, whole-genome scanning panel designed for efficient, high-throughput analysis of genetic and structural variations that are the most relevant to human disease. Many types and sizes of structural variation in the human genome that affect phenotypes can be detected with the HumanCytoSNP-12 BeadChip, including duplications, deletions, amplifications, copy-neutral LOH, and mosaicism. This BeadChip includes a complete panel of genome-wide tag SNPs and markers targeting all regions of known cytogenetic importance. It incorporates 200,000 SNPs which cover around 250 genomic regions commonly screened in cytogenetics laboratories, including subtelomeric regions, pericentromeric regions, sex chromosomes, and targeted coverage in around 400 additional disease-related genes.
  
-====Quality checks====
-
-Quality controls of the data are based on SNP filtering on minor allele frequency (MAF) above 0.001, Hardy-Weinberg equilibrium (HWE) P-value >1e-4, call rate of 0.95 using Plink ((Purcell S Neale B Todd-Brown Ket al. . PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet  2007;81:559–75)), and principal component analysis (PCA) to check for population outliers ((Salome Scholtens, Nynke Smidt, Morris A Swertz, Stephan JL Bakker, Aafje Dotinga, Judith M Vonk, Freerk van Dijk, Sander KR van Zon, Cisca Wijmenga, Bruce HR Wolffenbuttel, Ronald P Stolk, Cohort Profile: LifeLines, a three-generation cohort study and biobank, International Journal of Epidemiology, Volume 44, Issue 4, August 2015, Pages 1172–1180, https://doi.org/10.1093/ije/dyu229)). Description and criteria for quality checks are listed in the Table 2.
+===== Quality checks =====
+
+Quality control of the data is based on SNP filtering with PLINK ((Purcell S, Neale B, Todd-Brown K, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81:559–75)), using a minor allele frequency (MAF) above 0.001, a Hardy-Weinberg equilibrium (HWE) P-value > 1e-4 and a call rate of 0.95, and on principal component analysis (PCA) to check for population outliers. Descriptions and criteria for the quality checks are listed in Table 2.
  
 Table 2: Description and criteria for quality checks
Line 58: Line 55:
 |                                                                     ​| ​                                                                                            ​| ​                                                                                                                                                                                                                                                                     | |                                                                     ​| ​                                                                                            ​| ​                                                                                                                                                                                                                                                                     |
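
The SNP-level filters named in the quality checks paragraph can be illustrated with a small sketch. This is not the Lifelines pipeline, which used PLINK: the function name ''snp_qc'' and its inputs are hypothetical, the call rate of 0.95 is assumed to be a minimum, and the HWE P-value here comes from a simple chi-square test with one degree of freedom, whereas PLINK reports an exact test.

```python
import math

# Thresholds quoted in the text; reading the call rate as a minimum is an assumption.
MAF_MIN = 0.001
HWE_P_MIN = 1e-4
CALL_RATE_MIN = 0.95


def snp_qc(n_aa, n_ab, n_bb, n_missing):
    """Return (maf, call_rate, hwe_p, passed) for one biallelic SNP.

    The arguments are the numbers of samples with genotype AA, AB, BB,
    and with no genotype call.
    """
    n_called = n_aa + n_ab + n_bb
    n_total = n_called + n_missing
    call_rate = n_called / n_total if n_total else 0.0
    if n_called == 0:
        return 0.0, call_rate, 0.0, False

    # Frequency of allele A and the minor allele frequency.
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1 - freq_a)

    if maf == 0.0:
        hwe_p = 1.0  # monomorphic SNP: no deviation can be tested
    else:
        # Genotype counts expected under Hardy-Weinberg equilibrium.
        expected = (
            n_called * freq_a ** 2,
            2 * n_called * freq_a * (1 - freq_a),
            n_called * (1 - freq_a) ** 2,
        )
        observed = (n_aa, n_ab, n_bb)
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        # Survival function of the chi-square distribution with 1 degree of freedom.
        hwe_p = math.erfc(math.sqrt(chi2 / 2))

    passed = maf > MAF_MIN and hwe_p > HWE_P_MIN and call_rate >= CALL_RATE_MIN
    return maf, call_rate, hwe_p, passed
```

A SNP whose genotype counts match the Hardy-Weinberg expectation and that has no missing calls passes all three filters; a SNP called as heterozygous in every sample fails on the HWE P-value.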
  
-==== Imputation ====
-
-To get more information about the genome of the participants and therefore to scale up the number of SNPs, the genotype data generated using the arrays can be imputed against reference genomes obtained by means of whole genome sequencing. IMPUTE2 (ref) is a program that can predict what the missing SNPs will be based on known SNPs or haplotypes (a combination of SNPs) by mapping the known genotypes against reference genomes. See Figure 1 for illustration.\\
-
-{{:imputation2.png?600|}}\\
-Figure 1: Most common scenario in which imputation is used: unobserved genotypes (red question marks) in a set of study individuals are imputed (or predicted) using a set of reference haplotypes and genotypes from a SNP chip. Figure taken from ((https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#home))\\
+===== Imputation =====
+
+To obtain more information about the participants' genomes, and thereby scale up the number of SNPs, the genotype data generated using the arrays were imputed against reference genomes obtained by whole-genome sequencing. IMPUTE2 (ref) is a program that predicts missing SNPs from known SNPs or haplotypes (combinations of SNPs) by mapping the known genotypes against reference genomes.
  
 Before imputation, the genotypes were pre-phased using SHAPEIT2 (ref) and aligned to the reference panels using Genotype Harmonizer (www.molgenis.org/systemsgenetics) in order to resolve strand issues. Generation of cleaned pedigree files and in- and output files for different imputation algorithms: Pedigree files are created in PLINK binary format. Genotypes of SNPs not directly genotyped on the Cyto12-SNP chip, but genotyped in another study, can be inferred by imputation.
 Imputation analysis is performed through Beagle 3.1.0 and data in these formats will be made available.
  
-The samples were imputed using Minimac ((Howie B Fuchsberger C Stephens M Marchini J Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet  2012;44:955–59))  version 2012.10.3 against human reference genomes, i.e. the Genome of The Netherlands (GoNL) release 5 ((Genome of The Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet  2014;46:818–25)) and the 1000 Genomes phase1 v3 ((1000 Genomes Project Consortium, Abecasis GR Altshuler Det al. A map of human genome variation from population-scale sequencing. Nature  2010;467:1061–73)) reference panels. 1000G is a human reference panel which contains world-wide collected genomes. Since only a small population of Dutch people (from the North) are part of the 1000G panel, the genotyped SNP dataset from the array was also imputed against GoNL. This is a reference panel that contains genomes from individuals with Dutch ancestry (with parents born in the Netherlands). 165 genomes in this panel were collected from Lifelines participants, meaning that genome data from Northern-Dutch people are available. The GoNL panel showed more accurate genotypes when imputing European samples compared with the 1000 Genomes Phase1 v3 reference panel. The MOLGENIS compute ((Byelas H Dijkstra M Neerincx Pet al. Scaling bio-analyses from computational clusters to grids. Proceedings of Fifth International Workshop on Science Gateways (IWSG), 2013, 3–5 June. Zurich, Switzerland , 2013)) imputation pipeline was used to generate and monitor our job scripts on the distributed file system.
+The samples were imputed using Minimac((Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 2012;44:955–59)) version 2012.10.3 against two human reference panels:
+  * the Genome of The Netherlands (GoNL) release 5((Genome of The Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet 2014;46:818–25))
+  * the 1000 Genomes phase1 v3((1000 Genomes Project Consortium, Abecasis GR, Altshuler D, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467:1061–73))
+
+1000G is a human reference panel that contains genomes collected worldwide. Since only a small number of Dutch people (from the North) are part of the 1000G panel, the genotyped SNP dataset from the array was also imputed against GoNL. This is a reference panel that contains genomes from individuals with Dutch ancestry (with parents born in the Netherlands). 165 genomes in this panel were collected from Lifelines participants, meaning that genome data from Northern-Dutch people are available. The GoNL panel showed more accurate genotypes when imputing European samples compared with the 1000 Genomes Phase1 v3 reference panel. The MOLGENIS compute((Byelas H, Dijkstra M, Neerincx P, et al. Scaling bio-analyses from computational clusters to grids. Proceedings of Fifth International Workshop on Science Gateways (IWSG), 3–5 June 2013, Zurich, Switzerland)) imputation pipeline was used to generate and monitor our job scripts on the distributed file system.
  
 In summary:\\
  
--1000G imputed: missing genotypes of participants are imputed (or predicted) based on 1000G reference panel (taken from worldwide population) and the available generated genotypes of the SNP array\\
-
--GoNL imputed: missing genotypes of participants are imputed (or predicted) based on the GoNL reference panel (taken from the Dutch population) and the available generated genotypes of the SNP array\\
-
--Unimputed: Genotypes determined on the basis of the SNP array, not imputed against genomic reference panels.
-
-==== Non-Caucasian samples ====
-To prevent false-positive association, non-Caucasian samples are excluded.
+  * 1000G imputed: missing genotypes of participants are imputed (or predicted) based on the 1000G reference panel (taken from the worldwide population) and the genotypes generated with the SNP array\\
+  * GoNL imputed: missing genotypes of participants are imputed (or predicted) based on the GoNL reference panel (taken from the Dutch population) and the genotypes generated with the SNP array\\
+  * Unimputed: genotypes determined on the basis of the SNP array, not imputed against genomic reference panels.
+
+===== Non-Caucasian samples =====
+To prevent false-positive associations, non-Caucasian samples are excluded.
 Non-Caucasian samples are identified by:
-• The LifeLines phenotype database (self-report)\\
-
-• Outlier (IBS) analysis\\
-
-• Population stratification (using Eigenstrat)
-
-
-==== Cryptic relationships ====
+  * The LifeLines phenotype database (self-report)\\
+  * Outlier (IBS) analysis\\
+  * Population stratification (using Eigenstrat)
+
+===== Cryptic relationships =====
 Samples are selected using self-reported family relations. After cleaning of the data, samples are compared with each other to determine their relationship by genetic similarity. If a pair of samples is identified as first-degree relatives, the sample with the best genotyping quality is included.
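
The selection rule for related pairs described above can be sketched as follows. This is a minimal illustration, not the procedure used for the release: the function ''prune_related'' is hypothetical, the list of first-degree pairs is assumed to come from an earlier genetic-similarity step, and per-sample call rate is assumed as the measure of genotyping quality.

```python
def prune_related(pairs, call_rate):
    """Return the set of sample IDs to exclude.

    pairs:     iterable of (id_a, id_b) tuples flagged as first-degree relatives
    call_rate: dict mapping sample ID to its genotype call rate (assumed quality measure)
    """
    excluded = set()
    for id_a, id_b in pairs:
        if id_a in excluded or id_b in excluded:
            continue  # this pair is already resolved by an earlier exclusion
        # Exclude the sample with the lower call rate; on a tie, exclude the second ID.
        worse = id_a if call_rate[id_a] < call_rate[id_b] else id_b
        excluded.add(worse)
    return excluded
```

For example, with pairs ''(s1, s2)'' and ''(s2, s3)'' and call rates of 0.99, 0.97 and 0.98, only ''s2'' is excluded: removing it resolves both pairs, so ''s1'' and ''s3'' are kept.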
  
Line 96: Line 89:
 ===== Releasing SNP genotype data =====
 The following files are available through the Lifelines workspace or HPC:
-
-files with phenotype data
-
-files with genotyped and imputed data
-
-quality control files:
- -list of samples excluded
- -list of SNPs excluded
- -PCA component file
+  * files with phenotype data
+  * files with genotyped and imputed data
+  * quality control files:
+      * list of samples excluded
+      * list of SNPs excluded
+      * PCA component file
  
 ===== Abbreviations =====
Line 114: Line 104:
 | PLINK  | PLINK is a command line program written in C/C++  |
  
- 
- 
-''''​Monospaced Text''''​ 
gwas.1562858910.txt.gz · Last modified: 2019/11/07 16:07 (external edit)