3 Imputed Genomics
This page provides considerations for A2CPS projects that involve the Imputed Genomics Data.
The imputed genomic dataset expands single-nucleotide resolution variant data across the human genome, offering deeper and more comprehensive coverage. Genotype imputation aligns SNP array data with ancestry-specific reference haplotypes to infer untyped variants with high confidence. This process can be compared to filling in missing letters in a sentence based on contextual knowledge — for example:
“T_e l_zy d_g j___ed _ver the _at.”
Following imputation, the dataset increased from 690,126 to 11,016,319 genetic loci, spanning all 22 autosomes and the X chromosome for 1,375 participants. Prior to imputation, data were preprocessed and formatted for compatibility with the Imputation Server. Post-imputation, we applied rigorous quality control, excluding low-confidence variants and those with a minor allele frequency (MAF) below 1%.
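To put the two variant counts above in proportion, the fold increase can be worked out directly. This is only arithmetic on the figures quoted in the paragraph; the variable names are illustrative.

```shell
# Variant counts quoted in the paragraph above.
before=690126
after=11016319

# Ratio of imputed to genotyped loci, rounded to one decimal place.
fold=$(awk -v b="$before" -v a="$after" 'BEGIN {printf "%.1f", a / b}')
echo "fold increase: $fold"
```

In other words, imputation yields roughly sixteen times as many loci as the genotyping array alone.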
3.1 Starting Project
3.1.1 Locate Data
On TACC, the data are stored underneath each data release. For example, the files for data release v2.1.0 are underneath
/corral-secure/projects/A2CPS/products/consortium-data/pre-surgery-release-2-1-0/omics/gene_variants
The single-nucleotide polymorphism files are in the omics/gene_variants folder:
$ ls /corral-secure/projects/A2CPS/products/consortium-data/pre-surgery-release-2-1-0/omics/gene_variants
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2.bed
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2.bim
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2.fam
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2_Imputed_Genomics.bed
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2_Imputed_Genomics.bim
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2_Imputed_Genomics.fam
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2_Imputed_Genomics.log
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2.log
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2.nosex
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2_PCA_Ind_Info.csv
These files are in the PLINK binary format (.bed, .bim, .fam). The files with the _Imputed_Genomics suffix hold the imputed data.
For more detail on PLINK binary files and on extracting genetic variant data, refer to the Genetic Variant starter kit.
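As a quick sanity check on a PLINK binary fileset, the text companions of the .bed file can be inspected with standard tools: the .bim file has one line per variant and the .fam file has one line per sample. The sketch below uses small toy files with hypothetical contents so that it runs anywhere; on TACC, the same awk commands could be pointed at the release .bim and .fam files instead.

```shell
# Toy .bim and .fam files (hypothetical contents) standing in for a release fileset.
toy_dir=$(mktemp -d)
cat > "$toy_dir/toy.bim" <<'EOF'
1 rs1 0 1000 A G
1 rs2 0 2000 C T
23 rs3 0 1500 G A
EOF
cat > "$toy_dir/toy.fam" <<'EOF'
F1 S1 0 0 1 -9
F2 S2 0 0 2 -9
EOF

# One line per variant in .bim, one line per sample in .fam.
n_variants=$(awk 'END {print NR}' "$toy_dir/toy.bim")
n_samples=$(awk 'END {print NR}' "$toy_dir/toy.fam")
echo "variants: $n_variants, samples: $n_samples"

# Variants per chromosome (column 1 of the .bim file).
per_chr=$(awk '{count[$1]++} END {for (c in count) print c, count[c]}' "$toy_dir/toy.bim" | sort -n)
echo "$per_chr"
```

Run against the imputed fileset, the two totals can be compared with the locus and participant counts quoted in the introduction.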
3.2 Imputation Process
The procedures below were performed on these files:
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2.bed
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2.bim
2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2.fam
(these files originally had base name A2CPS_Freeze_2, as reflected in the code below)
3.2.1 Pre-Imputation Data Preparation
The data were prepared following the process outlined in the TOPMed Imputation Server Data Preparation guide.
Pre-Imputation Data Processing Bash Script, which uses PLINK 1.9:
mkdir A2CPS_RL1-5_For_Imputation_F2
# 1. Subset out Y Chromosome and MT DNA
mkdir A2CPS_RL1-5_For_Imputation_F2/Plink_Merged
./plink --bfile Freeze_2/A2CPS_Freeze_2 --chr 1-23 --keep-allele-order --make-bed --out A2CPS_RL1-5_For_Imputation_F2/Plink_Merged/A2CPS_RL1-5_For_Imputation_F2
# 2. Run TOPMed Panel Check
./plink --freq --bfile A2CPS_RL1-5_For_Imputation_F2/Plink_Merged/A2CPS_RL1-5_For_Imputation_F2 --out A2CPS_RL1-5_For_Imputation_F2/Plink_Merged/A2CPS_RL1-5_For_Imputation_F2
perl HRC-1000G-check-bim-v4.3.0/HRC-1000G-check-bim.pl -b A2CPS_RL1-5_For_Imputation_F2/Plink_Merged/A2CPS_RL1-5_For_Imputation_F2.bim -f A2CPS_RL1-5_For_Imputation_F2/Plink_Merged/A2CPS_RL1-5_For_Imputation_F2.frq -r CreateTOPMed/PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab -h
sh ./A2CPS_RL1-5_For_Imputation_F2/Plink_Merged/Run-plink.sh
# 3. Sort GZ.VCF Files by Genomic Position
mkdir A2CPS_RL1-5_For_Imputation_F2/GZ_VCF_By_Chr_Sorted
for chr in {1..23}; do
bcftools sort ./A2CPS_RL1-5_For_Imputation_F2/Plink_Merged/A2CPS_RL1-5_For_Imputation_F2-updated-chr${chr}.vcf.gz -Oz -o ./A2CPS_RL1-5_For_Imputation_F2/GZ_VCF_By_Chr_Sorted/A2CPS_RL1-5_chr${chr}_sorted.vcf.gz
done
# 4. Add chr in front of chromosome number: "If your input data is GRCh38/hg38, please ensure chromosomes are encoded with prefix 'chr' (e.g. chr20)."
mkdir A2CPS_RL1-5_For_Imputation_F2/GZ_VCF_By_Chr_Sorted_With_Chr
for chr in {1..23}; do
zcat ./A2CPS_RL1-5_For_Imputation_F2/GZ_VCF_By_Chr_Sorted/A2CPS_RL1-5_chr${chr}_sorted.vcf.gz | awk 'BEGIN {OFS="\t"} {if($0 !~ /^#/) $1="chr"$1; print}' | bgzip -c > ./A2CPS_RL1-5_For_Imputation_F2/GZ_VCF_By_Chr_Sorted_With_Chr/A2CPS_RL1-5_chr${chr}_sorted_with_chr.vcf.gz
done
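To see what the awk filter in step 4 does without needing bcftools or bgzip, it can be applied to a small uncompressed VCF. The toy records below are hypothetical; the awk program itself is the one used in the script above.

```shell
# Toy VCF (hypothetical records) written to a temporary file.
toy_vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$toy_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$toy_vcf"
printf '1\t1000\trs1\tA\tG\t.\t.\t.\n' >> "$toy_vcf"
printf '1\t2000\trs2\tC\tT\t.\t.\t.\n' >> "$toy_vcf"

# Same awk program as step 4: header lines (starting with #) pass through
# unchanged; data lines get "chr" prepended to the first column.
prefixed=$(awk 'BEGIN {OFS="\t"} {if($0 !~ /^#/) $1="chr"$1; print}' "$toy_vcf")
printf '%s\n' "$prefixed"

# Count the data lines that now carry the prefix.
n_prefixed=$(printf '%s\n' "$prefixed" | awk '!/^#/ && $1 ~ /^chr/ {n++} END {print n + 0}')
echo "data lines with chr prefix: $n_prefixed"
```

Because header lines are passed through untouched, any chromosome names that appear in the VCF header (for example in contig lines) are not renamed by this filter.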
3.2.2 Imputation
After preparation, the files were submitted to the TOPMed Imputation Server.
3.2.3 Post-Imputation Quality Control
Imputed variants with an imputation quality score (R²) less than 0.3 were excluded.
Imputed variants with minor allele frequencies (MAF) below 0.01 were excluded to remove extremely rare variants.
Imputed variants lacking rsIDs (denoted as ‘.’) or sharing duplicated genomic positions were excluded, as these caused issues during PLINK merging.
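The first two criteria, together with the rsID check, amount to a row filter on per-variant metrics. The following is a minimal sketch on a toy table with hypothetical values (columns: ID, MAF, R2); it is not the pipeline's own code, and it does not cover the duplicated-position check.

```shell
# Toy variant table (hypothetical values). Columns: ID, MAF, R2.
qc_input=$(mktemp)
cat > "$qc_input" <<'EOF'
rs1 0.20 0.95
rs2 0.005 0.99
rs3 0.15 0.25
. 0.30 0.90
rs5 0.02 0.31
EOF

# Keep variants with R2 > 0.3 and MAF > 0.01 that also have an rsID.
kept=$(awk '$3 > 0.3 && $2 > 0.01 && $1 != "." {print $1}' "$qc_input")
echo "$kept"
```

Here rs2 fails on MAF, rs3 fails on R², and the fourth row is dropped because its identifier is ".".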
3.2.4 Considerations while working with these data
Population structure remains in the imputed data and should be accounted for in downstream analyses.
Researchers can choose to filter out variants based on other, standard QC measures that suit their studies.
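As one illustration of such additional filtering, PLINK 1.9 provides the --geno, --hwe, and --mind flags for variant missingness, Hardy-Weinberg equilibrium, and sample missingness. The thresholds and the output name below are assumptions chosen for the example, not recommendations from the consortium; the PLINK documentation describes how each flag treats phenotype and founder status.

```shell
# Illustrative thresholds only; choose values appropriate to the study.
geno_max=0.05   # drop variants missing in more than 5% of samples
hwe_p=1e-6      # drop variants with a Hardy-Weinberg test p-value below 1e-6
mind_max=0.05   # drop samples missing more than 5% of genotypes

# Input is the imputed release fileset; the output prefix is hypothetical.
in_prefix="2025-06-02_UCSD_GV_Genotypes_Runlists_1-5_QC_freeze2_Imputed_Genomics"
out_prefix="A2CPS_Imputed_Genomics_extra_qc"

qc_cmd="plink --bfile $in_prefix --geno $geno_max --hwe $hwe_p --mind $mind_max --make-bed --out $out_prefix"
echo "$qc_cmd"

# Run only when PLINK 1.9 is on the PATH and the input fileset is present.
if command -v plink >/dev/null 2>&1 && [ -f "$in_prefix.bed" ]; then
    $qc_cmd
fi
```

The command is printed first so that it can be reviewed before anything is written to disk.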
Post-Imputation Quality Control Bash Script, which uses PLINK 1.9 (./plink) and PLINK 2 (./plink2):
# 1. Run post_imputation_qc.R
# This creates update_ids.txt, update_sexes.txt, update_sexes_X.txt, and chr_${chr}/keep_ranges.txt
# 2. MAKE Binary Files
for chr in {1..22}; do
./plink2 --vcf chr_${chr}/chr${chr}.dose.vcf.gz dosage=DS --make-bed --out chr_${chr}/chr${chr}_dose
done
# 3. UPDATE IDs
for chr in {1..22}; do
./plink2 --bfile chr_${chr}/chr${chr}_dose --update-ids update_ids.txt --make-bed --out chr_${chr}/chr${chr}_dose_updated_IDs
done
# 4. UPDATE SEXES
for chr in {1..22}; do
./plink2 --bfile chr_${chr}/chr${chr}_dose_updated_IDs --update-sex update_sexes.txt --make-bed --out chr_${chr}/chr${chr}_dose_updated_sexes
done
# 5. Keep variants with (MAF > 0.01 and R2 > 0.3)
for chr in {1..22}; do
./plink2 --bfile chr_${chr}/chr${chr}_dose_updated_sexes --extract range chr_${chr}/keep_ranges.txt --make-bed --out chr_${chr}/chr${chr}_dose_qced
done
# X CHR #
# 6. MAKE Binary Files
./plink2 --vcf chr_X/chrX.dose.vcf.gz dosage=DS --make-bed --update-sex update_sexes_X.txt --out chr_X/chrX_dose
# 7. Update IDs
./plink2 --bfile chr_X/chrX_dose --make-bed --update-ids update_ids.txt --out chr_X/chrX_updated_IDs
# 8. Keep only variants that passed QC
./plink2 --bfile chr_X/chrX_updated_IDs --extract range chr_X/keep_ranges.txt --make-bed --out chr_X/chrX_dose_qced
# 9. Rename each file
mkdir ./Imputation_QCed
for chr in {1..22} X; do
mv chr_${chr}/chr${chr}_dose_qced.bed Imputation_QCed/chr${chr}_imputed_qced.bed
mv chr_${chr}/chr${chr}_dose_qced.bim Imputation_QCed/chr${chr}_imputed_qced.bim
mv chr_${chr}/chr${chr}_dose_qced.fam Imputation_QCed/chr${chr}_imputed_qced.fam
mv chr_${chr}/chr${chr}_dose_qced.log Imputation_QCed/chr${chr}_imputed_qced.log
done
# 10. Identify SNPs that were duplicated during Imputation or have "." as their rsID
Rscript duplicate.R
# 11. Drop Duplicate and "." SNPs. Merging files for a combined imputed dataset does not function with duplicates for genomic position or identifiers
mkdir ./Imputation_QCed/No_Duplicates
for chr in {1..22} X; do
./plink --bfile Imputation_QCed/chr${chr}_imputed_qced --exclude Imputation_QCed/duplicate_snp_list.txt --make-bed --out Imputation_QCed/No_Duplicates/chr${chr}_No_Duplicates
done
# 12. Drop Variants that are impossibly called for X Chromosome
awk '{print $3}' Imputation_QCed/No_Duplicates/chrX_No_Duplicates.hh | sort -u > Imputation_QCed/hh_snps_x_to_exclude.txt
./plink --bfile Imputation_QCed/No_Duplicates/chrX_No_Duplicates --exclude Imputation_QCed/hh_snps_x_to_exclude.txt --make-bed --out Imputation_QCed/No_Duplicates/chrX_imputed_qced_no_hh
# 13. Merge Chromosomes 1 - 22, X
mkdir ./Imputation_QCed/Final
./plink --bfile Imputation_QCed/No_Duplicates/chr1_No_Duplicates --merge-list Imputation_QCed/a2cps_imputed_qc_merge_list.txt --make-bed --out Imputation_QCed/Final/A2CPS_Imputed_Genomics
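The merge list passed to --merge-list in step 13 is not shown above. PLINK 1.9 accepts a text file with one fileset prefix per line, so a plausible reconstruction lists chromosomes 2 to 22 (chromosome 1 is already given by --bfile) followed by the X chromosome fileset from step 12. The sketch below is an assumption about that file's contents, written to a temporary directory rather than to Imputation_QCed.

```shell
# Hypothetical reconstruction of the merge list: one fileset prefix per line.
merge_dir=$(mktemp -d)   # stand-in for Imputation_QCed
merge_list="$merge_dir/a2cps_imputed_qc_merge_list.txt"
: > "$merge_list"

# Chromosomes 2-22 use the de-duplicated filesets from step 11.
chr=2
while [ "$chr" -le 22 ]; do
    echo "Imputation_QCed/No_Duplicates/chr${chr}_No_Duplicates" >> "$merge_list"
    chr=$((chr + 1))
done

# The X chromosome uses the fileset written by step 12.
echo "Imputation_QCed/No_Duplicates/chrX_imputed_qced_no_hh" >> "$merge_list"

n_entries=$(awk 'END {print NR}' "$merge_list")
echo "merge list entries: $n_entries"
```

The list has 22 entries: 21 autosomes plus the X chromosome.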
3.2.4.1 Supplementary QC Files: RScripts
Supplementary RScript #1: Variant-Level QC Based on Quality Metrics (post_imputation_qc.R)
library(data.table)
setwd("./Imputation_Results/")
chromosomes <- c(1:22, "X")
for (chr in chromosomes) {
  folder <- paste0("chr_", chr)
  file_path <- file.path(folder, paste0("chr", chr, ".info.gz"))
  dose <- fread(file_path)
  dose[, c("AF", "MAF", "AVG_CS", "R2") := {
    af <- sub(".*AF=([^;]*).*", "\\1", INFO)
    maf <- sub(".*MAF=([^;]*).*", "\\1", INFO)
    avg.cs <- sub(".*AVG_CS=([^;]*).*", "\\1", INFO)
    r2 <- sub(".*R2=([^;]*).*", "\\1", INFO)
    list(as.numeric(af), as.numeric(maf), as.numeric(avg.cs), as.numeric(r2))
  }]
  maf.001 <- dose[MAF > 0.01]
  r2.03 <- maf.001[R2 > 0.3]
  setnames(r2.03, "#CHROM", "CHROM", skip_absent=TRUE)
  r2.03[, chr := sub("^chr", "", CHROM)]
  r2.03[, start := POS]
  r2.03[, end := POS]
  keep_ranges <- r2.03[, .(chr, start, end)]
  fwrite(keep_ranges, file.path(folder, "keep_ranges.txt"), sep = "\t", col.names = FALSE, quote = FALSE)
}
# Update IDs
fam <- fread("./chr_1/chr1_dose.fam")
OldFID <- fam$V1
OldIID <- fam$V2
NewFID <- sub("^(.*?)_.*", "\\1", OldIID)
NewIID <- sub(".*?_(.*)", "\\1", OldIID)
update_ids <- data.table(OldFID, OldIID, NewFID, NewIID)
fwrite(update_ids, "update_ids.txt", sep = "\t", col.names = FALSE, quote = FALSE)
# Update Sexes
fam.og <- fread("./A2CPS_RL1-5_For_Imputation_F2.fam")
update_sexes <- data.table(fam.og$V1, fam.og$V2, fam.og$V5)
fwrite(update_sexes, "update_sexes.txt", sep = "\t", col.names = FALSE, quote = FALSE)
fam.og <- fread("./A2CPS_RL1-5_For_Imputation_F2.fam")
fam.now <- fread("./chr1_dose.fam")
update_sexes <- data.table(fam.now$V1, fam.now$V2, fam.og$V5)
fwrite(update_sexes, "update_sexes_X.txt", sep = "\t", col.names = FALSE, quote = FALSE)
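The "Update IDs" part of the script above splits each post-imputation individual ID at its first underscore: the text before it becomes the new family ID and the text after it becomes the new individual ID. The same split can be sketched with shell parameter expansion; the sample ID below is hypothetical.

```shell
# Hypothetical combined ID of the form FID_IID; the IID itself contains an underscore.
old_iid="F001_S001_A"

# Text before the first underscore -> family ID; text after it -> individual ID.
new_fid=${old_iid%%_*}
new_iid=${old_iid#*_}

# Four columns per row, matching the update_ids.txt layout written by the R script:
# OldFID, OldIID, NewFID, NewIID (the old FID here is a placeholder value).
printf '%s\t%s\t%s\t%s\n' "0" "$old_iid" "$new_fid" "$new_iid"
```

Splitting at the first underscore keeps any later underscores inside the individual ID, which matches the lazy regular expressions used in the R code.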
Supplementary RScript #2: Merge Preparation (duplicate.R)
library(data.table)
setwd("./Imputation_Results/Imputation_QCed/")
combined_data <- data.table(V1=character(), V2=character(), V4=integer())
for (chr in c(1:22, "X")) {
  file_name <- paste0("chr", chr, "_imputed_qced.bim")
  dt <- fread(file_name, select = c(1,2,4), header = FALSE)
  combined_data <- rbindlist(list(combined_data, dt), use.names = TRUE, fill = TRUE)
}
dup_pos <- combined_data[, .N, by=.(V1, V4)][N > 1]
snps_dup_pos <- combined_data[dup_pos, on=.(V1, V4), nomatch=0, V2]
dup_snps <- combined_data[, .N, by=V2][N > 1, V2]
snps_to_exclude <- unique(c(snps_dup_pos, dup_snps))
fwrite(as.data.table(snps_to_exclude), "duplicate_snp_list.txt", col.names = FALSE)
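The logic of the duplicate check can also be tried on a small example without R. The sketch below mirrors the two rules in the script above (a chromosome and position pair seen more than once, or a variant ID seen more than once, which includes a repeated ".") using a two-pass awk over a toy .bim file with hypothetical contents.

```shell
# Toy .bim file (hypothetical contents): chr, ID, cM, position, allele 1, allele 2.
toy_bim=$(mktemp)
cat > "$toy_bim" <<'EOF'
1 rs1 0 1000 A G
1 rs2 0 1000 A T
1 rs3 0 3000 C T
2 . 0 500 G A
2 . 0 900 G C
2 rs3 0 700 T C
2 rs6 0 800 A C
EOF

# Pass 1 counts each (chromosome, position) pair and each ID; pass 2 prints the
# ID of every row whose position or ID occurred more than once.
dups=$(awk 'NR == FNR {pos[$1 FS $4]++; id[$2]++; next}
            pos[$1 FS $4] > 1 || id[$2] > 1 {print $2}' "$toy_bim" "$toy_bim" | LC_ALL=C sort -u)
echo "$dups"

n_dups=$(echo "$dups" | awk 'END {print NR}')
echo "IDs to exclude: $n_dups"
```

In this toy input, rs1 and rs2 share a position, rs3 and "." are repeated IDs, and rs6 is the only variant that would be kept.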
3.2.5 Citations
In publications or presentations including data from A2CPS, please include the following statement as attribution:
Data were provided [in part] by the A2CPS Consortium funded by the National Institutes of Health (NIH) Common Fund, which is managed by the Office of the Director (OD)/ Office of Strategic Coordination (OSC). Consortium components and their associated funding sources include Clinical Coordinating Center (U24NS112873), Data Integration and Resource Center (U54DA049110), Omics Data Generation Centers (U54DA049116, U54DA049115, U54DA049113), Multi-site Clinical Center 1 (MCC1) (UM1NS112874), and Multi-site Clinical Center 2 (MCC2) (UM1NS118922).