indoaryan.com

Decoding Indo-Aryan Prehistory Through Genetic Data

qpAdm Tutorial for South Asian Populations

Introduction

qpAdm is a powerful tool used in population genetics to estimate the proportions of ancestry from different source populations in a target population. It is part of the ADMIXTOOLS software package developed by David Reich’s lab at Harvard University. qpAdm is particularly important in studies of population history because it helps researchers understand the genetic contributions of various ancestral populations to modern groups.

What You Will Learn

By the end of this tutorial, you will learn:

  • The fundamental parameters and assumptions of qpAdm.
  • How to set up the necessary software tools and prepare your data.
  • The process of running a qpAdm analysis and interpreting the results.
  • Common issues and best practices for qpAdm analysis.

Assumptions

This tutorial assumes that you have:

  • Basic knowledge of Linux commands

Tools Used

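Here are the tools and resources you will need for this tutorial:

  • AdmixTools (provides qpAdm, qpfstats, convertf, and mergeit)
  • Plink 1.90
  • The 1240k Eigenstrat dataset
  • Your raw data file from 23andMe or AncestryDNA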

Understanding qpAdm Parameters

In qpAdm analysis, understanding the key parameters and their roles is essential for setting up your model accurately. Let’s break down the main components: the target population, the source populations, and the right populations.

Target Population

The target population is the group you are analyzing to understand its genetic ancestry. In this tutorial, we use “Sohi” as our target population. The target population is typically the one for which you want to determine the proportion of ancestry from various source populations.

  • The target is usually a present-day population or a recent historical population.
  • You should have high-quality genetic data for your target population.
  • The target population is assumed to be a mixture of the source populations.

Source Populations

Source populations are groups that you hypothesize have contributed to the ancestry of the target population. In qpAdm, you list the source populations you want to test; an admixture model such as the one in this tutorial uses two or more. These populations should be genetically distinct and relevant to the historical context of the target population.

  • Source populations should be older than or contemporaneous with the target population.
  • They should represent potential ancestral populations based on archaeological, historical, or genetic evidence.
  • The more distinct the source populations are from each other, the better qpAdm can distinguish their contributions.

For this tutorial, our example source populations are:

  • Iran_ShahrISokhta_BA2: Representing ancient populations from Iran.
  • Kazakhstan_Andronovo.SG: Reflecting ancient populations from Central Asia.
  • Turkmenistan_Gonur_BA_1: Representing Bronze Age populations from Turkmenistan.

Right Populations (Right Pops)

Right populations, also known as right pops or outgroups, serve as reference points to help model the ancestry of the target population. They are crucial for distinguishing the ancestry contributions from different source populations. The right populations typically include a variety of ancient and modern populations that did not contribute directly to the target population. The first population in the list is conventionally a basal African population, Mbuti.DG in this case. For this tutorial, our example right populations are:

  • Mbuti.DG: An African basal population.
  • China_Tianyuan: An ancient East Asian population.
  • Karitiana.DG: A Native American population.
  • Russia_Ust_Ishim_HG.DG: An ancient Siberian population.
  • Ami.DG: A Taiwanese aboriginal population.
  • Dai.DG: A Southeast Asian population.
  • Turkey_N: An ancient population from Anatolia.
  • Georgia_Kotias.SG: A prehistoric population from the Caucasus.
  • Russia_Kostenki14.SG: An Upper Paleolithic European population.
  • Iran_GanjDareh_N: An early Neolithic population from Iran.

Example of a Standard qpAdm Model

A standard qpAdm model can be expressed as follows:

Target population (Target) = source population 1 (Source 1) + source population 2 (Source 2)

The qpAdm output will provide:

  • A p-value (also called tail probability or tailprob), which indicates the statistical fit of the model.
  • Admixture coefficients (x and y) for Source 1 and Source 2, respectively, such that x + y = 1 (or 100%).
  • Standard errors for the admixture coefficients, indicating the precision of the estimates.

Successful Model Criteria

For a qpAdm model to be considered successful, it should meet the following criteria:

  • High p-value: Typically above 0.05, meaning the model is not rejected and fits the observed data adequately.
  • Low standard errors: Indicating precise estimates of the admixture coefficients.
  • Positive admixture coefficients: Every coefficient should lie between 0 and 1; a negative coefficient makes the model infeasible.
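
The first and third criteria can be checked mechanically once you have read the numbers off the qpAdm output. The function below is a minimal sketch, not part of AdmixTools: check_model is a hypothetical helper name, the values are typed in by hand, and standard errors are left for you to judge.

```shell
# check_model P COEF [COEF...]
# Hypothetical helper (not part of AdmixTools).
# Prints "pass" when the p-value is above 0.05 and every admixture
# coefficient lies between 0 and 1; otherwise prints the reason for failure.
check_model() {
    p="$1"
    shift
    awk -v p="$p" -v coefs="$*" 'BEGIN {
        if (p + 0 <= 0.05) { print "fail: p-value not above 0.05"; exit }
        n = split(coefs, c, " ")
        for (i = 1; i <= n; i++) {
            if (c[i] + 0 < 0 || c[i] + 0 > 1) {
                print "fail: coefficient " c[i] " outside [0, 1]"
                exit
            }
        }
        print "pass"
    }'
}
```

For example, `check_model 0.897671 0.669 0.331` uses the p-value and coefficients reported for the two-source model later in this tutorial and prints `pass`.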

Setting Up the Environment

Before running qpAdm, you need to set up the necessary software tools and environment. This section will guide you through installing AdmixTools and setting up Plink.

Installing AdmixTools

  1. Download and extract AdmixTools:
    • Visit the GitHub repository and download the latest version of AdmixTools.
    • Extract the downloaded files to a directory on your computer.
  2. Navigate to the src/ directory:
    cd AdmixTools-master/src
  3. Install dependencies:
    sudo apt-get install build-essential
    sudo apt-get install libgsl-dev
    sudo apt-get install libopenblas-dev
  4. Compile the programs:
    make clobber
    make all
    make install
  5. Test qpAdm:
    cd ../bin
    ./qpAdm

Setting Up Plink

  1. Download and extract Plink 1.90:
    • Visit the PLINK website and download the latest version of Plink 1.90.
  2. Copy the executables:
    • Extract the downloaded files and copy the Plink and prettify executables into the bin folder of your AdmixTools directory.
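
Before moving on, it can help to confirm that the programs used in the rest of this tutorial are actually present in the bin folder. The sketch below assumes the executables are named qpAdm, qpfstats, convertf, mergeit, and plink, as they are invoked in the commands in this tutorial; check_tools is a hypothetical helper name.

```shell
# check_tools DIR
# Hypothetical helper: reports which of the programs used in this tutorial
# are missing from DIR (or present but not executable).
# Returns 0 only when all of them are found.
check_tools() {
    dir="$1"
    missing=0
    for tool in qpAdm qpfstats convertf mergeit plink; do
        if [ ! -x "$dir/$tool" ]; then
            echo "missing: $tool"
            missing=1
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "all tools found"
    fi
    return "$missing"
}
```

Run it as `check_tools /home/cdr/AdmixTools-master/bin`, replacing the path with your own AdmixTools location.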

Preparing Your Data

Proper data preparation is crucial for a successful qpAdm analysis. This section will guide you through downloading necessary datasets, processing raw data files, converting to EIGENSTRAT format, and merging with reference datasets.

Downloading Necessary Datasets

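  1. Download the 1240k Eigenstrat database:
    • Download and extract the dataset. The parameter files later in this tutorial use the v54.1.p1_1240K_public release, which consists of .geno, .snp, and .ind files.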
  2. Create a dataset folder:
    • Create a new folder called dataset in the bin directory of your AdmixTools installation and copy the extracted dataset files into this folder.

Processing 23andMe and AncestryDNA Files

  1. Download and extract the 23andMe RAW DNA file:
    • Follow the instructions on the PLINK website to download and extract your 23andMe raw data file.
  2. Convert the raw data file to binary format:
    ./plink --23file 23andme_Sohi_v5.txt Sohi 1 --make-bed --out Sohi_23andme_merged

Prepare AncestryDNA Data File

  1. Combine 23andMe and AncestryDNA raw data files:
    • Refer to the guide on Genetic Lifehacks to combine your 23andMe and AncestryDNA raw data files.
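  2. Convert the AncestryDNA file to the 23andMe column layout. The command below joins the separate allele1 and allele2 columns into a single genotype column; it does not remove header lines itself, so delete them first if PLINK rejects the file: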
    awk 'BEGIN {FS="\t"};{print $1"\t"$2"\t"$3"\t"$4 $5}' AncestryDNA.txt > AncestryCombined.txt
  3. Convert the processed file to binary format:
    ./plink --23file AncestryCombined.txt Sohi 1 --make-bed --out Sohi_Ancestry_merged
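
As a variant of the awk step above, the sketch below also drops comment lines, the column-header row, and Windows line endings while joining the two allele columns. It assumes the AncestryDNA file is tab-separated, that comment lines start with #, and that the header row begins with rsid; ancestry_to_23andme is a hypothetical helper name, so check these assumptions against your own file.

```shell
# ancestry_to_23andme INPUT
# Hypothetical helper: writes a 23andMe-style four-column layout
# (rsid, chromosome, position, genotype) to standard output.
ancestry_to_23andme() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
         { sub(/\r$/, "") }       # remove a trailing carriage return, if any
         /^#/ { next }            # skip comment lines
         $1 == "rsid" { next }    # skip the column-header row
         { print $1, $2, $3, $4 $5 }' "$1"
}
```

Usage: `ancestry_to_23andme AncestryDNA.txt > AncestryCombined.txt`, after which the PLINK conversion step runs unchanged.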

Converting to EIGENSTRAT Format

  1. Create a parameter file for conversion:
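    • Create a new parameter file called convertf_param.par with the following content. The input file names must match the files PLINK actually wrote; the conversion command above uses the prefix Sohi_23andme_merged, so adjust the names below if you have no files with the _hh suffix: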
    genotypename: Sohi_23andme_merged_hh.bed
    snpname: Sohi_23andme_merged_hh.bim
    indivname: Sohi_23andme_merged_hh.fam
    outputformat: EIGENSTRAT
    genotypeoutname: Sohi_23andme_eigenstrat.geno
    snpoutname: Sohi_23andme_eigenstrat.snp
    indivoutname: Sohi_23andme_eigenstrat.ind
  2. Run the conversion:
    ./convertf -p convertf_param.par
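
Because the parameter file is just seven lines built from two file prefixes, it can be generated rather than typed, which keeps the input names in step with the PLINK output. This is a minimal sketch: write_convertf_par is a hypothetical helper name, and the parameter names are the ones shown in the file above.

```shell
# write_convertf_par PLINK_PREFIX OUT_PREFIX PARFILE
# Hypothetical helper: writes a convertf parameter file and warns on
# standard error if any of the PLINK input files does not exist.
write_convertf_par() {
    src_prefix="$1"
    dst_prefix="$2"
    par_file="$3"
    for ext in bed bim fam; do
        if [ ! -f "${src_prefix}.${ext}" ]; then
            echo "warning: ${src_prefix}.${ext} not found" >&2
        fi
    done
    cat > "$par_file" <<EOF
genotypename: ${src_prefix}.bed
snpname: ${src_prefix}.bim
indivname: ${src_prefix}.fam
outputformat: EIGENSTRAT
genotypeoutname: ${dst_prefix}.geno
snpoutname: ${dst_prefix}.snp
indivoutname: ${dst_prefix}.ind
EOF
}
```

For example, `write_convertf_par Sohi_23andme_merged Sohi_23andme_eigenstrat convertf_param.par` writes a file that reads the PLINK output created earlier under that prefix.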

Merging with Reference Datasets

  1. Create a merge parameter file:
    • Create a new parameter file called merge_param.par with the following details:
    geno1: /home/cdr/AdmixTools-master/bin/dataset/v54.1.p1_1240K_public.geno
    snp1: /home/cdr/AdmixTools-master/bin/dataset/v54.1.p1_1240K_public.snp
    ind1: /home/cdr/AdmixTools-master/bin/dataset/v54.1.p1_1240K_public.ind
    geno2: /home/cdr/AdmixTools-master/bin/Sohi_23andme_eigenstrat.geno
    snp2: /home/cdr/AdmixTools-master/bin/Sohi_23andme_eigenstrat.snp
    ind2: /home/cdr/AdmixTools-master/bin/Sohi_23andme_eigenstrat.ind
    genooutfilename: /home/cdr/AdmixTools-master/bin/Sohi_merged_with_1240K.geno
    snpoutfilename: /home/cdr/AdmixTools-master/bin/Sohi_merged_with_1240K.snp
    indoutfilename: /home/cdr/AdmixTools-master/bin/Sohi_merged_with_1240K.ind
    outputformat: EIGENSTRAT
  2. Run the merge:
    ./mergeit -p merge_param.par
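
qpAdm selects individuals by the population label in the third column of the .ind file, so it is worth confirming that the merged file really contains the label Sohi. The sketch below assumes the usual three-column EIGENSTRAT .ind layout (sample ID, sex, population label); count_label is a hypothetical helper name.

```shell
# count_label INDFILE LABEL
# Hypothetical helper: prints how many individuals in an EIGENSTRAT .ind
# file carry LABEL in the third (population) column.
count_label() {
    awk -v label="$2" '$3 == label { n++ } END { print n + 0 }' "$1"
}
```

Run `count_label Sohi_merged_with_1240K.ind Sohi`. A result of 0 means no individual carries that label, in which case the third column of the .ind file needs to be corrected before running qpfstats.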

Running qpAdm Analysis

This section provides a detailed procedure for setting up and running qpAdm.

Creating Parameter Files

  1. Create folders for fstat results:
    • Create new folders called fstat_23andme and fstat_ancestry in the bin directory.
  2. Create the parameter file for qpfstats:
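    • Create a file called parqpfstat.txt in the fstat_23andme folder with the following content. For the AncestryDNA run, create an equivalent file in fstat_ancestry with the paths changed accordingly: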
    DIR: /home/cdr/AdmixTools-master/bin
    S1: 10Oct21
    S1X: 10Oct21
    indivname: /home/cdr/AdmixTools-master/bin/Sohi_merged_with_1240K.ind
    snpname: /home/cdr/AdmixTools-master/bin/Sohi_merged_with_1240K.snp
    genotypename: /home/cdr/AdmixTools-master/bin/Sohi_merged_with_1240K.geno
    poplistname: /home/cdr/AdmixTools-master/bin/fstat_23andme/lista.txt
    fstatsoutname: /home/cdr/AdmixTools-master/bin/fstat_23andme/fstatsa.txt
    allsnps: YES
    inbreed: NO
    scale: NO

Setting Up Population Lists

  1. Create the population list file:
    • Create a file called lista.txt containing the label of every population you plan to use (target, sources, and right populations), one per line:
    Sohi
    Mbuti.DG
    Irula.DG
    Turkey_N
    Laos_LN_BA.SG
    China_Tianyuan
    Ami.DG
    Karitiana.DG
    Iran_GanjDareh_N
    Iran_C_SehGabi
    Iran_ShahrISokhta_BA1
    Iran_ShahrISokhta_BA2
    Turkmenistan_Gonur_BA_1
    Dai.DG
    Russia_Ust_Ishim_HG.DG
    Chukchi.DG
    Saami.DG
    Georgia_Kotias.SG
    Russia_Kostenki14.SG
    Russia_Tyumen_HG
    Russia_MLBA_Sintashta
    Russia_DevilsCave_N.SG
    Luxembourg_Loschbour.DG
    Czech_BellBeaker
    Kazakhstan_Central_Saka.SG
    Portugal_MN.SG
    Kazakhstan_Andronovo.SG

Running qpfstats and qpAdm

  1. Run qpfstats:
    ./qpfstats -p fstat_23andme/parqpfstat.txt > fstat_23andme/qpfstatlog.txt
    ./qpfstats -p fstat_ancestry/parqpfstat.txt > fstat_ancestry/qpfstatlog.txt
  2. Create the parameter file for qpAdm:
    • Create a file called parqpadm.txt with the following details:
    fstatsname: /home/cdr/AdmixTools-master/bin/fstat_ancestry/fstatsa.txt
    popleft: /home/cdr/AdmixTools-master/bin/fstat_ancestry/left.txt
    popright: /home/cdr/AdmixTools-master/bin/fstat_ancestry/right.txt
    details: YES
  3. Create the left and right population list files:
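    • Create left.txt and right.txt files with the following content. The first line of left.txt is the target population and the remaining lines are the source populations; right.txt lists the right populations: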
    left.txt:
    Sohi
    Iran_ShahrISokhta_BA2
    Kazakhstan_Andronovo.SG
    Turkmenistan_Gonur_BA_1
    right.txt:
    Mbuti.DG
    China_Tianyuan
    Karitiana.DG
    Russia_Ust_Ishim_HG.DG
    Ami.DG
    Dai.DG
    Turkey_N
    Georgia_Kotias.SG
    Russia_Kostenki14.SG
    Iran_GanjDareh_N
  4. Run qpAdm:
    ./qpAdm -p parqpadm.txt > sohi_qpadm_output.txt
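
Since the f-statistics are precomputed from lista.txt, every population in left.txt and right.txt has to appear there, and no population should be listed on both the left and the right. The sketch below checks both conditions with grep; missing_from_list and shared_pops are hypothetical helper names, and the files are assumed to hold one label per line.

```shell
# missing_from_list LISTA FILE...
# Hypothetical helper: prints every population named in FILE... that is
# absent from LISTA (exact, whole-line matches; blank lines are ignored).
missing_from_list() {
    lista="$1"
    shift
    cat "$@" | grep -v '^[[:space:]]*$' | grep -Fxv -f "$lista" || true
}

# shared_pops LEFT RIGHT
# Hypothetical helper: prints populations that occur in both files.
shared_pops() {
    grep -Fx -f "$1" "$2" || true
}
```

Run `missing_from_list lista.txt left.txt right.txt` and `shared_pops left.txt right.txt`, adjusting the paths to where your files live. No output from either command means the lists are consistent.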

Interpreting qpAdm Output

How to Read the qpAdm Output File

A successful qpAdm model meets the criteria listed under Successful Model Criteria: a p-value above 0.05, low standard errors, and admixture coefficients between 0 and 1.

To diagnose why a model may have failed, refer to the generated D-statistics, labeled gendstat in the output file. These statistics compare the fitted model with the actual target sample. Large Z-scores (above 3 or below -3) point to the right populations that the model fits poorly.
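
Scanning a long output file for large Z-scores by eye is error-prone, so a small filter can help. The sketch below rests on an assumption about the output layout that you should verify against your own file: that the relevant lines start with gendstat: and that the Z-score is the last field on the line. large_gendstats is a hypothetical helper name.

```shell
# large_gendstats FILE [THRESHOLD]
# Hypothetical helper: prints "gendstat:" lines whose last field has an
# absolute value above THRESHOLD (default 3).
# Assumes the Z-score is the last whitespace-separated field on the line.
large_gendstats() {
    awk -v t="${2:-3}" '$1 == "gendstat:" {
        z = $NF + 0
        if (z > t || z < -t) print
    }' "$1"
}
```

Usage: `large_gendstats sohi_qpadm_output.txt`. The populations named on the printed lines are candidates to examine when revising the model.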

Examples

Using 23andMe Data

[Figure: 23andMe Data Example]

Using AncestryDNA Data

[Figure: AncestryDNA Data Example]

From the 23andMe output, our model with pattern 000 (i.e., all three source populations retained) is infeasible even though the tail probability is 0.88841, which is > 0.05. This is because one of the admixture coefficients is negative. Pattern 001, in which the third source (Turkmenistan_Gonur_BA_1) is dropped and only Iran_ShahrISokhta_BA2 and Kazakhstan_Andronovo.SG remain, passes the test with a tail probability of 0.897671 and coefficients of 66.9% for BA2 and 33.1% for Andronovo. Standard errors are also low at 0.06, 0.056, and 0.069.
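
The pattern strings are easy to misread, so here is a small sketch that spells one out. It assumes, consistent with the example above, that each digit corresponds to a source in the order given in left.txt, with 0 meaning the source is kept and 1 meaning it is dropped; retained_sources is a hypothetical helper name.

```shell
# retained_sources PATTERN SOURCE...
# Hypothetical helper: prints the sources whose digit in PATTERN is 0.
# Assumes digit i refers to the i-th source and that 1 marks a dropped source.
retained_sources() {
    pattern="$1"
    shift
    for src in "$@"; do
        rest="${pattern#?}"            # pattern without its first digit
        digit="${pattern%"$rest"}"     # the first digit itself
        if [ "$digit" = "0" ]; then
            echo "$src"
        fi
        pattern="$rest"
    done
}
```

For example, `retained_sources 001 Iran_ShahrISokhta_BA2 Kazakhstan_Andronovo.SG Turkmenistan_Gonur_BA_1` prints the first two source names, one per line.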

Adjust the populations in left.txt and right.txt files and run the command again:

./qpAdm -p parqpadm.txt > sohi_qpadm_output.txt

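With the right combination of populations in left.txt, right.txt, and lista.txt, you can model your admixture. Every population named in left.txt or right.txt must also appear in lista.txt, so rerun qpfstats after adding a new population to lista.txt. Example: “qpadm model for Sohi” on Pastebin.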

Troubleshooting and Tips

Common Issues and Solutions

  • Low p-value: Adjust the right populations to ensure they serve as proper outgroups.
  • High standard errors: Ensure you have enough genetic data and that the populations are well-differentiated.
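  • Negative admixture coefficients: Check the left and right population lists for accuracy, and try a model without the source that receives the negative coefficient, as with pattern 001 in the example above.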

Best Practices for qpAdm Analysis

  • Always verify your dataset for completeness.
  • Use multiple outgroups to ensure robust results.
  • Cross-validate with different datasets to confirm findings.

Conclusion

This tutorial has provided a comprehensive guide to using qpAdm for South Asian population models. By following the steps outlined, you can set up and run qpAdm analyses, interpret the results, and troubleshoot common issues. qpAdm is a powerful tool in population genetics, and mastering it can provide valuable insights into the ancestry of populations.

Glossary

  • qpAdm: A tool used to estimate ancestry proportions from different source populations.
  • ADMIXTOOLS: A software package for analyzing population genetics data.
  • EIGENSTRAT: A data format used in population genetics.
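  • p-value: The probability of seeing a fit at least as poor as the observed one if the model were correct; values below 0.05 are usually taken as grounds to reject the model.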
  • Tail probability: Another term for p-value.
  • Admixture coefficients: Proportions of ancestry from source populations.
