
# Create genotype matrix

A genotype matrix is a numerical representation of genetic variants in which rows represent observations (such as samples or subjects), columns represent variants (such as SNVs or SNPs), and the values are genotype encodings:

* 0 — reference homozygous
* 1 — heterozygous
* 2 — alternate homozygous
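As an illustration of this encoding, the sketch below converts VCF-style `GT` strings into 0/1/2 values by counting alternate alleles. The function name `encode_genotype` is hypothetical, and returning `None` for missing or multi-allelic calls is an assumption made for the example, not documented behavior of the task.

```python
from typing import Optional


def encode_genotype(gt: str) -> Optional[int]:
    """Count alternate alleles in a biallelic, diploid VCF-style GT string."""
    # Treat phased ("|") and unphased ("/") separators the same way.
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        # Missing or multi-allelic call; how the task handles these is
        # not stated in this page, so None is an assumption.
        return None
    return sum(int(a) for a in alleles)


calls = ["0/0", "0/1", "1|0", "1/1"]
encoded = [encode_genotype(gt) for gt in calls]
print(encoded)  # [0, 1, 1, 2]
```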

A genotype matrix can be used for analyses such as GWAS, QTL, PCA, and kinship.

This task is available on data nodes containing variants.

Click a variants data node and choose **Variant analysis > Create genotype matrix** from the task menu:

<div align="left"><figure><img src="/files/ED3V5pAlhkZbSmuYakRF" alt="" width="563"><figcaption></figcaption></figure></div>

Configure the dialog and click **Finish** to run the task:

<div align="left"><figure><img src="/files/aXAiqJsPEdjw7zKKj8m1" alt=""><figcaption></figcaption></figure></div>

Merge mode:

* Intersect: use only variants present in all samples from the input variant data node to create the matrix.
* Union: include all variants present in any of the samples from the input data node. If a variant is not found in a sample, its value is 0; in other words, it is treated as a reference homozygous genotype call in that sample.
* Minor Allele Frequency (MAF): the frequency at which the second most common allele occurs in the input samples. The value is between 0 and 1.\
  *Note: sample size matters for this parameter; MAF estimates are more accurate with larger sample sizes.*\
  When this option is selected, a minimum MAF threshold must be specified; variants with MAF values smaller than the cutoff are filtered out. The default is 0.01; variants with MAF < 0.01 are typically considered rare.
* Linkage Disequilibrium (LD) Pruning: LD measures the association of alleles at different loci; a high r<sup>2</sup> value indicates high correlation between two variants. r<sup>2</sup> ranges from 0 to 1. When this option is selected, redundant SNVs are removed to create a set of approximately independent variants.
  * Window size: the number of variants considered in a window at a time.
  * Step size: the number of SNVs by which the window slides forward at each iteration.
  * LD r<sup>2</sup>: when the r<sup>2</sup> of an SNV pair is higher than the specified value, the variant that is in high LD with more of the other variants in the window is removed. If both SNVs have the same number of correlated partners, the variant with the higher genomic order is removed<sup>1</sup>. Typically, SNVs with r<sup>2</sup> <= 0.2 are considered independent.
* Variant Quality Filtering: based on the QUAL field in the VCF file. When this option is selected, SNVs with QUAL values lower than the specified number are removed.
* Hardy-Weinberg Equilibrium (HWE): HWE assumes that a variant's genotypes follow expected proportions in the population. The observed genotype counts in the input samples are compared with the expected genotype counts using an exact test<sup>2</sup>. When this option is selected, SNVs with a p-value smaller than the specified value are considered to deviate from HWE and are removed.
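The Intersect and Union merge modes and the MAF cutoff described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the task's implementation: the input layout (a dictionary of per-sample calls keyed by variant ID), the function names, and the biallelic MAF formula `min(p, 1 - p)` are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical input: sample ID -> {variant ID -> genotype encoded as 0/1/2}.
SampleCalls = Dict[str, Dict[str, int]]


def build_matrix(samples: SampleCalls, mode: str = "intersect"
                 ) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (sample_ids, variant_ids, matrix) for the chosen merge mode."""
    variant_sets = [set(calls) for calls in samples.values()]
    if mode == "intersect":
        variants = set.intersection(*variant_sets)
    elif mode == "union":
        variants = set.union(*variant_sets)
    else:
        raise ValueError(f"unknown merge mode: {mode}")
    variant_ids = sorted(variants)
    sample_ids = list(samples)
    # In union mode, a variant absent from a sample is filled with 0
    # (treated as reference homozygous), as the page describes.
    matrix = [[samples[s].get(v, 0) for v in variant_ids] for s in sample_ids]
    return sample_ids, variant_ids, matrix


def minor_allele_frequency(column: List[int]) -> float:
    """MAF of one biallelic variant from its 0/1/2 column (assumed diploid)."""
    alt_freq = sum(column) / (2 * len(column))
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(variant_ids: List[str], matrix: List[List[int]],
                  min_maf: float = 0.01) -> Tuple[List[str], List[List[int]]]:
    """Drop variants whose MAF is smaller than min_maf."""
    keep = [j for j in range(len(variant_ids))
            if minor_allele_frequency([row[j] for row in matrix]) >= min_maf]
    return ([variant_ids[j] for j in keep],
            [[row[j] for j in keep] for row in matrix])


samples = {"s1": {"rs1": 1, "rs2": 2}, "s2": {"rs1": 0, "rs3": 1}}
print(build_matrix(samples, "intersect"))
print(build_matrix(samples, "union"))
```

With two samples, Intersect keeps only `rs1`, while Union keeps `rs1`, `rs2`, and `rs3` and fills the missing entries with 0.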

The output data node contains the sample-by-variant matrix.
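Because the matrix holds 0/1/2 values with one column per variant, pairwise statistics such as the r<sup>2</sup> used in LD pruning can be illustrated directly on its columns. The sketch below computes r<sup>2</sup> as the squared Pearson correlation between two genotype columns. This page does not specify how the task computes r<sup>2</sup>, so treat the formula, the function name `genotype_r2`, and the handling of constant columns as assumptions.

```python
from typing import List


def genotype_r2(x: List[int], y: List[int]) -> float:
    """Squared Pearson correlation between two genotype columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; returning 0 is an
        # assumption made for this example.
        return 0.0
    return cov * cov / (var_x * var_y)


# Toy sample-by-variant matrix: 4 samples (rows) x 3 variants (columns).
matrix = [
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
    [1, 1, 1],
]


def column(j: int) -> List[int]:
    return [row[j] for row in matrix]


print(genotype_r2(column(0), column(1)))  # identical columns -> 1.0
print(genotype_r2(column(0), column(2)))  # 0.25
```

With an LD r<sup>2</sup> threshold of 0.2, both pairs in this toy matrix would count as correlated.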

### References

1. <https://www.cog-genomics.org/plink/2.0/>
2. Wigginton JE, Cutler DJ, Abecasis GR. A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet. 2005 May;76(5):887-93. doi: 10.1086/429864. Epub 2005 Mar 23. PMID: 15789306; PMCID: PMC1199378.

