# 5-base DNA

## Getting Started

[Logging into ICM](https://help.multiomics.illumina.com/icm)

[Creating a Study from an ICA Project](https://help.multiomics.illumina.com/icm/studies/create-study)

[Viewing Results and Navigating in ICM](https://help.multiomics.illumina.com/icm/analyses/enter-analysis)

## Demo Data

Demo data for following along with this walkthrough is available in the Connected Multiomics Demo Data repository at /Multiomics-Demo-Data/Methylation/Illumina 5-base-solution. The demo data consists of 6 samples from two phenotype groups. This walkthrough outlines analysis steps for exploring the data, identifying differentially methylated regions between the two sample groups, and finding pathways that are overrepresented in the differential test result.

Five files per sample are required to analyze DNA Methylation Prep data in the Connected Multiomics software. Before starting an analysis, add the following five files for each sample from the demo data folder to a study:

* \<sample name>.CX\_report.txt.gz
* \<sample name>.methyl\_metrics.csv
* \<sample name>.mapping\_metrics.csv
* \<sample name>.wgs\_coverage\_metrics.csv
* \<sample name>.M-bias.txt
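Before adding files to a study, it can help to confirm that each sample has all five files. The following is a minimal sketch, not part of Connected Multiomics: it assumes the demo files have been copied to a local folder, and the function name `missing_files` and any sample names passed to it are hypothetical.

```python
from pathlib import Path

# File suffixes listed in this walkthrough; one file of each kind is needed per sample.
REQUIRED_SUFFIXES = [
    ".CX_report.txt.gz",
    ".methyl_metrics.csv",
    ".mapping_metrics.csv",
    ".wgs_coverage_metrics.csv",
    ".M-bias.txt",
]


def missing_files(folder, sample_names):
    """Return {sample_name: [missing file names]} for samples that lack a required file."""
    folder = Path(folder)
    missing = {}
    for sample in sample_names:
        absent = [
            sample + suffix
            for suffix in REQUIRED_SUFFIXES
            if not (folder / (sample + suffix)).is_file()
        ]
        if absent:
            missing[sample] = absent
    return missing
```

An empty dictionary means every listed sample has the complete set of five files.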

These files are generated by DRAGEN analysis. The [CX\_report file](https://illumina.gitbook.io/dna-methylation-prep/8yf1CuwpRlzpAWMSt3ib/additional-information/key-output-files-and-metrics) is the key output file; it contains methylation read counts at single-nucleotide resolution. The metrics files and the M-bias file contain QC metrics for read mapping quality and methylation calling, which are used to generate visualizations in the 5-base Methylation QC task in Connected Multiomics.

## Custom 5-base Methylation Analysis

<figure><img src="/files/eM0kgv6mgS3NPX4Xz52s" alt=""><figcaption></figcaption></figure>

### Creating a Custom Analysis

After all 6 samples are added to a study, follow these steps to create a Custom analysis in Connected Multiomics:

* Click on **+ New Analysis.**
* In the pop-up window, provide a name for the analysis, select **Custom: 5-base Methylation** as the Analysis Type, choose a sample group to be included in the analysis (all samples option is selected by default), and click on the **Run Analysis** button.

<figure><img src="/files/HaOtwcC5tiEAa3gn8uM2" alt=""><figcaption></figcaption></figure>

* Refresh the page to get the latest status of the analysis.
* When the Status is Complete, the analysis has been launched. Click on the analysis tile to enter the analysis module. You will see an ongoing **Import cohort** task that imports the data into this analysis. After the **Import cohort** task is completed, the first data node, called **5-base Methylation**, is generated.
* To review the number of samples and features, hover over the **5-base Methylation** data node. Features refer to CpG sites.

<figure><img src="/files/o3osTYWV6PzHGKLKGJgp" alt=""><figcaption></figcaption></figure>

\
The 5-base Methylation data node contains raw methylated and unmethylated counts for the CpG sites present in the CX reports. For sites with methylation calls on both strands in the CX report, the strands are collapsed: the position on the positive strand is used and the methylation counts are summed. This data node also contains percent methylation levels, which are used in exploratory analyses such as Principal Component Analysis (PCA).
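The strand collapsing described above can be sketched as follows. This is an illustration rather than the software's implementation. It assumes 1-based positions in which a minus-strand CpG call sits one base after its plus-strand partner, and it assumes percent methylation is computed as methylated / (methylated + unmethylated). The function names and example records are hypothetical.

```python
from collections import defaultdict


def collapse_strands(records):
    """Collapse per-strand CpG records onto the positive-strand position.

    records: iterable of (chrom, pos, strand, methylated, unmethylated).
    Assumes 1-based positions, so a minus-strand CpG cytosine at `pos`
    pairs with the plus-strand cytosine at `pos - 1`.
    """
    totals = defaultdict(lambda: [0, 0])
    for chrom, pos, strand, meth, unmeth in records:
        key = (chrom, pos if strand == "+" else pos - 1)
        totals[key][0] += meth
        totals[key][1] += unmeth
    return {key: tuple(counts) for key, counts in totals.items()}


def percent_methylation(meth, unmeth):
    """Percent methylation from raw counts; None when the site has no coverage."""
    total = meth + unmeth
    return 100.0 * meth / total if total else None


# Hypothetical records, for illustration only.
records = [
    ("chr1", 100, "+", 8, 2),
    ("chr1", 101, "-", 6, 4),  # same CpG as the record above, opposite strand
    ("chr1", 250, "+", 0, 5),
]
sites = collapse_strands(records)
```

In this example the two records at chr1:100 and chr1:101 become a single site at chr1:100 with 14 methylated and 6 unmethylated counts.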

### Add Sample Metadata

Use the **Metadata** tab within an analysis to manage sample metadata. Follow these steps to create a new sample attribute called sampleGroup and assign an attribute value to each sample:

* Click the **Metadata** tab. In the **Sample attributes** menu on the left, click **Manage**.
* On the Manage sample attributes page, click **Add new attribute**. Type **sampleGroup** in the **Name** text box, then click **Add**.

<figure><img src="/files/POJTsE4CsuHTKpuTxIw5" alt=""><figcaption></figcaption></figure>

* Click the **+** button to add two category values, **A** and **B**, to the sampleGroup attribute.

<figure><img src="/files/B6d8ViOn79bDciqk8Pto" alt=""><figcaption></figcaption></figure>

* Click **Back to metadata tab**.
* Click **Assign values** under **Sample attributes**. Use the dropdown next to each sample to assign a category value for the sampleGroup attribute. Assign the values shown in the screenshot below.

<figure><img src="/files/fFhmpW65lVso3g98Mi82" alt=""><figcaption></figcaption></figure>

* Click **Apply changes** to save the assigned values.

### 5-base Methylation QC

The **5-base methylation QC** task in Connected Multiomics lets you visualize sample-level QC metrics that describe read mapping quality and CpG methylation calling. The QC metrics are extracted from the DRAGEN analysis metrics files that were ingested into the study. To invoke the **5-base methylation QC** task:

* On the **Analyses** page, click the **5-base Methylation** node.
* Click the **QA/QC** section in the context-sensitive task menu on the right.
* Click **5-base methylation QC**.

After the **5-base methylation QC** task is completed, double-click the task node to open the QC report in a data viewer. The QC report consists of plots and tables organized in 2 sheets. Click a sheet name at the bottom of the data viewer to navigate between sheets.

<figure><img src="/files/2TpjKz1izD9h8a6MRGdJ" alt=""><figcaption></figcaption></figure>

* Sheet **Metrics** shows sample-level QC metric plots. Each sample is a data point, and the points are randomly spread out along the x-axis. The y-axis represents the QC metric. Each plot is overlaid with a violin plot to show the distribution of the QC metric.
  * Percent methylation in samples: Percentage of CpG methylation in each sample.
  * Percent methylation in unmethylated control: Percentage of CpG methylation in the unmethylated control (lambda). A low value indicates good quality.
  * Percent methylation in methylated control: Percentage of CpG methylation in the methylated control (pUC19). A high value indicates good quality.
  * Percent duplicate reads: Percentage of reads marked as duplicates, which result from PCR amplification.
  * Percent mapped reads: Percentage of mapped reads, which indicates the alignment rate.
  * Average autosomal coverage: Mean autosomal coverage across the whole genome. With higher coverage, the methylated and unmethylated counts more accurately reflect the true methylation level at any particular site.
  * QC metrics table: Text representations of the QC metrics plots.

<figure><img src="/files/CrNBEOoCRxdaLU80UFuT" alt=""><figcaption></figcaption></figure>

* Sheet **M-bias** shows M-bias plots for methylation level and coverage across positions on read1 and read2. The M-bias should be consistent across all positions. It is common for the first and last 10 bases to show uneven methylation due to end-repair and sequencing artifacts.

### PCA

The principal components analysis (PCA) scatter plot allows us to visualize similarities and differences between the samples in a dataset. To invoke a PCA task:

* Click on the **5-base Methylation** node.
* Click **Exploratory analysis** section in the context-sensitive task menu.
* Click **PCA**.
* Set the calculation to use the top **100,000** features with the highest **variance**.
* Keep the rest of the parameters as default, and click **Finish**.

<figure><img src="/files/BtxKy0C2NP2eCG6tCIPx" alt=""><figcaption></figcaption></figure>

After the PCA task is completed, double-click the **PCA** node to view the PCA plot in a data viewer.

<figure><img src="/files/B4nDuLHFpDQyaHUlsrpD" alt=""><figcaption></figcaption></figure>

* The scatter plot shows the data distribution among the first three PCs. Each sample is a data point.
* The scree plot (top right panel) shows the variance represented by each PC.
* The component loading table (bottom right panel) shows the correlation between CpG methylation sites and each PC.
* For additional information on PCA, refer to the [PCA documentation](/icm/analyses/analysis-functionality/task-menu/exploratory-analysis/pca.md).
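The feature selection step in the PCA setup (keeping the features with the highest variance) can be sketched as below. This is an illustration only and does not run the PCA itself. The function name and the percent methylation values are hypothetical, and the variance estimator used by the software is not stated here; with the same number of samples per feature, population and sample variance give the same ranking.

```python
from statistics import pvariance


def top_variance_features(matrix, n_top):
    """Return the n_top feature ids with the highest variance across samples.

    matrix: {feature_id: [value per sample]}.
    """
    ranked = sorted(matrix, key=lambda f: pvariance(matrix[f]), reverse=True)
    return ranked[:n_top]


# Hypothetical percent methylation values for three CpG sites in four samples.
percent_meth = {
    "chr1:100": [10.0, 12.0, 88.0, 90.0],
    "chr1:250": [50.0, 51.0, 49.0, 50.0],
    "chr2:40": [30.0, 35.0, 60.0, 65.0],
}
selected = top_variance_features(percent_meth, 2)
```

Sites such as chr1:250, whose methylation barely changes between samples, carry little information for separating samples and are dropped first.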

### Detect Differentially Methylated Regions (DMRs)

[DSS](broken://spaces/5WPPw051cYE3Zthy5U7m/pages/onVXlqYRra4JPW2pj7bS) (Dispersion Shrinkage for Sequencing data) detects differentially methylated regions from count data at single-nucleotide resolution. It models methylation counts at each CpG site with a beta-binomial distribution and uses a Wald test to identify differentially methylated loci (DMLs). Nearby DMLs are then merged to form differentially methylated regions (DMRs). Set up a DSS task to identify DMRs between the two sample groups:

* Click on the **5-base Methylation** node.
* Click **Statistics** section in the context-sensitive task menu.
* Click **Differential Methylation**.
* Select **DSS** as the Method to use for differential methylation analysis, then click **Next.**
* Select **sampleGroup** as the factor for analysis, then click **Next.**

<figure><img src="/files/q1SklcQ9stlV8EUx9zLs" alt=""><figcaption></figcaption></figure>

* Drag **A** to the top right Numerator box and **B** to the bottom right Denominator box. Click on **Add comparison**.
* Keep the rest of the settings as default, then click **Finish**.

<figure><img src="/files/6HT6JRKtqvUJ2Uwi4i7f" alt=""><figcaption></figcaption></figure>

When the DSS task is completed, double-click the **A vs B (DMR)** node to open the DMR report. The DSS DMR task report lists regions in rows and test statistics (areaStat, diff.Methy, etc.) in columns. Regions are sorted in descending order of abs(areaStat), so the most significant DMR is listed first. The diff.Methy statistic reports the difference in average methylation between the two groups: a negative value indicates that A is hypomethylated compared to B in the region, while a positive value indicates that A is hypermethylated compared to B. Refer to the [DSS documentation](broken://spaces/5WPPw051cYE3Zthy5U7m/pages/onVXlqYRra4JPW2pj7bS) to learn more about the differential methylation report.
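The ordering and sign convention of the report can be illustrated with a short sketch. It assumes the report has been represented as a list of rows keyed by the column names above; the region names and values are hypothetical, and the helper functions are not part of Connected Multiomics.

```python
def rank_dmrs(dmrs):
    """Sort DMR rows by |areaStat|, largest first."""
    return sorted(dmrs, key=lambda row: abs(row["areaStat"]), reverse=True)


def direction(diff_methy):
    """Describe numerator group A relative to denominator group B."""
    if diff_methy > 0:
        return "hypermethylated in A"
    if diff_methy < 0:
        return "hypomethylated in A"
    return "no difference"


# Hypothetical rows; column names follow the report described above.
dmrs = [
    {"region": "chr1:1000-1500", "areaStat": 42.0, "diff.Methy": 0.31},
    {"region": "chr2:200-900", "areaStat": -120.5, "diff.Methy": -0.45},
    {"region": "chr3:50-300", "areaStat": 15.2, "diff.Methy": 0.12},
]
ranked = rank_dmrs(dmrs)
```

Here the chr2 region ranks first because its areaStat has the largest absolute value, even though the value itself is negative.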

On the DMR report, click on the **volcano icon** ( ![](/files/IBVAyQSZVGSksLDXMJWw) ) next to the comparison name to open a differential methylation plot in a Data Viewer. Each data point in the plot is a region. The plot can be colored based on user-defined hypo- and hypermethylation thresholds:

* Click anywhere within the plot canvas on the top panel to select the plot.
* Click the **Configure** icon on the left, then click **Style**. In the Style dialog, set the **Color by** option to **Significance**.
* Click the **Configure** icon on the left, then click **Statistics**. In the Statistics dialog, set **X threshold** to **-0.2** and **0.2**. Drag the **Y threshold** slider to its maximum.

The regions are now colored in the volcano plot. Hypomethylated regions (diff.Methy < -0.2) are colored blue, and hypermethylated regions (diff.Methy > 0.2) are colored red.

<figure><img src="/files/CfsmgFvHbClkHK5cQbIL" alt=""><figcaption></figcaption></figure>

### Filter DMRs

We recommend filtering DMRs by hypo- or hypermethylation status, using the diff.Methy statistic, so that downstream results show which pathways are hypo- or hypermethylated in the differential comparison. To filter the DMR results to hypermethylated DMRs:

* Click **A vs B (DMR)** node.
* Click **Filtering** section in the context-sensitive task menu.
* Click **Differential analysis filter**.
* Choose **Metadata** as Filter type.
* In Filter criteria section, set Filter features by **include A vs B: diff.Methy > 0.2**, then click **Finish**.

<figure><img src="/files/FR51CLXqPkBl0I9Yu8NI" alt=""><figcaption></figcaption></figure>

This generates a Filtered features list node that contains the DMRs passing the filtering criteria. The same steps can be applied to generate a filtered list of hypomethylated DMRs by setting the filter criteria to include regions with diff.Methy < -0.2. The filtering threshold can be adjusted, and more filtering criteria can be defined, based on your research questions.
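The filter logic amounts to a threshold on diff.Methy in one direction. A minimal sketch, assuming the same row layout as the DMR report and using hypothetical region names and values:

```python
def filter_dmrs(dmrs, threshold=0.2, status="hyper"):
    """Keep DMRs whose diff.Methy passes the threshold in the chosen direction."""
    if status == "hyper":
        return [row for row in dmrs if row["diff.Methy"] > threshold]
    if status == "hypo":
        return [row for row in dmrs if row["diff.Methy"] < -threshold]
    raise ValueError("status must be 'hyper' or 'hypo'")


# Hypothetical rows, for illustration only.
dmrs = [
    {"region": "chr1:1000-1500", "diff.Methy": 0.31},
    {"region": "chr2:200-900", "diff.Methy": -0.45},
    {"region": "chr3:50-300", "diff.Methy": 0.12},
]
hyper = filter_dmrs(dmrs, status="hyper")
hypo = filter_dmrs(dmrs, status="hypo")
```

Regions with a diff.Methy between -0.2 and 0.2, such as the chr3 row, fall into neither list.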

### Annotate DMRs

Next, annotate the filtered DMR list with gene information using an annotation model.

* Click **Filtered feature list** node.
* Click **Region analysis** section in the context-sensitive task menu.
* Click **Annotate regions**.
* The assembly for this demo dataset should be **Homo sapiens (human) - hg38**. Choose **GENCODE Genes - release 44** as the Annotation model, keep the remaining settings as default, and click **Finish**.

<figure><img src="/files/ePKH0PCcCZYz9tMxIIZO" alt=""><figcaption></figcaption></figure>

When the task is completed, double-click the **Annotated regions** node to open the annotation report. The annotation report shows a pie chart of the gene section breakdown for the DMRs and a table in which each row is a DMR and the columns contain the annotated gene information.

* Click **Optional columns** at the top right of the table and tick the **gene\_name** checkbox to display the gene name in the table.

<figure><img src="/files/f9ST6GtY7Oec7pi9GxFP" alt=""><figcaption></figcaption></figure>
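Conceptually, region annotation matches each DMR to the genes it overlaps. The sketch below shows only that interval-overlap idea; it is not the Annotate regions algorithm, which also reports gene sections. The gene names and coordinates are hypothetical, and intervals are assumed to be closed and 1-based.

```python
def annotate_regions(regions, genes):
    """Attach to each region the names of genes whose interval overlaps it.

    regions: list of (chrom, start, end); genes: list of (chrom, start, end, gene_name).
    Intervals are treated as closed, 1-based coordinates.
    """
    annotated = []
    for chrom, start, end in regions:
        hits = [
            name
            for g_chrom, g_start, g_end, name in genes
            if g_chrom == chrom and g_start <= end and start <= g_end
        ]
        annotated.append({"region": (chrom, start, end), "gene_name": hits})
    return annotated


# Hypothetical coordinates, for illustration only.
genes = [
    ("chr1", 1000, 5000, "GENE_A"),
    ("chr1", 8000, 9000, "GENE_B"),
    ("chr2", 1000, 5000, "GENE_C"),
]
regions = [("chr1", 4500, 5200), ("chr1", 6000, 7000)]
result = annotate_regions(regions, genes)
```

The first region overlaps GENE\_A, while the second lies between genes and receives no gene name.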

### Gene Set Enrichment <a href="#gene-set-enrichment" id="gene-set-enrichment"></a>

Gene set enrichment analysis identifies gene sets and pathways that are over-represented in a list of significant genes, providing clues to the biological meaning of your results.

* Click **Annotated regions** node.
* Click **Biological interpretation** section in the context-sensitive task menu.
* Click **Gene set enrichment**.
* Select **KEGG database** as **Database** for pathway enrichment analysis. Choose **Homo sapiens hsa\_v12\_25\_04\_07** from the **KEGG database** dropdown.
* In the Feature identifier section, tick the **Select feature identifier** checkbox, select **gene\_name**, then click **Finish**.

<figure><img src="/files/xYAydHw2DENBw7LAQkh2" alt=""><figcaption></figcaption></figure>

When the task is completed, double-click the **Pathway enrichment** node to open the pathway enrichment report. Each row in the report is a pathway, with an enrichment score and p-value. The report also lists how many genes in the pathway were in the input gene list and how many were not. Click the pathway ID in the first column to view the pathway diagram. On the pathway diagram, click a gene name to open the KEGG page for additional details.

<figure><img src="/files/tGiNJii9qXGJx1tKxZZl" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/RYBgSxJtEm0bhDF7OClT" alt=""><figcaption></figcaption></figure>
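To give a sense of how an over-representation p-value can be derived from the gene counts in the report, here is a minimal sketch based on the hypergeometric distribution. The exact test used by Connected Multiomics is not stated in this walkthrough, so treat the choice of test, the function name, and the parameter names as assumptions.

```python
from math import comb


def enrichment_p_value(in_list_in_pathway, list_size, pathway_size, background_size):
    """Upper-tail hypergeometric p-value: P(X >= in_list_in_pathway).

    X is the number of pathway genes found when drawing `list_size` genes
    at random from a background of `background_size` genes.
    """
    max_overlap = min(list_size, pathway_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(in_list_in_pathway, max_overlap + 1)
    )
    return tail / total
```

For example, `enrichment_p_value(1, 1, 10, 100)` is the chance that a single randomly drawn gene belongs to a 10-gene pathway in a 100-gene background, which is 0.1. A small p-value means the overlap between the input gene list and the pathway is larger than expected by chance.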


