Gene Set Enrichment

What is Gene set enrichment?

Enrichment analysis is a technique commonly used to add biological context to a list of genes, such as a list of significant genes filtered from a differential analysis report. The procedure assigns genes to groups (gene sets) and then uses Fisher's exact test to find groups that are overrepresented in the filtered gene list.

Running Gene set Enrichment

The Gene set enrichment task can be invoked on a differential analysis output data node (filtered or unfiltered) or on a filtered count matrix data node. Because a data node that includes all the features serves as the background, always invoke the task on a data node containing a subset of features to get a meaningful result. Only gene names are used in the computation.

  • Click a Feature list data node

  • Click the Biological interpretation section of the toolbox

  • Click Gene set enrichment

  • There are two options for Database. The KEGG database requires a special license.

  • Gene set database is a user-defined database. The gene sets available for the current Assembly are listed under the Gene set database drop-down list. The assembly is automatically selected, if possible. If the assembly cannot be detected, you can specify it using the drop-down.

  • Select feature identifier (optional) can be used to specify the feature format (e.g. Gene name, Gene ID, Feature ID).

  • Specify the background gene list (optional) lets you choose a feature list as the background. Select the list using the drop-down.

The background gene list is used as the list of possible genes. By default, these are the genes included in the selected gene set database. If your assay limits the genes that could be detected, you may want to specify a background list.

  • Click Finish to run

The result is stored under an Enrichment task node. To open it, double click on the node or select the respective Task report from the context sensitive menu.

Task report

The report is a table with one gene set (pathway) per row (Gene set column; the column entries are hyperlinks when using the distributed GO gene sets), with the category name in the Description column.

The table is sorted by the P-value column in ascending order (equivalently, by Enrichment score in descending order). The enrichment score is the negative natural logarithm of the p-value (P-value column), which is derived from Fisher's exact test on the underlying contingency table. The higher the enrichment score, the more overrepresented the gene set is within the input list of significant genes. The columns can be searched by typing a search term in the respective box (and hitting Enter), or sorted by selecting the double arrow icon.

The contingency table can be displayed by selecting the View gene breakdown chart icon on the right.

The term "list" refers to the list of significant genes, while the term "set" refers to the respective pathway. The first row of the contingency table is also seen in the report, namely the Genes in list and Genes not in list columns.
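The relationship between the contingency table, the p-value and the enrichment score can be illustrated with a minimal Python sketch. It assumes a one-sided test for overrepresentation (the documentation does not state which alternative the task uses), and the function and argument names are hypothetical, not part of the product.

```python
from math import comb, log

def fisher_overrepresentation_p(in_set_in_list, in_set_not_in_list,
                                not_in_set_in_list, not_in_set_not_in_list):
    """One-sided Fisher's exact test: P(overlap >= observed) under the
    hypergeometric null. The one-sided alternative is an assumption."""
    set_size = in_set_in_list + in_set_not_in_list
    list_size = in_set_in_list + not_in_set_in_list
    total = set_size + not_in_set_in_list + not_in_set_not_in_list
    max_overlap = min(set_size, list_size)
    # Count the ways to draw at least the observed overlap from the set.
    favourable = sum(
        comb(set_size, k) * comb(total - set_size, list_size - k)
        for k in range(in_set_in_list, max_overlap + 1)
    )
    return favourable / comb(total, list_size)

def enrichment_score(p_value):
    """Enrichment score as described above: negative natural log of p."""
    return -log(p_value)

# Toy table: first row = genes in set (in list, not in list),
# second row = genes not in set (in list, not in list).
p = fisher_overrepresentation_p(3, 1, 1, 3)
print(p, enrichment_score(p))
```

For this toy table the p-value is 17/70 (about 0.243), which gives an enrichment score of about 1.42. A smaller p-value always yields a larger score, which is why sorting by either column produces the same order.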

The View extra details button provides additional information on the pathway.

In addition to the details already given in the report, a full list of Genes in list and Genes not in list can be inspected and downloaded (Download data) to the local computer as a text file. Use the arrow to expand these sections.

Click on the pathway ID hyperlink in the first column to check the gene network:

Interactive KEGG pathway maps

Each rectangle on the map represents a gene product in the pathway. Gene products are mostly proteins coded by a gene or group of genes, but they could be RNA too. Related pathways are shown as large rounded rectangles. Chemical compounds, DNA or other molecules are shown as circles.

Coloring the map

The pathway map is colored by the first fold-change column in the input Feature list data node. The control panel on the left can be used to configure the colors of the pathway map. In all options, rectangles colored white do not have gene information. Options for coloring include:

  • Fixed color: all genes are colored black.

  • Genes in list: all genes in the list are colored; the color can be configured. Genes not in the list are black.

  • Statistics in the gene list: e.g. FDR, p-value, Fold change etc. Colors can be customized by clicking on the color square.

Feature details

Mousing over a rectangle shows the genes indicated by the rectangle in the tooltip (Figure 8). Genes are listed on rows, with all aliases in the KEGG database included on the row. Genes that are in the list and used to color the rectangle are shown in bold.

On KEGG pathway maps that include chemical compounds, the chemical structure is shown in the tooltip on mouse-over.

Clicking a rectangle opens the page for that gene or group of genes on the KEGG website in a new tab in your web browser.

Click the Save image icon on the upper-left of the view to download a PNG file showing the configured KEGG pathway map to your local computer.

Visualizing gene set enrichment results

If the gene set enrichment table has fewer than 100 results (rows), the categories can be visualized in the Data Viewer.

If needed, filter down the number of results, for instance by using a cut-off based on the enrichment score. Type the cut-off value in the text box beneath the Enrichment score and hit Enter (an example is shown below). Once the number of results falls below 100, the View plots in Data Viewer icon will be displayed. Click the link to open a new Data Viewer session.

Two plots are loaded into Data Viewer. Both plots show enrichment score on the horizontal axis and gene ontology categories (i.e. the ones present in the gene enrichment table) on the vertical axis. The plots show enrichment scores (Enrichment score column of the gene ontology table) and, in addition, the plot on the left uses a color range to depict the enrichment P-value (green = low, red = high P-value).

The same functionality is available for pathway enrichment results.

References

  1. Ashburner M, Ball CA, Blake JA et al. Gene Ontology: tool for the unification of biology. Nat Genetics. 2000; 25:25-29.

  2. The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 2015; 43:D1049-1056.

Recommended citations from the Geneontology.org website

Click here for more information on List management

Correlation Engine Pathway

What is Correlation Engine pathway analysis?

Correlation Engine (CE) Pathway analysis helps determine whether your gene(s) of interest, identified by differential analysis in Connected Multiomics, correspond to gene or protein sets from the GO consortium, MSigDB, TargetScan, and InterPro.

Running Correlation Engine pathway

The Correlation Engine pathway task can be invoked on a differential analysis output data node. The filtered differential analysis output is recommended because it contains the genes of interest for the comparisons.

  1. Click a filtered feature list data node

  2. Click the Biological interpretation section of the toolbox

  3. Click Correlation Engine pathway

  4. In the task set up page, select the appropriate options for Organism and Data type, fill in the Project information and click the Next button

  5. Select one or more contrasts of interest for Correlation Engine pathway analysis

  6. Click Finish to run

The result is stored under a Correlation engine node. To open it, double click on the node or select the respective Task report from the context sensitive menu.

Task report

Use the dropdown list to switch between different contrasts. For each contrast, the report is a table with one pathway per row (Gene set column; the column entries are hyperlinks), with the category name in the Title column. The Taxonomy column indicates the source database, while the Description column provides more information about the pathway.

Illumina has developed the Running Fisher algorithm to perform pathway analysis in CE. See more details about the calculation of Direction, Normalized enrichment score, Enrichment score, and P-value in our technical note:

Visualizing Correlation Engine pathway results

The pathways can be visualized in the Data Viewer only if the report table has fewer than 100 pathways (rows).

To make it easier to visualize, ICM includes the “Open Data Viewer auto session” link. By clicking it, a Data Viewer session will open with the top 30 pathways ranked by Normalized enrichment score for the contrast.

Two plots are loaded into the Data Viewer. Both plots show Normalized enrichment score on the horizontal axis and pathways (i.e. the ones present in the gene enrichment table) on the vertical axis. The plots show Normalized enrichment scores (Normalized enrichment score column of the task report table) and, in addition, the plot on the left uses a color range to depict enrichment directions (blue = Up, red = Down).

To customize the content plotted, filter down the number of results. Type the value in the text box in the column header and hit Enter (an example using a cut-off based on the Normalized enrichment score is shown below). Once the number of results falls below 100, the View plots in Data Viewer icon (“Open Data Viewer custom session”) will be displayed. Click the link to open a new Data Viewer session.

Data Correlation Details: Enrichment Analysis

Gene set ANOVA

Gene set ANOVA lets you perform a one-way ANOVA to compare groups at the gene set level. The method takes a normalized gene expression count matrix as input. A gene set is a group of genes defined by the specified database, such as a GO term or a KEGG pathway.

The model is set up like an ANOVA model for gene expression analysis, except that only one factor can be added. In addition, the task automatically adds the following terms to the model:

  • Gene ID - Since not all genes in a functional group are expressed at the same level, gene ID is added to the model to account for gene-to-gene differences.

  • Factor * Gene ID - The interaction of gene ID with the factor is added to detect changes in the expression pattern of a gene set across the levels of the factor, referred to as disruption. For instance, if some genes in a gene set are up-regulated in the treatment group while others are down-regulated, the gene set shows disruption.
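To make the role of the interaction term concrete, the following Python sketch fits a balanced two-way layout (factor, gene ID, and their interaction) to a toy gene set and reports F statistics for the factor and for the disruption term. It is a textbook fixed-effects decomposition under stated assumptions (balanced design, equal replicates per cell), not the task's actual implementation; the function name and the data are hypothetical.

```python
from statistics import mean

def gene_set_anova(expr):
    """expr[gene][group] -> list of replicate values (balanced design).
    Returns F statistics for the factor and the factor * gene interaction."""
    genes = sorted(expr)
    groups = sorted(expr[genes[0]])
    reps = len(expr[genes[0]][groups[0]])
    # Mean of every gene-by-group cell, then marginal and grand means.
    cell = {(g, k): mean(expr[g][k]) for g in genes for k in groups}
    grand = mean(list(cell.values()))
    gene_mean = {g: mean([cell[g, k] for k in groups]) for g in genes}
    group_mean = {k: mean([cell[g, k] for g in genes]) for k in groups}
    # Sums of squares for factor, interaction (disruption) and residual.
    ss_factor = len(genes) * reps * sum(
        (group_mean[k] - grand) ** 2 for k in groups)
    ss_inter = reps * sum(
        (cell[g, k] - gene_mean[g] - group_mean[k] + grand) ** 2
        for g in genes for k in groups)
    ss_error = sum(
        (x - cell[g, k]) ** 2
        for g in genes for k in groups for x in expr[g][k])
    df_factor = len(groups) - 1
    df_inter = (len(genes) - 1) * (len(groups) - 1)
    df_error = len(genes) * len(groups) * (reps - 1)
    ms_error = ss_error / df_error
    return {
        "F_factor": ss_factor / df_factor / ms_error,
        "F_disruption": ss_inter / df_inter / ms_error,
    }

# Toy gene set: geneA goes up in "treated", geneB goes down.
toy = {
    "geneA": {"control": [1.0, 1.2], "treated": [3.0, 3.2]},
    "geneB": {"control": [3.0, 3.2], "treated": [1.0, 1.2]},
}
print(gene_set_anova(toy))
```

In this toy set the two genes move in opposite directions, so the group means are identical and the factor F statistic is essentially zero, while the interaction F statistic is large. That is the pattern described above as disruption.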

Running Gene set ANOVA

Select the data node with normalized data and then go to Biological interpretation > Gene set ANOVA

Use the first dialog to specify the gene set database. You can run Gene set ANOVA on pathways (currently based on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways) or on other gene set databases. The Gene set size option allows you to restrict your analysis to gene sets of a certain size (i.e. number of genes). Make sure the feature identifier in the data contains the gene symbol/gene name, which is used to map to the database.

Once your choices are made, push Next to proceed.

In the second part of the set up, pick the experimental factor. Only one factor can be selected.

Click Next to set up comparisons:

The box on the left side displays the categories of the selected factor (shown as Factor). Use the arrow buttons (>) to move one of the categories to the Denominator box (that category should be interpreted as the reference category) and the other category to the Numerator box. Confirm your selection by pushing the Add comparison button and the comparison will be added to the Comparisons table.

Click Finish to run. Each comparison will be performed individually and generate its own section in the report.

Click on the Configure icon to access the advanced options.

Gene Set ANOVA Results

When the task completes, double click on the Gene Set ANOVA node to view the report.

Like the GSEA report, the report consists of two parts: the result table on the right and the filter panel on the left.

The comparison (i.e. Denominator vs. Numerator) is given at the top of the table. Each row of the table corresponds to one gene set (pathway) and the gene sets are ranked by the first comparison's p-value in ascending order.

  • View. The icons in the View column open the dot plot or the extra details report (explanations below).

  • Gene set ID. The Gene set IDs are based on the gene set file that was selected during set up. Each ID is a link to the details of the selected set.

  • Gene set size. Number of genes in the set (as specified in the gene set file). Click on the number to download the list of genes.

Click on the dot plot icon to open the viewer

The plot displays the genes of the selected gene set. The X-axis represents the genes within the gene set and the Y-axis represents the mean gene expression value; each dot represents one of the groups in the comparison.

Click on the View extra details icon to open a gene set-specific report page. The model used for the computation is included in this report.

GSEA

GSEA is a bioinformatics tool that determines whether a set of genes (e.g. a gene ontology (GO) group or a pathway) shows statistically significant, concordant differences between two experimental groups (1,2). Briefly, the goal of GSEA is to determine whether the genes belonging to a gene set are randomly distributed throughout the ranked (by expression) list of all the genes that should be taken into consideration (e.g. gene model), or are primarily found at the top or at the bottom of the list.

Prerequisites

To run GSEA, your project has to contain at least one categorical factor with at least two levels (e.g. Treated and Control). If you are running GSEA on RNA-seq data, note that some common normalization transformations, such as fragments/reads per kilobase of transcript per million mapped reads (FPKM/RPKM) or transcripts per million (TPM), are not considered suitable for GSEA. Instead, you should use an approach such as DESeq2 normalisation, trimmed mean of M-values (TMM), or geometric mean.

In the Gene set ANOVA report, each comparison also lists the p-value, FDR, ratio, fold change and LSmean of each comparison group.

  • Disruption: the factor and gene ID interaction term. A p-value and FDR are reported for this term too.

    Running GSEA

    To launch GSEA, select the data node with normalised data and then go to Biological interpretation > GSEA

    Use the first dialog to specify gene sets. You can run GSEA on pathways (currently based on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways) or on other gene set databases. When using the KEGG option, the KEGG database (i.e. the species) is automatically set, based on the upstream nodes. The Gene set size option allows you to restrict your analysis to gene sets of a certain size (i.e. number of genes).

    Once your choices are made, push Next to proceed.

    In the second part of the set up, pick the experimental factor for GSEA.

    The GSEA task only computes one factor at a time. If you select more than one factor, the computation will be performed on each one individually. Click Next to set up comparisons:

    The box on the left side displays the categories of the selected factor (shown as Factor). Use the arrow buttons (>) to move one of the categories to the Denominator box (that category should be interpreted as the reference category) and the other category to the Numerator box. Confirm your selection by pushing the Add comparison button and the comparison will be added to the Comparisons table.

    The Low value filter is turned on by default and removes all genes with a lowest average coverage of 1.0 or below; if a filter feature task was performed before this task, the default low-value filter is set to None.

    Push Finish to launch GSEA with the default settings. Each comparison will be performed individually and generate its own section in the report.

    Click on the Configure icon to access the advanced options.

    The number of data permutations (needed to calculate the normalised enrichment scores) can be controlled using the Permutations option. Each permutation randomly reassigns the group labels for a given gene; the resulting random order is then used to compute the score for each gene. Finally, confirm whether the input data is on a log scale.
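The idea of permuting group assignments can be sketched in a few lines of Python. This is an illustration only: it uses a difference of group means as a stand-in score for a single gene (the actual ranking metric is not stated here), and the function names, the seed and the data are hypothetical.

```python
import random
from statistics import mean

def rank_metric(values, labels, numerator, denominator):
    """Stand-in per-gene score: difference of group means (an assumption)."""
    num = [v for v, lab in zip(values, labels) if lab == numerator]
    den = [v for v, lab in zip(values, labels) if lab == denominator]
    return mean(num) - mean(den)

def permutation_p_value(values, labels, numerator, denominator,
                        n_permutations=1000, seed=0):
    """Shuffle the group labels, recompute the score each time, and report
    how often a shuffled score is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(rank_metric(values, labels, numerator, denominator))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(rank_metric(values, shuffled, numerator, denominator)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

values = [5.1, 5.3, 5.2, 1.0, 1.2, 1.1]
labels = ["Treated", "Treated", "Treated", "Control", "Control", "Control"]
print(permutation_p_value(values, labels, "Treated", "Control"))
```

More permutations give a finer-grained null distribution at the cost of run time. With only three samples per group there are just 20 distinct label assignments, so the smallest achievable p-value is limited by the design regardless of the Permutations setting.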

    GSEA Results

    When the task completes, double click on the GSEA task node to view the report.

    Like the Gene set ANOVA report, the report consists of two parts: the GSEA result table on the right and the filter panel on the left.

    The comparison (i.e. Denominator vs. Numerator) is given at the top of the GSEA table. Each row of the table corresponds to one gene set (pathway) and the gene sets are ranked by the first comparison's normalized enrichment score in descending order.

    • View. The icons in the View column open the enrichment plot or the extra details report (explanations below).

    • Gene set ID. The Gene set IDs are based on the gene set file that was selected during set up. Each ID is a link to the details of the selected set.

    • Gene set size. Number of genes in the set (as specified in the gene set file). Click on the number to download the list of genes.

    • Enrichment score. The enrichment score is the primary result of GSEA; it reflects the degree to which the current gene set is overrepresented at the top or the bottom of the ranked list of all the genes in the gene model (for details, see the References). The higher the enrichment score the more overrepresented (enriched) the gene set is.

    • Normalised score. Normalisation of the enrichment score takes into account the size of the gene set. We recommend using normalised values for filtering.

    • P-value. P-value estimates the statistical significance of the enrichment score.

    • FDR. False discovery rate (FDR) is used to control for multiple testing. We recommend using FDR values for filtering.

    Click on the View enrichment report icon to open a new Data viewer session with the per gene set report. The selected gene set is in the title, at the top of the canvas (Enrichment profile). To quickly switch to another gene set, use the Axis > Content drop-down list. The individual plots are as follows:

    • Enrichment score. The algorithm walks down the ranked list of all the genes in the model, increasing the running sum (y axis) each time when a gene in the current gene set is encountered. Conversely, the running-sum is decreased each time a gene not in the current gene set is encountered. The magnitude of the increment depends on the correlation of the gene with the experimental factor. The enrichment score is then the maximum deviation from zero encountered in the random walk (the summit of the curve).

    • Gene set hits. Each vertical line shows the location of a gene from the current gene set, within the ranked list of all the genes in the model.

    • Rank metric. The plot shows the value of the ranking metric (y axis) as you move down the ranked list of all the genes in the model (x axis). The ranking metric measures a gene’s correlation with the attribute specified in the comparison.

    Click on the View extra details plot to open a gene set-specific report page.

    Leading edge genes: the subset of genes that contribute most to the enrichment score (ES). For a positive ES, the leading edge subset is the set of members that appear in the ranked list prior to the peak score. For a negative ES, it is the set of genes that appear subsequent to the peak score.
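The running sum, the enrichment score and the leading edge subset described above can be sketched in Python as follows. The sketch follows the weighted running-sum statistic from reference 1 with a weight of 1; whether the task uses exactly this weighting is an assumption, and the function name and toy data are hypothetical. It assumes the set has at least one member and one non-member in the ranked list, with non-zero metric values for the members.

```python
def enrichment_profile(ranked_genes, metric, gene_set, weight=1.0):
    """Walk down the ranked list and return (ES, running sum, leading edge)."""
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    # Normaliser so that all increments for set members sum to 1.
    hit_norm = sum(abs(metric[g]) ** weight
                   for g, hit in zip(ranked_genes, in_set) if hit)
    running, total = [], 0.0
    for g, hit in zip(ranked_genes, in_set):
        if hit:
            step = abs(metric[g]) ** weight / hit_norm  # weighted increment
        else:
            step = -1.0 / n_miss                        # uniform decrement
        total += step
        running.append(total)
    # ES is the maximum deviation from zero along the walk.
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    es = running[peak]
    if es >= 0:
        leading = [g for g in ranked_genes[:peak + 1] if g in gene_set]
    else:
        leading = [g for g in ranked_genes[peak + 1:] if g in gene_set]
    return es, running, leading

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
es, running, leading = enrichment_profile(ranked, metric, {"g1", "g2"})
print(es, leading)
```

In the toy example both set members sit at the top of the ranked list, so the running sum climbs to its peak after the second gene and both genes form the leading edge. Moving the set to the bottom of the list ({"g5", "g6"}) gives a negative ES of the same magnitude.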

    The filter panel is used to narrow the list of gene sets. Results shows the number of gene sets currently in the table. Filtering can be performed on: Gene set ID (search for the numeric ID), Gene set description (search for a key word), Gene set size (number of genes in the set), Enrichment score, Normalised enrichment score, P-value, FDR. Click on the black triangle to open the controls for each filter. To remove all the filters, click on the Clear filter link.

    Click the Generate filtered node button to perform the filter task based on the specified criteria.

    References

    1. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545-15550. doi:10.1073/pnas.0506580102

    2. Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34(3):267-273. doi:10.1038/ng1180

    Biological interpretation

    Connected Multiomics offers biological interpretation tools that can provide additional insight into lists of genes, such as significantly different genes between experimental groups.

    • Gene Set Enrichment

    • GSEA

    • Correlation Engine pathway