Connected Multiomics provides rigorous statistical methods for analyzing your data. The tasks in this section include:
Powerful Partek Flow statistical analysis tools help identify differential expression patterns in the dataset. These can take into account a wide variety of data types and experimental designs.
Welch's ANOVA
The algorithm details for DESeq2 can be found in the external DESeq2 documentation.
If the value of the raw count includes a decimal fraction, the value will be rounded to an integer before DESeq2 is performed. Before you run this task, we recommend that you first remove (filter out) features expressed at a low level and then perform normalization using Median ratio (DESeq2 only).
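For context, here is a minimal R sketch (outside Connected Multiomics) of the same preprocessing constraint; the `counts` matrix and `condition` factor are assumed placeholders:

```r
library(DESeq2)

# DESeq2 requires integer raw counts, so fractional values are rounded
# before the data set is constructed (mirroring the behavior described above).
counts <- round(counts)
storage.mode(counts) <- "integer"

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = data.frame(condition = condition),
                              design    = ~ condition)
dds <- DESeq(dds)  # size factors use the median-ratio method
```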
Note: DESeq2 differential analysis can only be performed on the output data nodes of normalization methods that produce library sizes:
TMM, CPM, Upper Quartile, Median ratio, Postcounts
Both the Kruskal-Wallis and Wilcoxon tests are rank-based tests; such tests are generally advised for larger sample sizes. Both can take only one factor into account at a time. Kruskal-Wallis can be performed on an attribute with two or more subgroups.
The Wilcoxon test is a close alternative to the Kruskal-Wallis task. This test is also called the "Wilcoxon Rank-Sum Test" or "Mann-Whitney U Test". When you perform a comparison on two groups, the task first filters the data to include only those two groups and then performs the differential analysis.
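As a point of reference, the base-R equivalents of these two tests for a single feature look like this; `expr` and `group` are assumed placeholders:

```r
# Kruskal-Wallis: one factor with two or more subgroups
kruskal.test(expr ~ group)

# Wilcoxon rank-sum (Mann-Whitney U): exactly two groups, so subset first
two <- group %in% c("A", "B")
wilcox.test(expr[two] ~ droplevels(group[two]))
```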
To invoke the Kruskal-Wallis test, select any count-based data node; these include:
If the task report is produced but the results are missing for some features (represented by "?"), it may be because something went wrong with the estimation procedure. To better understand this, use the information available in the Extra details report. This type of information is present for many tasks, including Differential Analysis and Survival Analysis.
Click the Extra details report icon for the feature of interest to display the report.
When the estimation procedure fails, a red triangle will be present next to the information criteria value. Hover over the triangle to see a detailed error message.
In many cases, estimation failure is due to low expression; filtering out low-expression features or choosing an appropriate normalization method will resolve this issue.
Sometimes the estimation results are not missing, but the reported values look inadequate. If this is the case, the Extra details report may show that the estimation procedure generated a warning, and the triangle is yellow. To remove suspicious results from the report, set Use only reliable estimation results to Yes.
In the DESeq2 advanced options configuration dialog, there is a reference selection option:
A reference level is specified for each categorical factor in the model, and the result may depend on the choice. In R, the reference level is typically chosen by default whenever a categorical factor is present in the model. This Connected Multiomics option allows the user to specify exactly the same reference level as in an R script, if need be (e.g., to compare the results with R).
The report produced by DESeq2 is similar to the ANOVA report; each row is a feature and columns include p-value, FDR p-value, and fold change in linear scale for each contrast.
In R, shrinkage of log2 fold changes is a separate step performed by the lfcShrink() function. Connected Multiomics implements the shrinkage method corresponding to the "ashr" option in lfcShrink(). The default shrinkage option in lfcShrink is "apeglm", but that method is unable to produce results for some comparisons, whereas "ashr" has no restrictions. The fold change shrinkage results are produced in the "Shrunken Log2(Ratio)" and "s-value" columns of the DESeq2 task report.
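In R, the corresponding call might look as follows; the contrast levels ("treated", "control") are assumed placeholders:

```r
library(DESeq2)

# Shrink log2 fold changes with the "ashr" method; `dds` is a fitted
# DESeqDataSet. With svalue = TRUE, s-values are reported instead of FDR.
res_shrunk <- lfcShrink(dds,
                        contrast = c("condition", "treated", "control"),
                        type     = "ashr",
                        svalue   = TRUE)
head(res_shrunk$log2FoldChange)  # "Shrunken Log2(Ratio)"
head(res_shrunk$svalue)          # "s-value"
```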
In addition to the issues addressed in Differential Analysis, DESeq2 may generate missing values in the multiplicity adjustment columns (such as FDR) if "independent filtering" is enabled in Advanced Options:
"Independent filtering" tries removing some features with low expression in order to increase the statistical power. For such removed features, the p-value is reported but FDR and similar multiplicity adjustment measures are set to "?". In order to avoid the missing values in the report, set the option to "No".
Love MI, Huber W, and Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 2014;15(12): 550.
Gene counts
Transcript counts
Normalized counts
After clicking on the chosen node:
Select Statistics > Differential analysis in the context-sensitive menu.
Select Kruskal-Wallis or Wilcoxon.
Select a specific factor for analysis and click the Next button to setup the comparisons.
Define the comparisons by dragging and dropping each group into the Denominator and Numerator boxes and click Add comparison.
If the data has not been filtered upstream, the Low value filter box will be checked by default. Similarly, the Default normalisation will be selected if the software detects that the data has not been previously normalised.
If there are tied ranks of feature expression values, the default is to not use tie correction, which corresponds to scanpy.tl.rank_genes_groups(tie_correct = False).
The results of the analysis will appear similar to those of ANOVA/LIMMA-trend/LIMMA-voom. However, for Kruskal-Wallis, the column indicating mean expression levels for each group will display the median instead.
To see the results for as many features as possible, regardless of how reliable they are, set Use only reliable estimation results to No; results will then be reported unless there is an outright estimation failure. For example, DESeq2 uses Cook's distances to flag features with outlying expression values; if Use only reliable estimation results is set to Yes, the p-values for such features are not reported, which may lead to some missing values in the report.


This task can be invoked from a count matrix data node or a clustering task report (Statistics > Compute biomarkers). It performs Student's t-tests on the selected attribute, comparing one subgroup at a time vs. all the others combined. By default, the up-regulated genes are reported as biomarkers.
In the set-up dialog, select the attribute from the drop-down list. The available attributes are categorical attributes which can be seen on the Data tab (i.e. project-level attributes), as well as data node-specific annotations.
By default, the result outputs the features that are up-regulated by at least a 1.5 fold change (in linear scale) for each subgroup compared to the others.
The result is displayed in a table in which each column is a subgroup and each row is a feature. Features are ranked by ascending p-value within each subgroup. If a subgroup has fewer biomarkers than the others, the "extra" fields for that subgroup will be left blank.
The report table displays only the top 10 biomarkers for each cluster; the Download link provides the full results table.
Furthermore, the Download link (upper-left corner of the table report) downloads a .txt file to the local computer (default file name: Biomarkers.txt), which contains the full report: all the genes with fold change > 1.5, with corresponding fold change and p-values.
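Conceptually, the one-vs-rest comparison behind this report can be sketched in R as follows; `mat` (a features-x-cells log2 expression matrix), `cluster`, and the helper name are assumptions for illustration, not the product's implementation:

```r
# Hypothetical sketch: t-test each feature in one subgroup vs. all others,
# keeping features up-regulated by at least 1.5-fold (linear scale).
biomarkers_for <- function(mat, cluster, level, min_fc = 1.5) {
  in_grp <- cluster == level
  pvals  <- apply(mat, 1, function(x) t.test(x[in_grp], x[!in_grp])$p.value)
  fc     <- 2^(rowMeans(mat[, in_grp, drop = FALSE]) -
               rowMeans(mat[, !in_grp, drop = FALSE]))  # linear fold change
  res <- data.frame(p.value = pvals, fold.change = fc)
  res <- res[res$fold.change >= min_fc, ]   # up-regulated only
  res[order(res$p.value), ]                 # ranked by ascending p-value
}
```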
Spatially variable genes (SVGs) are genes whose expression patterns vary significantly across different spatial locations within a tissue. Identifying SVGs is a key step in spatial transcriptomics analysis and can reveal biologically meaningful spatial expression patterns, tissue architecture, cell-type niches, or gradients of signaling molecules.
We have implemented PROST, a highly scalable algorithm for SVG detection [1].
The task can be invoked on any non-normalised node containing spatial data; we recommend filtering cells and genes before running the task. The analysis is species-agnostic.
Select the appropriate node and click on 'Statistics > Spatially Variable Genes'
Edit the task settings as necessary:
Note that the percentage parameter can be adjusted depending on dataset size; this may affect running times.
Adjust the advanced settings as needed:
We currently recommend using 2,000 HVGs for PROST calculation in order to make the task more scalable on large datasets. This parameter can be increased to include all genes in the data, but may severely affect performance.
Click Finish to run the task.
Once the task has completed, you will see a new 'Spatially variable genes' node on the task graph:
Double-clicking the node will open the task report, a table of the most significant SVGs identified:
The table contains the genes identified as SVGs and the PROST Index (PI) per feature, an indicator of spatial variability [1]. The results can be downloaded as a table, or visualised in the Data viewer. Here is an example:
[1] Liang, Yuchen, et al. "PROST: quantitative identification of spatially variable genes and domain detection in spatial transcriptomics." Nature Communications 15.1 (2024): 600.
Welch's ANOVA is similar to one-way ANOVA, but it does not have the equal variance assumption. It applies to data with an attribute that has more than two groups, where the variances among the groups are not equal. It is based on weighted means. When the attribute has only two groups, it is equivalent to the unequal variance t-test (also known as Welch's t-test).
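In base R, the same test is available as oneway.test(); `expr` and `group` are assumed placeholders:

```r
# Welch's ANOVA: var.equal = FALSE (the default) drops the
# homogeneity-of-variance assumption.
oneway.test(expr ~ group, var.equal = FALSE)

# With exactly two groups this reduces to Welch's t-test:
t.test(expr ~ group, var.equal = FALSE)
```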
To invoke Welch's ANOVA, select any count-based data node; these include:
Gene counts
Transcript counts
Normalized counts
After clicking on the chosen node:
Select Statistics > Differential analysis in the context-sensitive menu.
Select Welch's ANOVA.
Select a specific factor for analysis and click the Next button to setup the comparisons.
Define the comparisons by dragging and dropping each group into the Denominator and Numerator boxes and click Add comparison.
If the data has not been filtered upstream, the Low value filter box will be checked by default. Similarly, the Default normalisation will be selected if the software detects that the data has not been previously normalised.
The advanced options configuration allows you to choose different multiple test correction and report options.
The results of the analysis will appear similar to those of the ANOVA/LIMMA-trend/LIMMA-voom reports.
















Hurdle model is a statistical test for differential analysis that utilizes a two-part model: a discrete (logistic) part for modeling zero vs. non-zero counts and a continuous (log-normal) part for modeling the distribution of non-zero counts. In RNA-Seq data, the discrete part can be thought of as modeling whether or not a gene is expressed, and the continuous part as modeling how much it is expressed if it is. Hurdle model is well suited to data sets where features have many zero values, such as single cell RNA-Seq data.
On default settings, Hurdle model is equivalent to MAST, a published differential analysis tool designed for single cell RNA-Seq data that uses a hurdle model [1].
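For reference, a minimal MAST call in R looks like the sketch below; the object names and the condition coefficient "conditionstim" are assumptions for illustration:

```r
library(MAST)

# `logcounts` is a features-x-cells matrix of log2-scale expression;
# `cdat` and `fdat` are cell- and feature-level annotation data frames.
sca <- FromMatrix(logcounts, cData = cdat, fData = fdat)

# Fit the hurdle model (discrete + continuous parts), condition as a factor.
fit <- zlm(~ condition, sca)

# Likelihood-ratio test on the assumed "stim" level of `condition`.
res <- summary(fit, doLRT = "conditionstim")
```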
We recommend normalizing your data prior to running Hurdle model, but it can be invoked on any counts data node.
Click the counts data node
Click the Differential analysis section in the toolbox
Click Hurdle model
Select the factors and interactions to include in the statistical test
Numeric and categorical attributes can be added as factors. To add attributes as factors, check the attribute check boxes and click Add factors. To add interactions between attributes, select at least two attributes by clicking check boxes and click Add interaction.
Click Next
Define comparisons between factor or interaction levels
Adding comparisons in Hurdle model uses the same interface as other differential analysis tasks. Start by choosing a factor or interaction from the Factor drop-down list. The levels of the factor or interaction will appear in the left-hand panel. Select levels in the panel on the left and click the > arrow buttons to add them to the top or bottom panels on the right. The control level(s) should be added to the bottom box and the experimental level(s) should be added to the top box. Click Add comparison to add the comparison to the Comparisons table. Only comparisons in the Comparisons table will be included in the statistical test.
Click Finish to run the statistical test
Hurdle model produces a Feature list task node. The results table and options are the same as for other differential analysis tasks. The percentage of cells where the feature is detected (value is above the background threshold) in the different groups (Pct(group1), Pct(group2)) is calculated and included in the Hurdle model report.
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default. If you check the Storey q-value, an extra column with q-values will be added to the report.
There are situations when a model estimation procedure does not fail outright, but still encounters some difficulties. In this case, it can even generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Shows the current scale of the input data for this task
Set the threshold for a feature to be considered expressed for the two-part hurdle model. If the feature value is greater than the specified value, it is considered expressed. If the upstream data node contains log-transformed values, be sure to specify the value on the same log scale. Default value is 0.
Applies shrinkage to the error variance in the continuous (log-normal) part of the hurdle model. The error term variance will be shrunk towards a common value and a shrinkage plot will be produced on the task report page if enabled. Default is Enabled.
Applies shrinkage to the regression coefficients in the discrete (logistic) part of the hurdle model. The initial versions of MAST contained a bug that was fixed in its R source in March 2020. However, for the sake of reproducibility the fix was released only on a topic branch in MAST Github [2] and the default version of MAST remained as is. To install the fixed version of MAST in R, run the following R script.
In Connected Multiomics, the user can switch between the fixed and default version by selecting Fixed version or Default version, respectively. To disable the shrinkage altogether, choose Disabled.
[1] Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., ... & Linsley, P. S. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome biology, 16(1), 278.
[2] MAST topic branch that contains the regression coefficient shrinkage fix: https://github.com/RGLab/MAST/tree/fix/bayesglm
Differential methylation is used to detect differentially methylated CpG loci (DML) or regions (DMR) between two conditions. The method is based on the Bioconductor package DSS (Dispersion Shrinkage for Sequencing data); it is a count-based test. Detailed implementation can be found in the DSS package documentation.
The current DSS implementation in Connected Multiomics supports DML/DMR detection from a two-group comparison. DML/DMR detection for general experimental designs and for experiments without replicates is not supported.
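For orientation, the underlying DSS workflow in R is roughly as follows; the sample names and thresholds are assumed placeholders:

```r
library(DSS)

# `BSobj` is a BSseq object built from per-CpG total and methylated counts.
dml <- DMLtest(BSobj,
               group1 = c("ctrl1", "ctrl2"),
               group2 = c("case1", "case2"),
               smoothing = TRUE, smoothing.span = 500)  # default span is 500

dml_sig <- callDML(dml, p.threshold = 0.001)  # loci-level (DML) results
dmr     <- callDMR(dml, p.threshold = 0.01)   # region-level (DMR) results
```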
This task can be invoked from the imported 5-base Methylation data node, which contains total read count and methylated read count for each CpG site.
Click on the 5-base Methylation data node and choose Statistics > Differential Methylation
Click Next. Select a categorical factor that has the two groups to compare and click Next
Setup the comparison(s) based on the factor selected:
The subgroups of the factor are displayed in the left panel; click to select one and move it to one of the boxes on the right. The difference calculation for the comparison will use the group in the top box minus the group in the bottom box. The dialog setup is similar to ANOVA/LIMMA-trend/LIMMA-voom.
Click on Configure in Advanced options; the smooth span can be customized (the default value is 500). The p-value settings for DML and DMR will be used to filter the results.
After applying the advanced options, click Finish to run the task.
The task will generate two data nodes: DML and DMR which represent the differential methylation at loci level and region level respectively.
Double click on DML node to open the report:
In this report, each row is a locus which passed the p-value cutoff set in the advanced dialog:
chr: Chromosome where the CpG site is located
pos: Genomic base pair location of the CpG site
pval: Raw p-value from the Wald test for the differential methylation at this site
fdr: Adjusted p-value based on Benjamini-Hochberg method
The usage of the left filter panel is the same as in other task reports.
Double-click to open the DMR report. This result is based on the DML results.
In this report, each row is a region of a cluster of CpG loci that show consistent differential methylation between the two groups.
chr: Chromosome where the region is located
start: Start position of the region in base pairs
end: Stop position of the region in base pairs
length: Length of the region in base pairs
Feng, Hao, Karen N Conneely, and Hao Wu. 2014. “A Bayesian Hierarchical Model to Detect Differentially Methylated Loci from Single Nucleotide Resolution Sequencing Data.” Nucleic Acids Research 42 (8): e69–e69.
Park, Yongseok, and Hao Wu. 2016. “Differential Methylation Analysis for BS-Seq Data Under General Experimental Design.” Bioinformatics 32 (10): 1446–53.
Wu, Hao, Chi Wang, and Zhijin Wu. 2012. “A New Shrinkage Estimator for Dispersion Improves Differential Expression Detection in RNA-Seq Data.” Biostatistics 14 (2): 232–43.
Wu, Hao, Tianlei Xu, Hao Feng, Li Chen, Ben Li, Bing Yao, Zhaohui Qin, Peng Jin, and Karen N Conneely. 2015. “Detection of Differentially Methylated Regions from Whole-Genome Bisulfite Sequencing Data Without Replicates.” Nucleic Acids Research 43 (21): e141–e141.
The ANOVA method applies a specified lognormal model to all the features.
The Descriptive statistics task can be invoked on a matrix data node, e.g. a Gene counts or Normalized counts data node in a bulk RNA-Seq analysis pipeline, or a Single cell counts data node. It calculates measures of central tendency and variability on the observations or features of the matrix data.
Click on a matrix data node
Correlation analysis is used to test the relationship between two numeric variables. It determines the strength and direction of the association between them. The methods included in Connected Multiomics are linear correlation (Pearson's correlation) and rank correlations (Spearman's rank correlation and Kendall's tau correlation).
There are four formats for computing correlation, described below: feature many-to-one correlation, the similarity matrix, correlation across assays, and the sample correlation plot.
Additional columns in the DML report:
diff: Difference in methylation level between the groups. Positive values indicate higher methylation in group 1; negative values indicate higher methylation in group 2.
mu1 and mu2: Average methylation level in group 1 and group 2
diff.se: Standard error of the estimated methylation difference between the two groups
stat: Wald test statistic used to assess the significance of the methylation difference
phi1 and phi2: Dispersion parameters estimated for group 1 and group 2. They represent the biological variability in methylation level within each group; higher values indicate more variability.
postprob.overThreshold: Posterior probability that the methylation difference between the two groups exceeds a specified threshold, delta. By default, delta is 0.
Additional columns in the DMR report:
nCG: Number of CpG sites within the region
abs(areaStat): Absolute value of areaStat. A large value indicates strong evidence of differential methylation.
diff.Methy: Difference in average methylation between the two groups
meanMethy1 and meanMethy2: Average methylation level across the region in group 1 and group 2, respectively
areaStat: Sum of the test statistics (stat in DML) across all the CpG sites in the region









```r
# Uninstall the default version of MAST, if it's installed.
remove.packages("MAST")

# Install devtools, if it's not installed yet.
install.packages("devtools")
library("devtools")

# Install the fixed version from the topic branch
# (https://github.com/RGLab/MAST/tree/fix/bayesglm).
install_github("RGLab/MAST", ref = "fix/bayesglm")
library(MAST)
```

To set up an ANOVA model or the alternative Welch's ANOVA (which is used on normally distributed data that violates the assumption of homogeneity of variance), select factors from the sample attributes. The factors can be categorical or numeric attributes. Click a check button to select a factor and click the Add factors button to add it to the model.
LIMMA-trend and LIMMA-voom setup dialogs are identical to ANOVA's setup.
Note: the LIMMA-voom method can only be invoked on the output data nodes of normalization methods that produce library sizes:
TMM, CPM, Upper Quartile, Median ratio, Postcounts
When more than one factor is selected, click Add interaction button to add interaction term of the selected factors.
Once a factor is added to the model, you can specify whether the factor is a random effect (check Random check box) or not.
Most factors in an analysis of variance are fixed factors, i.e. the levels of that factor represent all the levels of interest. Examples of fixed factors include gender, treatment, genotype, etc. However, in experiments that are more complex, a factor can be a random effect, meaning the levels of the factor only represent a random subset of all of the levels of interest. Examples of random effects include subject and batch. Consider the example where one factor is type (with levels normal and diseased), and another factor is subject (the subjects selected for the experiment). In this example, “Type” is a fixed factor since the levels diseased and normal represent all conditions of interest. “Subject”, on the other hand, is a random effect since the subjects are only a random sample of all the levels of that factor. When model has both fixed and random effect, it is called a mixed model.
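To make the fixed-vs-random distinction concrete, a mixed model for the example above could be written in R with the lme4 package (an illustration, not the product's implementation; `df` and its columns are assumed):

```r
library(lme4)

# Type (normal/diseased) is a fixed effect; Subject is a random effect.
fit <- lmer(expression ~ Type + (1 | Subject), data = df)
summary(fit)
```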
When more than one factor is added to the model, click on the Cross tabulation link at the bottom to view the relationship between the factors in a different browser tab.
Once the model is set, click the Next button to set up comparisons (contrasts).
Start by choosing a factor or interaction from the Factor drop-down list. The subgroups of the factor or interaction will be displayed in the left panel; click to select one or more levels or subgroup names and move them to one of the boxes on the right. The ratio/fold change calculation for the comparison will use the group in the top box as the numerator and the group in the bottom box as the denominator. When multiple levels (groups) are in the numerator or denominator box(es): in Combine mode, clicking the Add comparison button combines all numerator levels and all denominator levels into a single comparison in the Comparisons table below; in Pairwise mode, clicking the Add comparison button splits the numerator and denominator levels into a factorial set of comparisons, i.e. it adds a comparison for every numerator level paired with every denominator level. Multiple comparisons from different factors can be added from the specified model.
Click on Configure to customize Advanced options.
The Multiple test correction section is the same as the matching GSA advanced option (see the GSA advanced options).
Report option
Use only reliable estimation results: There are situations when a model estimation procedure does not fail outright, but still encounters some difficulties. In this case, it can even generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Report p-value for effects: If set to No, only the p-value of the comparison is displayed in the report; the p-values of the factors and interaction terms are not shown in the report table. If set to Yes, in addition to the comparison's p-value, type III p-values are displayed for all the non-random terms in the model.
Shrinkage to error term variance: by default, None is selected, which is the lognormal model. The Limma-trend and Limma-voom options are lognormal with shrinkage (Limma-trend is the same as the GSA default option, lognormal with shrinkage). Shrinkage options are recommended for small sample size designs; no random effects can be included when performing shrinkage. If there are numeric factors in the model, partial correlations cannot be reported on the numeric factors when shrinkage is performed. Limma-trend works well if the ratio of the largest library size to the smallest is not more than 3-fold; it is simple and robust for any type of data. Limma-voom is recommended for sequencing data when library sizes vary substantially, but it can only be invoked on data nodes normalized using the TMM, CPM, or Upper quartile methods, while Limma-trend can be applied to data normalized using any method (a base-R sketch of the two limma workflows follows this option list).
Report partial correlations: If the model has a numeric factor(s), when choosing Yes, partial correlation coefficient(s) of the numeric factor(s) will be displayed in the result table. When choosing No, partial correlation coefficients are not shown.
Data has been log transformed with base: shows the current scale of the input data for this task.
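As referenced above, the distinction mirrors the two workflows in the limma package itself; a hedged R sketch, assuming a `counts` matrix and a `design` model matrix:

```r
library(limma)
library(edgeR)

dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)  # e.g. TMM normalization

# limma-trend: log-CPM values, then empirical Bayes with a mean-variance trend
logCPM <- cpm(dge, log = TRUE, prior.count = 3)
fit_trend <- eBayes(lmFit(logCPM, design), trend = TRUE)

# limma-voom: per-observation precision weights, suited to variable library sizes
v <- voom(dge, design)
fit_voom <- eBayes(lmFit(v, design))
topTable(fit_voom, coef = 2)
```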
Since there is only one model for all features, there are no pie charts with design model and response distribution information. The Gene list table format is the same as in the GSA report.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS, B, 57, 289-300.
Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31: 2013-2035.
Auer, 2011, A two-stage Poisson model for testing RNA-Seq
Burnham, Anderson, 2010, Model selection and multimodel inference
Law C, Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 2014 15:R29.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology, 2010
Choose Descriptive statistics in the Statistics section of the toolbox
This will invoke the configuration dialog; use it to specify which calculation(s) will be performed on observations or features.
When the calculation is for observations (samples or cells), a drop-down option lets you use all the features in the input data node or a list of features. If you use a saved feature list, a check button selects whether matching the saved list to your data is case sensitive.
When the calculation is for features, the Group by drop-down list allows the statistics to be computed in each group separately.
Click the button to add more than one attribute; the result will be based on the groups from the interaction terms of the selected attributes.
The available statistics are listed in the left panel. Suppose $x_1, x_2, \ldots, x_n$ represent an array of numbers:
Coefficient of variation (CV): $CV = s/\bar{x}$, where $s$ is the standard deviation and $\bar{x}$ is the mean
Geometric mean: $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$
Max: $x_{\max}$, the largest value in the array
Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Median: when $n$ is odd, the median is $x_{((n+1)/2)}$; when $n$ is even, the median is $\left( x_{(n/2)} + x_{(n/2+1)} \right)/2$, where $x_{(k)}$ is the $k$-th smallest value
Median absolute deviation: $\mathrm{MAD} = \mathrm{median}(|x_i - m|)$, where $m = \mathrm{median}(x_1, \ldots, x_n)$
Min: $x_{\min}$, the smallest value in the array
Number of cells: Available when Calculate for is set to Features. Reports the number of cells whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cut-off value entered in the text box. The cut-off is applied to the values present in the input data node, i.e. if invoked on a non-normalised data node, the values are raw counts. For instance, use this option if you want to know the number of cells in which each feature was detected; possible filter: Number of cells whose value > 0.0 (see the sketch after this list)
Percent of cells: Available when Calculate for is set to Features. Reports the percentage of cells whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cut-off value entered in the text box.
Number of features: Available when Calculate for is set to Cells. Reports the number of features whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cut-off value entered in the text box. The cut-off is applied to the values present in the input data node, i.e. if invoked on a non-normalised data node, the values are raw counts. For example, use this option if you want to know the number of detected genes per cell; filter: Number of features whose value > 0.0
Percent of features: Available when Calculate for is set to Cells. Reports the fraction of features whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cut-off value entered in the text box.
Q1: 25th percentile
Q3: 75th percentile
Range: $x_{\max} - x_{\min}$
Standard deviation: $s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 }$, where $\bar{x}$ is the mean
Sum: $\sum_{i=1}^{n} x_i$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
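As referenced in the list above, the detection statistics have simple base-R equivalents; `counts` (a features-x-cells raw count matrix) and the "> 0" filter are assumed for illustration:

```r
n_cells_detected    <- rowSums(counts > 0)         # Number of cells
pct_cells_detected  <- 100 * rowMeans(counts > 0)  # Percent of cells
n_features_per_cell <- colSums(counts > 0)         # Number of features
pct_features        <- 100 * colMeans(counts > 0)  # Percent of features
```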
Left-click to select a measurement and drag it to the right panel one at a time, or mouse over a measurement and click the + button to move it to the right panel; then click Finish.
The output data node can be downloaded or visualized in Data Viewer:
We recommend normalizing your data prior to running Correlation analysis, but it can be invoked on any counts data node.
Click the counts data node
Click the Statistics section in the toolbox
Click Correlation
Choose the method to use for correlation analysis
When you select the Feature many-to-one correlation option and click Next, the task will correlate the selected numeric attribute(s) and/or feature(s) vs. every feature in the input data node, one pair at a time.
When multiple numeric factors are added, the correlation analysis is performed for each factor against each feature in the data node independently. If you are interested in particular features, use the Search features box to add one or more.
Select factor(s) or feature(s) and click Add factors to include in the statistical test.
Click Next
Optionally, apply a lowest coverage filter or configure the advanced settings
Click Finish to run
Correlation analysis produces a Correlation data node; double-click to open the task report which is similar to the ANOVA/LIMMA-trend/LIMMA-voom and GSA task reports and includes a table with features on rows and statistical results on columns.
Each numeric attribute includes p-value, adjusted p-value columns (FDR step up and/or Storey q-value if included), and a partial correlation value.
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default. If you check the Storey q-value, an extra column with q-values will be added to the report.
There are situations when a model estimation procedure does not fail outright but still encounters some difficulties. In this case, it can even generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Sets the type of correlation used to calculate the correlation coefficient and p-value. Options are Pearson (linear), Spearman (rank), Kendall (tau). Default is Pearson (linear).
The Similarity matrix task is only available on bulk count matrix data nodes. It computes the correlation of every sample (or feature) vs. every other sample (or feature). The result is a matrix with the same set of samples (or features) on rows and columns; each value in the matrix is a correlation coefficient, r.
Select whether the computation is on samples or features, and the correlation method (base-R equivalents are sketched after this list):
Pearson (linear correlation): $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Spearman (rank correlation): the Pearson correlation computed on the ranks of the values; with no ties, $\rho = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the ranks of $x_i$ and $y_i$
Kendall (rank correlation): $\tau = \dfrac{n_c - n_d}{n(n-1)/2}$, where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs
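For reference, all three are available in base R via cor(); `x`, `y`, and `mat` are assumed placeholders:

```r
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
cor(x, y, method = "kendall")

# Similarity matrix of samples: cor() on a features-x-samples matrix
# correlates its columns, yielding a samples-x-samples matrix of r values.
r_matrix <- cor(mat, method = "pearson")
```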
Click Finish to run the task. The output report of this task can be displayed as a heatmap and/or table in the Data Viewer.
Correlation across assays should be used to perform correlation analysis across different modalities (e.g. ATAC-Seq enriched regions vs. RNA-Seq expression) for multiomics data analysis. It performs a correlation analysis of every feature in one assay vs. every feature in the other assay. We recommend filtering the two count matrix data nodes to include only the features of interest, to reduce computation.
Select the data node to be compared to the node that the task has been invoked from using the Select data node button
Modify any parameters
Click Finish
Features within same chromosome: this option restricts the comparison to pairs of features located on the same chromosome
All features in one data node vs all features in the other data node: this option will perform the comparison using all combinations without location constraint
P-value: select a cut-off value for significance and only those pairs that meet the criteria will be reported
abs(Correlation coefficient): select a cutoff for reporting the absolute value of the correlation coefficient (represented by the symbol r) where a perfect relationship is 1 and no relationship is 0
Correlation across assays produces a Correlation pair list data node; double-click to open the table. The table can be sorted and filtered using the column titles.
Click View correlation plot to open the correlation plot for each feature pair.
The sample correlation plot is a data visualization used to compare all the features between two samples. Sample correlation can be performed on any count matrix data node, whether it contains raw counts or normalized counts. When the Sample correlation page opens, you will be asked to select two samples for comparison. The sample in the top box will be shown on the X-axis, while the sample in the bottom box will be shown on the Y-axis. Click on the sample names to select different samples and then hit Apply.
A scatterplot is displayed on the right. Each dot is a feature (gene/transcript/protein), and the expression values in the two samples can be read off the coordinate axes, in the same units as present in the data node. The Pearson correlation coefficient and regression slope are displayed in the upper-right corner of the plot.
Survival analysis is a branch of statistics that deals with modeling of time-to-event. In the context of “survival,” the most common event studied is death; however, any other important biological event could be analyzed in a similar fashion (e.g., spreading of the primary tumor or occurrence/relapse of disease). Survival analysis tries to answer questions such as: What is the proportion of a population who will survive past a certain time (i.e., what is the 5-year survival rate)? What is the rate at which the event occurs? Do particular characteristics have an impact on survival rates (e.g., are certain genes associated with survival)? Is the 5-year survival rate improved in patients treated by a new drug? Cox regression and Kaplan-Meier analysis are two techniques which are commonly used to assess survival analysis.
In survival analysis, the event should be well-defined with two levels and occur at a specific time. Because the primary outcome of the event is typically unfavorable (e.g., death, metastasis, relapse, etc.), the event is called a “hazard.” The hazard ratio is used to assess the likelihood of the event occurring while controlling for other co-predictors (co-variables/co-factors) if added to the model. In other words, the hazard ratio is how rapidly an event is experienced by comparing the hazard between groups. A hazard ratio greater than 1 indicates a shorter time-to-event (increase in the hazard), a hazard ratio less than 1 is associated with a greater time-to-event (reduction in the hazard), and a hazard ratio of 1 indicates no effect on time-to-event. For example, if the hazard ratio is 2, then there is twice the chance of occurrence compared to the other group. In cancer studies, a hazard ratio greater than 1 is considered a bad prognostic factor while a hazard ratio less than 1 is a good prognostic factor.
An important aspect of survival analysis is “censored” data. Censored data refers to subjects that have not experienced the event being studied. For example, medical studies often focus on survival of patients after treatment so the survival times are recorded during the study period. At the end of the study period, some patients are dead, some patients are alive, and the status of some patients is unknown because they dropped out of the study. Censored data refers to the latter two groups. The patients who survived until the end of the study or those who dropped out of the study have not experienced the study event "death" and are listed as "censored".
Cox regression (the Cox proportional-hazards model) tests the effects of factors (predictors) on survival time. Predictors that lower the probability of survival at a given time are called risk factors; predictors that increase the probability of survival at a given time are called protective factors. The Cox proportional-hazards model is similar to a multiple logistic regression that considers time-to-event rather than simply whether an event occurred or not. Cox regression should not be used for a small sample size because the events could accidentally concentrate in one of the cohorts, which will not produce meaningful results.
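In R, the analogous model is fit with the survival package; a minimal sketch, assuming per-sample `time`, `status` (1 = event, 0 = censored), a feature's `expression`, and an `age` co-predictor in a data frame `df`:

```r
library(survival)

# Cox proportional-hazards model: feature expression plus a co-predictor.
fit <- coxph(Surv(time, status) ~ expression + age, data = df)
summary(fit)  # hazard ratios appear as exp(coef), with confidence limits
```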
Open the Cox Regression task in the task menu under Statistics for any counts node.
Next, select the Time, Event, and Event status. Connected Multiomics will automatically guess factors that might be appropriate for these options. Click Next to proceed with the task.
The predictors (factors or variables) and co-predictors in the model must be defined. Co-predictors are numeric or categorical factors that will be included in the Cox regression model. Time-to-event analysis will be performed on features (e.g. genes) by default unless Use feature expression as predictor is unchecked. If it is unchecked, select a factor and click Add factors to model a different variable. With the default setting, Use feature expression as predictor, the user can Add factors to the model that act to explain the relationship for time-to-event (co-predictors) in addition to features. Choose Add interaction to add co-predictors with known dependencies. If factors are added here, they cannot be added as stratification factors. Click Next to proceed with the task.
Next, the user can define comparisons for the co-predictors if they have been added. Configure contrasts by moving factors into the numerator (e.g. experimental factor) or denominator (e.g. control factor / reference), choose Combine or Pairwise, and add the comparison which will be displayed below. Combine all numerator levels and combine all denominator levels in a single comparison or choose Pairwise to split all numerator levels and split all denominator levels into a factorial set of comparisons meaning every numerator will be paired with every denominator. Multiple comparisons from different factors can be added with Add comparison. Low value filter can be used to filter by excluding features; choose a filter or select none. Click Next to proceed with the task.
The user can select categorical factors to perform stratification if needed. Stratification is needed because the proportional hazards assumption holds only within each stratum, but not across the strata. When stratification factors are included, the proportional hazards assumption will hold for each combination of levels of the stratification factor; a separate submodel is estimated for each level combination and the results are aggregated. Click Finish to complete the task.
The results of Cox regression analysis provide key information to interpret, including:
Hazard ratio (HR): if the HR = 0.5 then half as many patients are experiencing the event compared to the control group, if the HR = 1 the event rates are the same in both groups, and if the HR = 2 then twice as many are experiencing an event compared to the control group.
HR limit: this is the confidence interval of the hazard ratio.
(e.g. If you have selected both a co-predictor and strata factor then a comparison using the co-predictors and Type III p-value for the co-predictor will be generated in the Cox regression report.)
The Kaplan-Meier task is used for comparing the survival curves among two or more groups of samples. The groups are defined by one or more categorical attributes (factors) specified by the user. As in the case of Cox Regression, it is possible to use feature expression data, if available. In that case, quantitative feature expression is converted into a feature-specific categorical attribute. Each combination of the attribute levels corresponds to a distinct group. If one selects three factors with 2, 3 and 5 levels, respectively, then the total count of compared groups is 2*3*5 = 30. Therefore, selecting too many factors and/or factors with many levels may not work, since the total number of samples may not be enough to fill all of the groups.
To perform Kaplan-Meier survival analysis, at least two pieces of information must be provided for each sample: time-to-event (a numeric factor) and event status (a categorical factor with two levels). Time-to-event indicates the time elapsed between the enrollment of a subject in the study and the occurrence of the event. Event status indicates whether the event occurred or the subject was censored (did not experience the event). The survival curve is not a set of straight lines connecting the points; instead, a staircase pattern is used. The event status determines the staircase pattern, where each drop in the staircase represents an event occurrence.
The Kaplan-Meier task begins similar to the Cox regression task, then differs when selecting categorical attributes to define the compared groups.
For each feature (e.g. gene), the expression values are sorted in ascending order and placed into B bins of (roughly) equal size. As a result, a feature-specific categorical attribute with B levels is constructed, which can be used by itself or in combination with other categorical attributes. For instance, for B = 2, we take a given feature and compute its median expression. The samples are separated into two bins, depending on whether the expression in the sample is below or above the median. If two bins are chosen, the bins are automatically labeled "Low" and "High", but the text box can be used to re-label the bins. The bins are feature-specific since this procedure is repeated for each feature separately.
For each group, the survival curve (aka survival function) is estimated using Kaplan-Meier estimator [1]. For instance, if one selects ER status which has two levels and we choose two feature expression bins, four survival curves are displayed in the Data Viewer. The Grouping configuration option can be used to split and modify the connections.
To see whether the survival curves are statistically different, Kaplan-Meier task runs Log-rank and Wilcoxon (aka Wilcoxon-Gehan) tests. The null hypothesis is that the survival curves do not differ among the groups (the computational details are available in [2]). When feature expression is used, the p-values are also feature specific. Select the step-plot icon under View to visualize the Kaplan-Meier survival curves for each gene.
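For reference, the same estimator and tests are available in R's survival package; a minimal sketch, assuming `df` contains `time`, `status`, a feature's `expression`, and an `ER` column:

```r
library(survival)

# Bin one feature's expression at its median into "Low"/"High" groups.
df$grp <- ifelse(df$expression > median(df$expression), "High", "Low")

km <- survfit(Surv(time, status) ~ grp, data = df)  # Kaplan-Meier estimator
plot(km)                                            # staircase survival curves

survdiff(Surv(time, status) ~ grp, data = df)               # log-rank test
survdiff(Surv(time, status) ~ grp + strata(ER), data = df)  # stratified log-rank
```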
Like in Cox Regression task, it is possible to choose stratification factor(s), but the purpose and meaning of stratification are not the same as in Cox Regression. Suppose we want to compare the survival among the four groups defined by the two levels of ER status and the two bins of feature expression. We can select the two factors on “Select group factor(s)” page. In that case, the reported p-values will reflect the statistical difference among the four survival curves that are due to both ER status and the feature expression. Imagine that our primary interest is the effect of feature expression on survival. Although ER status can be important and therefore should be included in the model, we want to know whether the effect of feature expression is significant after the contribution of ER status is taken into account. In other words, the goal is to treat ER status as a nuisance factor and the binned feature expression as a factor of interest.
In qualitative terms, it is possible to obtain an answer if we group the survival curves by the level of ER status. This can be achieved in the Data Viewer by choosing Grouping > Split by under Configure. That makes it easy to compare the survival curves that have the same level of ER status and avoid the comparison of curves across different levels of ER status.
If in the plot, we see one or more subplots where the survival curves differ a lot, that is evidence that the feature expression affects the survival even after adjusting for the contribution of ER status. To obtain an answer in terms of adjusted Log-rank and Wilcoxon p-values, one should deselect ER status as a “group factor” and mark it as a stratification factor instead.
The computation of stratification adjusted p-values is elaborated in [2].
Suppose that when the feature expression and ER status are selected as “group factors”, the Log-rank p-value is 0.001, and when ER status is marked as a stratification factor, the p-value becomes 0.70. This means that ER status is very useful for explaining the difference in survival, while the feature factor is of no use if ER status is already in the model. In other words, the marginal contribution of the binned expression factor is low.
If more than two attributes are present, it is possible to measure the marginal contribution of any single factor in a similar manner: the attribute of interest should be selected as “group factor” and the other attributes should be marked as stratification factors. There is no limit on the count of factors that can be selected as “group” or stratification, except that all of the selected factors are involved in defining the groups and the groups should contain enough samples (at least, be non-empty) for the results to be reliable.
[1] Kaplan-Meier (product limit) estimator:
[2] Klein, Moeschberger (1997), Survival Analysis: Techniques for Censored and Truncated Data. ISBN-13: 978-0387948294




































Poisson regression is a type of generalized linear model (GLM). It assumes a Poisson distribution, in which the mean equals the variance; this allows differential expression to be detected for groups with no replicates, e.g. one treated sample vs. one control sample.
Negative binomial regression is also a GLM. It is used for overdispersed data, where the variance is greater than the mean.
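In R, the two models can be fit as follows; the data frame `df` with `count`, `condition`, and `lib_size` columns is an assumed placeholder:

```r
library(MASS)

# Poisson regression: variance assumed equal to the mean.
fit_pois <- glm(count ~ condition + offset(log(lib_size)),
                family = poisson, data = df)

# Negative binomial regression: allows variance greater than the mean.
fit_nb <- glm.nb(count ~ condition + offset(log(lib_size)), data = df)
```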
GSA stands for gene specific analysis; its goal is to identify the statistical model that is best for each gene among all the selected models, and then use that best model to calculate the p-value and fold change.
The first step of GSA is to choose which attributes to include in the test. All sample attributes, including numeric and categorical attributes, are displayed in the dialog, so use the check button to select between them. An experiment with two attributes, Cell type (with groups A and B) and Time (time points 0, 5, 10), is used as an example in this section.
Click Next to display the levels of each attribute to be selected for sub-group comparisons (contrasts).
To compare A vs. B, select A for Cell type on the top, B for Cell type on the bottom and click Add comparison. The specified comparison is added to the table below
To compare Time point 5 vs. 0, select 5 for Time on the top, 0 for Time on the bottom, and click Add comparison.
To compare cell types at a certain time point, e.g. time point 5, select A and 5 on the top, and B and 5 on the bottom. Thereafter click Add comparison.
Multiple comparisons can be computed in one GSA run; the example below shows the above three comparisons added to the computation.
In terms of the design pool, i.e. the choices of model designs to select from, the two factors in this example lead to seven possibilities:
Cell type
Time
Cell type, Time
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
Cell type, Time, Cell type * Time
In GSA, if a 2nd-order interaction term is present in the design, then all first-order terms must be present. This means that if the Cell type * Time interaction is present, the two factors must be included in the model. In other words, the following designs are not considered:
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
If a comparison is added, models that do not contain the comparison factors are also eliminated. E.g. if a comparison of Cell type A vs. B is added, only designs that include the Cell type factor will be in the computation. These are:
Cell type
Cell type, Time
Cell type, Time, Cell type * Time
The more comparisons on different terms are added, the fewer models will be included in the computation. If the following comparisons are added in one GSA run:
A vs B (Cell type)
5 vs 0 (Time)
only the following two models will be computed:
Cell type, Time
Cell type, Time, Cell type * Time
If comparisons on all the three terms are added in one GSA run:
A vs B (Cell type)
5 vs 0 (Time)
A*5 vs B*5 (Cell type * Time)
then only one model will be computed:
Cell type, Time, Cell type * Time
Click on Configure to customize Advanced options.
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default (1). Other options, like Storey q-value (2) and Bonferroni, are provided; select one method at a time. None means no multiple test correction will be performed.
FDR step-up:
Suppose there are $n$ p-values ($n$ is the number of features). The p-values are sorted in ascending order, $p_{(1)} \le p_{(2)} \le \ldots \le p_{(n)}$, and $m$ represents the rank of a p-value. The calculation compares $p_{(m)} \cdot (n/m)$ with the specified alpha level, and the cut-off p-value is the last one for which the product is less than the alpha level. The goal of the step-up method is to find:

$$K^* = \max\left\{ k : p_{(k)} \cdot \frac{n}{k} \le \alpha \right\}$$

Define the step-up value as:

$$S_k = \min_{k \le j \le n} \left\{ p_{(j)} \cdot \frac{n}{j} \right\}$$

Then an equivalent definition for $K^*$ is:

$$K^* = \max\{ k : S_k \le \alpha \}$$

So when $k = n$, the step-up value is $S_n = p_{(n)}$.

In order to find $K^*$, start with $S_n$ and then go up the list until you find the first step-up value that is less than or equal to $\alpha$.
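This step-up construction is exactly what p.adjust() implements for the "BH" method; a small R sketch of the equivalence:

```r
# Manual FDR step-up, matching p.adjust(p, method = "BH").
fdr_step_up <- function(p) {
  n <- length(p)
  o <- order(p)                 # ascending order: ranks m = 1..n
  s <- p[o] * n / seq_len(n)    # p_(m) * (n/m)
  s <- rev(cummin(rev(s)))      # step-up value S_k = min over j >= k
  pmin(s, 1)[order(o)]          # map back to the original order
}

p <- c(0.001, 0.02, 0.03, 0.5)
all.equal(fdr_step_up(p), p.adjust(p, method = "BH"))  # TRUE
```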
Storey q-value:
The q-value is the minimum "positive false discovery rate" (pFDR) that can occur when rejecting a statistic.
For an observed statistic $T = t$ and a nested set of rejection regions $\{C\}$:

$$q(t) = \inf_{\{C \,:\, t \in C\}} \mathrm{pFDR}(C)$$
Bonferroni:
Suppose there are $n$ p-values ($n$ is the number of features); the expected number of Type I errors would be given by $n\alpha$, thus the significance level of each individual test should be adjusted to $\alpha / n$. Alternatively, the p-values can be adjusted as $p_B = p \cdot n$, where $p_B$ is the Bonferroni-corrected p-value. If $p_B$ is greater than 1, it is set to 1.
This section configures how the best model is selected for a feature. There are two options for Model selection criterion: AICc (Akaike Information Criterion corrected) and AIC (Akaike Information Criterion). AICc is recommended for small sample sizes, while AIC is recommended for medium and large sample sizes (3). Note that as the sample size grows from small to medium, AICc converges to AIC. Taking the AICc/AIC value into account, GSA considers the model with the lowest information criterion as the best choice.
In the results, the best model's Akaike weight is also generated. The model's weight is interpreted as the probability that the model would be picked as the best if the study were reproduced. The range of the Akaike weight is from 0 to 1, where 1 means the best model is clearly superior to the other candidates in the model pool; if the best model's Akaike weight is close to 0.5, on the other hand, the best model is likely to be replaced by other candidates if the study were reproduced. One still uses the best-shot model, but the accuracy of the best shot is fairly low.
The default value for Enable multimodel approach is Yes. It means that the estimation will utilize all models in the pool by assigning weights to them based on AIC or AICc. If No is selected instead, the estimation is based on only one best model which has the smallest AIC or AICc.
The output p-value will be different depending on the selected option for multimodel, but the fold change is the same. Multimodel approach is recommended when the best model's Akaike weight is not close to 1, meaning that the best model is not compelling.
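To make the criterion and the weights concrete, here is a hedged R sketch for one feature; `df` with columns `y`, `cell_type`, and `time` is an assumed placeholder:

```r
# AICc adds a small-sample correction to AIC; Akaike weights estimate the
# probability that each candidate model would be picked as best on a replicate.
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")
  n <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

fits <- list(m1 = lm(y ~ cell_type,        data = df),
             m2 = lm(y ~ cell_type + time, data = df),
             m3 = lm(y ~ cell_type * time, data = df))

ic    <- sapply(fits, aicc)
delta <- ic - min(ic)
w     <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights
```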
By default, Display shrinkage plots is set to No in the report. Choose Yes to display shrinkage plots in the report.
There are situations when a model estimation procedure does not fail outright, but still encounters some difficulties. In this case, it can even generate p-values and fold changes for the comparisons, but those values are not reliable and can be misleading. It is recommended to use only reliable estimation results, so the default option for Use only reliable estimation results is set to Yes.
Connected Multiomics provides five response distribution types for each design model in the pool, namely:
Normal
Lognormal (the same as ANOVA task)
Lognormal with shrinkage (the same as limma-trend)
Negative binomial
Poisson
We recommend using the lognormal with shrinkage distribution (the default); an experienced user may want to click Custom to configure the model type and p-value type.
If multiple distribution types are selected, then the number of total models that is evaluated for each feature is the product of the number of design models and the number of distribution types. In the above example, suppose we have only compared A vs B in Cell type, then the design model pool will have the following three models:
Cell type
Cell type, Time
Cell type, Time, Cell type * Time
If we select Lognormal with shrinkage and Negative binomial, i.e. two distribution types, the best model fit for each feature will be selected from 3 * 2 = 6 models using AIC or AICc.
The design pool can also be restricted by Min error degrees of freedom. When Model types configuration is set to Default, this is automated as follows: it is desirable to keep the error degrees of freedom at or above six, so we automatically set the option to the largest k, 0 <= k <= 6, for which admissible models exist. An admissible model is one that can be estimated given the specified contrasts. In the above example, when we compare A vs. B in Cell type, there are three possible design models. The error degrees of freedom of the model Cell type are the largest, and the error degrees of freedom of the model Cell type, Time, Cell type * Time are the smallest:
k(Cell type) > k(Cell type, Time) > k (Cell type, Time, Cell type*Time)
If the sample size is big, k >=6 in all three models, all the models will be evaluated and the best model will be selected for each feature. However, if the sample size is too small, none of the models will have k >=6, then only the model with maximal k will be used in the calculation. If the maximal k happens to be zero, we are forced to use Poisson response distribution only.
There are two types of p-value: F and Wald. Poisson, negative binomial, and normal models can generate p-values using either Wald or F statistics. Lognormal models always employ F statistics; the more replicates in the study, the smaller the difference between the two options. When there are no replicates, only Poisson can be used, and the p-value is generated using the Wald statistic.
Note: Connected Multiomics keeps track of the log status of the data, and regardless of whether GSA is performed on logged data or not, the LSMeans, ratio, and fold change are always calculated in linear scale. Ratio is the ratio of the two LSMeans of the two groups in the comparison (left is the numerator, right is the denominator). Fold change is converted from ratio: when the ratio is greater than 1, the fold change is the same as the ratio; when the ratio is less than 1, the fold change is -1/ratio. In other words, the fold change value is always >= 1 or <= -1; there is no fold change value between -1 and 1. When the LSMean of the numerator group is greater than that of the denominator group, the fold change is greater than 1; when it is less, the fold change is less than -1; when the two groups are the same, the fold change is 1. Logratio is the log2-transformed ratio, which is equivalent to the log fold change in some other software.
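The ratio-to-fold-change conversion described above is easy to express in R; a small sketch with hypothetical LSMean values:

```r
# Fold change from two linear-scale LSMeans: the ratio when ratio >= 1,
# and -1/ratio when ratio < 1 (so no values fall between -1 and 1).
fold_change <- function(lsmean_num, lsmean_den) {
  ratio <- lsmean_num / lsmean_den
  ifelse(ratio >= 1, ratio, -1 / ratio)
}

fold_change(8, 2)  #  4  (up-regulated)
fold_change(2, 8)  # -4  (down-regulated)
fold_change(5, 5)  #  1  (no change)
```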
If there are multiple design models and multiple distribution types included in the calculation, the fraction of genes using each model and type will be displayed as pie charts in the task result.
Feature list with p-value and fold change generated from the best model selected is displayed in a table with other statistical information. By default, the gene list table is sorted by the first p-value column.
The following information is included in the table by default:
Feature ID information: if transcript level analysis was performed, and the annotation file has both transcript and gene level information, both gene ID and transcript ID are displayed. Otherwise, the table shows only the available information.
Each contrast outputs the p-value, FDR step-up p-value, ratio and fold change in linear scale, and the LSMean of each group in the comparison in linear scale
When you click the Optional columns link in the top-right corner of the table, extra information will be displayed in the table when selected:
Total counts: sum of the reads across all the observations from the input data
Maximum counts: maximum number of reads counts across all the observations from the input data.
Geometric mean: geometric mean value of the input counts across all observations.
Arithmetic mean: arithmetic mean value of input counts across all observations.
Click the View extra details report icon under the View section to get more statistical information about the feature. In cases where the task does not fail but certain statistical information is not generated, e.g. the p-value and/or fold change of a certain comparison is missing for some or all features, click this icon and mouse over the red exclamation icon for more information.
By clicking on Optional columns, you can retrieve more statistics result information, e.g. Average coverage which is the geometric mean of normalized reads in linear scale across all the samples; fold change lower/upper limits generated from 95% confidence interval; feature annotation information if there are any more annotation fields in the annotation model you specified for quantification, like genomic location, strand information etc.
On the left of each contrast header there is a volcano plot icon. Select it to display the volcano plot for the chosen contrast.
The feature list filter panel is on the left of the table. Click on the black triangle to collapse and expand the panel.
Select the check box of the field and specify the cutoff by typing directly or using the slider. Press Enter to apply. After the filter has been applied, the total number of included features will be updated on the top of the panel (Result).
The filtered result can be saved into a filtered data node by selecting the Generate filtered node button at the bottom of the filter panel. Save as managed list can save the list of features as a managed list.
Selecting the Download button at the upper-left corner of the table downloads the table as a text file to the local computer.
If any shrinkage methods were selected for GSA, and display shrinkage plots is set to Yes in the Advanced option configure dialog, a shrinkage plot is generated in the report.
The X-axis shows the log2 value of average coverage. The plot helps to determine the threshold for low-expression features: if there is an increase before a monotone decreasing trend on the left side of the plot, you need to set a higher threshold on the low-expression filter.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS, B, 57, 289-300.
Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31: 2013-2035.
Auer, 2011, A two-stage Poisson model for testing RNA-Seq
Burnham, Anderson, 2010, Model selection and multimodel inference
Law C, Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 2014 15:R29.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology, 2010





















