

Interactive Demos

Explore multiomics data with an interactive click-through demo of Connected Multiomics. Click to expand the selection for easier viewing.

Discover Illumina Multiomics Software

About Illumina Connected Multiomics

Illumina Connected Multiomics is a cloud-based software platform designed for biologists to perform tertiary analysis of multiomics data for research purposes. It enables users to organize and manage biological data into studies, apply statistical methods, reference knowledge sources for biological interpretation, correlate results across various omics data types and modalities, and deliver visualization tools to support result interpretation. This functionality streamlines the process from samples to multiomics insights, accelerating data interpretation and publication efforts.

Additional Multiomics Software from Illumina

Illumina Multiomics Software offers a comprehensive range of solutions to streamline your journey from sample collection to multiomic insights. Discover assays designed for protein analysis, spatial data, single-cell RNA data, and more.

Our powerful study management tool helps organize your data from ingestion to analysis, while our advanced analysis platform generates actionable insights to drive your research forward.

Select your application of interest below to learn about how to use it.

Explore our Interactive Demos for an introductory experience of available Multiomics workflows.

You can search our help documentation or ask questions with AI-generated answers using the search box at the top of the page.

Navigate and explore using the left panel.

Interactive Demos
  • Provides normalized protein counts for NGS-based proteomics

  • Processes single-cell RNA-Seq data into gene expression matrices

  • Transforms miRNA data into small RNA count matrices

Batch removal

When a study contains multiple libraries, the data might contain variability due to technical differences (e.g., sequencing machine, library prep kit) in addition to biological differences (e.g., treatment, genotype). Batch removal is essential for removing this technical noise so that biological variation can be discovered. The available methods are listed below; a minimal sketch of the general linear model approach follows the list.

  • General linear model

  • Harmony

  • Seurat3 integration
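As a rough illustration of the general linear model approach, the sketch below regresses a categorical batch label out of each feature with ordinary least squares and keeps the residuals plus the intercept. This is a minimal sketch of the idea, assuming a hypothetical features × samples matrix of log-normalized values; it is not ICM's implementation.

```python
import numpy as np
import pandas as pd

def remove_batch_glm(expr: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Regress a categorical batch effect out of each feature.

    expr:  features x samples matrix (e.g., log-normalized counts)
    batch: batch label per sample (same order as expr columns)
    """
    # One-hot design matrix with an intercept column
    design = pd.get_dummies(batch, drop_first=True).astype(float)
    design.insert(0, "intercept", 1.0)
    X = design.to_numpy()                     # samples x coefficients
    Y = expr.to_numpy().T                     # samples x features
    # Ordinary least squares fit for all features at once
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    batch_effect = X[:, 1:] @ beta[1:, :]     # fitted batch component
    corrected = (Y - batch_effect).T          # back to features x samples
    return pd.DataFrame(corrected, index=expr.index, columns=expr.columns)
```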

Methylation analysis

Connected Multiomics offers tools to perform analysis on methylation array data and Illumina 5-base DNA Prep data.

  • Detect Differential Methylation

Analysis Functionality

View Studies

Study Overview

A Study is used to manage your samples, metadata, comparison groups, and analyses for a specific set of data.

Card View

In the Studies overview screen, each study is represented as a card. The study card displays the following information:


To switch between card view and list view, click the icons in the top right.

List View

In list view, you'll see a list of all studies in your workgroup. To customize the columns displayed, click the Columns tab on the right.

To filter studies based on specific attributes, click the Filters tab.

Click the refresh icon in the top right to clear your filters and return to the full list.

You can also filter individual columns by clicking the icon next to each column header.

For more options, click the three dots.

Enter Study

When entering a study, you can see study information, data metrics, and recent analyses. Studies can be easily updated by adding sample data and metadata. This section will cover:

  • Overview

  • Samples

  • Sample Metadata

Generate group cell counts

If a single cell data node contains cell attribute information, e.g., clustering results, classifications, or imported attributes, a counts-type data node containing the number of cells from each attribute group for each sample can be generated and used for downstream analysis.

To invoke Generate group cell counts:

  • Click a single cell count data node with cell-level attribute information

  • Click Pre-analysis tools in the toolbox

  • Click Generate group cell counts

  • Select the attribute to group the cells by from the Group by drop-down menu and click the + button

  • Click Finish

A group cell counts node will be generated. The data node contains a matrix of cell counts in each sample for each group. You can view the counts results in the Group cell counts report.

The Group cell counts data node is a counts type data node and downstream analysis tasks, such as normalization, PCA, and ANOVA, can be used to analyze the group cell counts data.
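Conceptually, the task is a cross-tabulation of cells by sample and attribute group. A minimal pandas sketch with a hypothetical cell-level table (ICM's actual inputs and outputs differ):

```python
import pandas as pd

# Hypothetical cell-level table: one row per cell, with its sample and
# a categorical attribute (e.g., a graph-based cluster assignment).
cells = pd.DataFrame({
    "sample":  ["S1", "S1", "S1", "S2", "S2", "S2"],
    "cluster": ["T",  "B",  "T",  "T",  "NK", "B"],
})

# Counts-type matrix: one row per sample, one column per attribute group
group_counts = pd.crosstab(cells["sample"], cells["cluster"])
print(group_counts)
# cluster  B  NK  T
# sample
# S1       1   0  2
# S2       1   1  1
```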

QA/QC

Connected Multiomics contains a number of quality control tools and reports that can be used to evaluate the current status of your analysis and decide on downstream steps. Quality control tools are organized under the Quality Assurance / Quality Control (QA/QC) section of the context-sensitive menu and are available for different types of data nodes.

This section will illustrate:

  • Feature distribution

  • Single-cell QA/QC

  • Cell barcode QA/QC

In addition to the tools listed above, many other functions can also be interpreted in the sense of quality control, for instance principal components analysis, hierarchical clustering (at the sample level), the variant detection report, and the quantification report.

Impute missing values

This task replaces missing values in the data with estimates computed by the selected method.

First select whether the computation is based on samples/cells or on features, then click Finish to replace the missing values. Some methods generate the same results regardless of which transform option is selected (e.g., constant value); others will generate different results (a scikit-learn sketch follows the list):

  • Constant value: specify a value to replace the missing data

  • Maximum: use the maximum value of samples/cells or features, depending on the transform option

  • Mean: use the mean value of samples/cells or features, depending on the transform option

  • Median: use the median value of samples/cells or features, depending on the transform option

  • Minimum: use the minimum value of samples/cells or features, depending on the transform option

  • K-nearest neighbor (mean): specify the number of neighbors (N); the Euclidean metric is used to find neighbors, and the mean of the N neighbors replaces the missing data

  • K-nearest neighbor (median): specify the number of neighbors (N); the Euclidean metric is used to find neighbors, and the median of the N neighbors replaces the missing data
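As a rough guide to these options, the scikit-learn sketch below reproduces several of them. Note that scikit-learn imputes column-wise, so the transpose controls whether the computation is based on samples/cells or on features, and it offers no KNN-median variant. This is an illustration, not ICM's implementation.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy features x samples matrix with missing values
X = np.array([[1.0,    np.nan, 3.0],
              [4.0,    5.0,    np.nan],
              [np.nan, 8.0,    9.0]])

# Mean of each feature (SimpleImputer works column-wise, so transpose
# to impute per feature; drop the transposes to impute per sample).
by_feature_mean = SimpleImputer(strategy="mean").fit_transform(X.T).T

# Constant value: same result no matter which transform option is used
constant = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# K-nearest neighbor (mean): mean of N neighbors found with nan-aware
# Euclidean distances; scikit-learn has no KNN-median counterpart.
knn_mean = KNNImputer(n_neighbors=2).fit_transform(X)
```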

Split by attribute

The Split by attribute task splits a data node into separate nodes based on the groups of a categorical attribute; each output data node includes only the samples/cells from one group. This is a more efficient way to filter samples/cells when you plan to perform downstream analysis separately on each and every group of an attribute.

Click the data node, select Split by attribute from the Filtering section of the task menu, and select the attribute to split the data on.

The result of the Split by attribute task will be separate data nodes, each containing the samples/cells from one group (e.g., two data nodes, one per gender group, when splitting on a gender attribute).
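A minimal pandas sketch of the idea, assuming a features × samples matrix and a per-sample attribute (hypothetical names, not ICM's API):

```python
import pandas as pd

expr = pd.DataFrame(
    {"S1": [1, 4], "S2": [2, 5], "S3": [3, 6]},
    index=["geneA", "geneB"],
)
sex = pd.Series({"S1": "Female", "S2": "Male", "S3": "Female"}, name="Sex")

# One sub-matrix ("data node") per group of the categorical attribute
nodes = {group: expr[members.index] for group, members in sex.groupby(sex)}
print(nodes["Female"].columns.tolist())  # ['S1', 'S3']
```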

Statistics

Connected Multiomics provides rigorous statistical methods to apply to your data. Tasks in this section include:

  • Differential analysis

  • Descriptive statistics

  • Correlation

Filtering

Connected Multiomics has the flexibility to subsample your data for further downstream analyses. Filter data by:

  • Filter features

  • Filter samples/cells

  • Split by attribute

Sample Groups

A sample group refers to a collection or subset of samples that are grouped together based on shared characteristics or attributes.

Create Sample Group

To create a sample group, filter your samples down to only those you want to be part of the sample group. You can filter your data by clicking the filter icon at the top of each visible column or by clicking the Filters icon above the table.

Enter Analysis

Click into your analysis to see details about your analysis steps.

Analyses

Within an analysis in a study, you'll see a depiction of your analysis pipeline, represented by data nodes and task nodes connected by arrows. Each rectangle represents a task, and each circle represents an output. You can click on each node to view more details in the right-side toolbox.

Double-click a node to see the task details in a separate window.

Samples

In the Samples tab, you'll see a list of samples that were added to the study from the ingested sample files, along with their associated metadata attributes. You can search for a sample using the search bar on the left or filter data to create your desired sample groups.

The Columns icon allows you to customize which columns you want to view. Click on a column to add it to the table, and click again to remove it. Click the refresh icon to refresh the sample view after adding more data or to clear your filters and reset to the default view.

You can filter your data by clicking the filter icon at the top of each visible column or by clicking the Filters icon above the table to filter all columns based on metadata attributes. This is useful when creating sample groups based on specific attributes. You can also drag and drop columns to reorder them.

Create Study

To create a new study, click the button in the top right and follow the steps below.

1

Enter a Study Name

Enter a name for your study between 1 and 255 characters in length (including special characters). The study name can be updated later.

2

View Analyses Across Studies

Click on the Analyses tab in the left panel to see a list of all analyses within your workgroup, including those across various studies. Use the Columns and Filters tabs on the right to adjust and refine your view. You can also search for specific analyses using the search bar in the top right. Additionally, you can perform actions on your analyses, such as renaming or deleting an analysis.

Analyses can only be created from within a study. See the section on creating an analysis for more information.

Normalize to baseline

If your experimental design includes a sample or a group of samples serving as a baseline control, you can normalize the experimental samples by subtracting or dividing by the baseline sample(s) using the Normalize to baseline task. For example, in PCR experiments, the delta Ct values of control samples are subtracted from the delta Ct values of experimental samples to obtain delta-delta Ct values for the experimental samples.

The Normalize to baseline task is available in the Normalization and Scaling section of the context-sensitive menu upon selection of any count matrix data node.

There are three options for choosing the baseline samples:

  • use all samples

  • use a group

  • use matched pairs

Detect Differential Methylation

The Detect Differential Methylation task enables users to perform differential testing on methylation array data and Illumina 5-base DNA Prep data. The task converts Beta-values to M-values (see the sketch below) and uses these to perform ANOVA differential expression analysis.
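The Beta-to-M conversion is the standard logit-style transform, M = log2(β / (1 − β)). A minimal numpy sketch of the conversion (the eps guard against Beta-values of exactly 0 or 1 is an assumption, not necessarily how ICM handles boundary values):

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Convert methylation Beta-values (0..1) to M-values.

    M = log2(beta / (1 - beta)); eps guards against division by
    zero at Beta-values of exactly 0 or 1.
    """
    b = np.clip(np.asarray(beta, dtype=float), eps, 1 - eps)
    return np.log2(b / (1 - b))

print(beta_to_m([0.1, 0.5, 0.9]))  # approx [-3.17, 0.00, 3.17]
```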

Running Detect Differential Methylation

  • Click to select a methylation data node with beta values.

Annotation/Metadata

This section has tools that are useful for managing and understanding single cell data, especially for downstream analysis. To invoke Annotation/Metadata tools, click on any Single cell counts data node. These include the following tasks:

  • Annotate cells

  • Annotate features

  • Publish cell attributes to project

Annotate cells

If you have attribute information about your cells, you can use the Annotate cells task in Connected Multiomics to apply this information to the data. Once applied, these can be used like any other attributes, and thus can be used for cell selection, classification and differential analysis.

To run Annotate cells:

  • Click a Single cell counts data node

  • Click the Annotation/Metadata section in the toolbox

  • Click Annotate cells

Impute low expression

Single cell RNA-seq gene expression counts are zero-inflated due to inefficient mRNA capture. This task is based on MAGIC [1] (Markov Affinity-based Graph Imputation of Cells) and recovers gene expression lost due to drop-out. The method is limited to input data nodes with up to 50K cells.

To invoke this task, click on a normalized data node with fewer than 50K cells. The task first computes PCA and uses the specified number of PCs for the imputation.

Click Finish to run the task. It outputs a low-expression-imputed matrix in the output report node, which can be used for downstream analyses such as differential analysis and visualization.
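For intuition, the sketch below captures the diffusion idea behind MAGIC in a greatly simplified form: build a kNN graph on the PCs, row-normalize it into a Markov transition matrix, and diffuse the expression values a few steps. It omits MAGIC's adaptive kernel and other details, so treat it as an illustration only, not the method's actual implementation.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def magic_like_impute(pcs, expr, k=15, t=3):
    """Greatly simplified diffusion imputation in the spirit of MAGIC.

    pcs:  cells x PCs array; expr: cells x genes normalized counts.
    """
    A = kneighbors_graph(pcs, n_neighbors=k, mode="connectivity")
    A = np.asarray((A + A.T).todense()) / 2       # symmetrized affinities
    M = A / A.sum(axis=1, keepdims=True)          # Markov normalization
    return np.linalg.matrix_power(M, t) @ expr    # t diffusion steps
```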

References

  1. Dijk D et al. MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data

Cell metadata

Cell-level attributes can be used for filtering, cell selection, classification, and differential analysis. Data nodes in the Analyses pipeline that contain this cell-level information (e.g., the Graph-based clusters data node) can be published to the project and will then be available in Study > Analysis > Metadata > Manage under Cell attributes, where they can be modified and reordered.

Troubleshooting

If the task report is produced but the results are missing for some features (represented by "?"), something may have gone wrong with the estimation procedure. To better understand this, use the information available in the Extra details report. This type of information is present for many tasks, including Differential Analysis and Survival Analysis.

Click the Extra details report for the feature of interest to display the Extra details report.

When the estimation procedure fails, a red triangle will be present next to the information criteria value. Hover over the triangle to see a detailed error message.

In many cases, estimation failure is due to low expression; filtering out low-expression features or choosing an appropriate normalization method will resolve the issue.

Sometimes the estimation results are not missing, but the reported values look inadequate. In this case, the Extra details report may show that the estimation procedure generated a warning, and the triangle is yellow. To remove suspicious results from the report, set Use only reliable estimation results to Yes in the Advanced Options; the warnings will then be treated the same way as estimation failures.

SVD

To analyze scATAC-seq data, Connected Multiomics introduced a new technique, LSI (latent semantic indexing) [1]. LSI combines term frequency-inverse document frequency (TF-IDF) normalization with singular value decomposition (SVD), returning a reduced-dimension representation of a matrix. Although SVD and principal components analysis (PCA) are different techniques, they are closely connected: PCA can be computed via the SVD. For users who are more familiar with scRNA-seq, you can think of SVD as analogous to the output of PCA. Similarly, the statistical interpretation of singular values is in terms of the variance in the data explained by the various components: the singular values produced by the SVD are ordered from largest to smallest and, when squared, are proportional to the amount of variance explained by a given singular vector.
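A minimal LSI sketch using scikit-learn on a toy binary accessibility matrix (scikit-learn's TF-IDF weighting differs in detail from the scATAC-seq variants, so this illustrates the technique rather than ICM's exact computation):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# Toy cells x peaks binary matrix standing in for scATAC-seq counts
rng = np.random.default_rng(0)
X = csr_matrix((rng.random((500, 2000)) < 0.05).astype(float))

# LSI = TF-IDF normalization followed by a truncated SVD
tfidf = TfidfTransformer().fit_transform(X)
svd = TruncatedSVD(n_components=30, random_state=0)
embedding = svd.fit_transform(tfidf)   # cells x components

# Singular values are ordered largest to smallest; squared, they are
# proportional to the variance explained by each component.
print(svd.singular_values_[:5])
print(svd.explained_variance_ratio_[:5])
```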

The SVD task can be invoked from the Exploratory analysis section by clicking any single cell counts data node. We recommend running SVD on normalized data, particularly the TF-IDF normalized counts for scATAC-seq analysis.

To run the SVD task:

  • Click a single cell counts data node

Merge matrices

In complex projects, different data matrices (e.g., observations on rows and features on columns) need to be merged to achieve the analysis goals. For example, if two cell populations were identified on separate branches of the analysis pipeline, their expression matrices have to be combined before any joint downstream steps. Alternatively, if two assays (gene expression and protein expression) were performed on the same cells, the expression matrices have to be merged for joint analysis.

The Merge matrices task is located in the Pre-analysis tools section of the toolbox and can handle two scenarios: Merge cells/samples and Merge features. To start, select the first data node on the pipeline (e.g., single cell counts) and then select the Merge matrices task.

Downsample cells

The Downsample cells task randomly downsamples the number of cells in a single cell data set. It can be used to reduce large single cell datasets to small, manageable sizes for quick analysis. Another use case is a study with multiple samples, each having a different number of cells: Downsample cells can randomly select an equal number of cells for all the samples in the study. By default, the sample with the minimum number of cells sets the number of cells to be selected from the other samples; this default can be changed to a preferred number by the user. If the number selected by the user is greater than the number of cells in one or more samples, those samples will not be downsampled and all of their cells will be returned. If the number is greater than the number of cells in all the samples, then none of the samples will be downsampled.

To run a downsample task, first click on a single cell count data node, then go to the Filtering section and select the Downsample cells task to open the dialog.

The minimum number of cells in any of the samples is used in the default settings; alternatively, you can specify the percentile to reduce to for each sample. Click Finish to run the task.
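A minimal pandas sketch of the default behavior (downsample every sample to the size of the smallest one); the table is hypothetical, not ICM's API:

```python
import pandas as pd

cells = pd.DataFrame({"sample": ["S1"] * 500 + ["S2"] * 1200 + ["S3"] * 300})

# Default: the smallest sample sets the target size for all samples
target = cells["sample"].value_counts().min()   # 300

downsampled = cells.groupby("sample").sample(n=target, random_state=0)
print(downsampled["sample"].value_counts())     # 300 cells per sample
```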

Compute biomarkers

This task can be invoked from a count matrix data node or a clustering task report (Statistics > Compute biomarkers). It performs Student's t-tests on the selected attribute, comparing one subgroup at a time against all the others combined. By default, the up-regulated genes are reported as biomarkers.

Compute biomarker dialog

In the set-up dialog, select the attribute from the drop-down list. The available attributes are the categorical attributes visible on the Data tab (i.e., project-level attributes) as well as data node-specific annotations.

By default, the result outputs the features that are up-regulated by at least a 1.5 fold change (in linear scale) for each subgroup compared to the others.
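A sketch of the one-vs-rest idea with SciPy (linear-scale input assumed; ICM's exact statistics and filters may differ):

```python
import pandas as pd
from scipy import stats

def compute_biomarkers(expr: pd.DataFrame, groups: pd.Series,
                       min_fold_change: float = 1.5) -> dict:
    """One-vs-rest Student's t-tests per feature.

    expr:   features x cells matrix in linear scale.
    groups: cluster/attribute label per cell (aligned to expr columns).
    Returns, per group, up-regulated features ranked by ascending p-value.
    """
    results = {}
    for group in groups.unique():
        mask = (groups == group).to_numpy()
        in_grp, rest = expr.loc[:, mask], expr.loc[:, ~mask]
        t, p = stats.ttest_ind(in_grp, rest, axis=1)
        fc = (in_grp.mean(axis=1) + 1e-9) / (rest.mean(axis=1) + 1e-9)
        table = pd.DataFrame({"fold_change": fc, "p_value": p},
                             index=expr.index)
        results[group] = (table[table["fold_change"] >= min_fold_change]
                          .sort_values("p_value"))
    return results
```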

Split by feature type

Split matrix can be invoked on any counts data node with more than one feature type (multiomics data). For example, a CITE-Seq experiment would have Gene Expression counts and Antibody Capture counts in the single cell counts data node. This task splits the different feature measurements apart for downstream analysis.

There are no parameters to configure. To run:

  • Click the counts data node you want to split

  • Click the Pre-analysis tools section of the toolbox

Normalization and scaling

To ensure that different data sets are comparable, several normalization and scaling options are available in Connected Multiomics. These include newly-developed algorithms specifically tailored for genomic analysis.

Task actions

Left-clicking any task (the rectangles) in the analysis pipeline will cause a Task Actions section to appear in the pop-up menu. This allows users to:

  • Rerun tasks: rerun the selected task. The task dialog will pop up and users can change the parameters of the task. Previous downstream analyses of the selected task will not be rerun.

  • Rerun with downstream tasks: rerun the selected task. The task dialog will pop up and users can change the parameters of the current task; the downstream analyses will be rerun with the same configuration as before.

  • Edit description: the description of the task can be replaced by manually typing in a string.

  • Change color: choose a color to apply only to the selected task by clicking Apply. Click Apply to downstream to change the color of the selected task and its downstream pipeline to the newly selected color.

  • Delete task: this option is only available if the user is the owner of the project or the owner of the task. When a task is deleted, all downstream tasks, including tasks from other users, will be deleted. Users may check the box to delete the task's output files; if it is not checked, the task will be removed from the pipeline but its output files will remain on disk.

  • Restart task: this option is only available on failed tasks and requires an admin role, but does not require a user account. Since you are logged in as an admin, restarting a task will not take up a concurrent seat, and the disk space consumed by the output files will count towards the storage space of the task's original owner.


The result is displayed in a table in which each column is a subgroup and each row is a feature. Features are ranked by ascending p-value within each subgroup. If a subgroup has fewer biomarkers than the others, the "extra" fields for that subgroup are left blank.

The report table displays only the top 10 biomarkers for each cluster. The Download link (upper-left corner of the table report) downloads a .txt file to the local computer (default file name: Biomarkers.txt) containing the full report: all the genes with fold change > 1.5, with their corresponding fold changes and p-values.

Each study card displays the following information:

  • Study Name: the name of your study, provided by the user during study creation.

  • Omic Types: the omic types included in your study, determined by the files within your study. Omic types are assigned based on the sample files' extensions; see the File Extensions table for a list of omic types and their corresponding file extensions.

  • Number of Samples: the number of samples within your study, based on the sample files.

  • Number of Analyses: the number of analyses within your study, including both in-progress and completed analyses.

  • Study Description: the description of your study, provided by the user during study creation.

    Region analysis

    Connected Multiomics offers tools to perform analysis on regions data.

    • Get Regional Methylation

    • Annotate regions

    Trajectory analysis

Cells undergo changes to transition from one state to another as part of development, disease, and throughout life. Because these changes can be gradual, trajectory analysis attempts to describe progress through a biological process as a position along a path. Because biological processes are often complex, trajectory analysis builds branching trajectories where different paths can be chosen at different points along the trajectory. The progress of a cell along a trajectory from the starting point, or root, can be quantified as a numeric value: pseudotime.

Connected Multiomics offers the Monocle 2 and Monocle 3 methods.

    • Monocle 2

    • Monocle 3

    Difference Between Monocle 3 and Monocle 2

    Major updates in Monocle 3 (compared to Monocle 2) include:

    • Monocle 3 learns the principal trajectory graph in the UMAP space;

    • the principal graph is smoothed and small branches are excluded;

    • support for principal graphs with loops and convergence points;

    • support for multiple root nodes.

You can filter your data by clicking the Filters icon above the table to filter all columns based on metadata attributes. You can also drag and drop columns to reorder them.

    For example, to create a sample group of females with AB blood type, filter the Sex and ABO fields.

    Note how the sample number at the bottom of the page will change to reflect the number of samples that meet these conditions.

Select all by clicking the check marks on the left-hand side. Now, click to create a sample group from the filtered attributes. Choose to create a new sample group or add to an existing sample group, give the sample group a name, and save.

You can view all your sample groups by clicking the Sample Groups tab at the top. From here, you can search for sample groups and customize or filter the columns in your view.

    You can also click on the action icons to view more details and update your sample groups.

The action icons allow you to:

  • open a pop-up with details of your sample group

  • edit the sample group's name

  • delete the sample group

    Metadata

You can view the sample attributes of all the samples in your study in the Samples tab of your Study. These sample attributes are then used in your Analysis for the samples you selected. Your analysis will display a table of only the samples used in the analysis, along with the sample groups they originated from. You can download the sample data and manage the sample attributes in the Metadata tab of the analysis.

    Log

    The Log tab displays a record of tasks from the task graph, including details such as the user who performed the task, as well as the start and end dates.

    Project Settings

    The Project Settings tab displays the details of the analysis, such as its name and description. From here, you can edit the analysis details, but note that the name cannot be changed.

    Data viewer

    In the Data Viewer tab, you can return to your saved sessions or start new sessions. Saved sessions will retain all the graphs and settings from your last use. You can click on a session to continue where you left off, or create a new Data Viewer session to set up a different data view with new graphs.

    Click into the Data Viewer to access your session. For guidance on how to analyze your data, refer to the Walkthroughs.

    You can navigate between studies and analyses using the breadcrumbs. The top breadcrumb allows you to move from studies to analyses, while the bottom breadcrumb shows either the analysis or the data viewer you have selected.

    Sample Naming

    Proteomic Data

    Sample IDs for proteomic data will be extracted from the sample names provided in the ADAT files. Each file can contain multiple samples. For example, the Sample ID in the ADAT file (left image) corresponds to the sample name in a row of the Connected Multiomics table (right image). A new sample will be created in Connected Multiomics for each Sample ID listed in the ADAT file.

    Single Cell RNA Data

    Sample IDs for scRNA data are obtained by removing the file extension from the file name. As a result, scRNA samples will be paired with their corresponding files based on the sample name. For example, the name of the scRNA files (left image) matches the sample name in a row of the ICM table (right image). Refer to Supported Data Types for information on the required files that make up a single scRNA sample.

    Spatial Data

    Sample IDs for spatial data are derived by removing the file extension from the file name. Thus, spatial samples are linked to their corresponding files through the sample name. For instance, the name of the spatial data files (left image) corresponds to the sample name in a row of the Connected Multiomics table (right image). Refer to Supported Data Types for details on the required files that make up a single spatial sample.

    Sample Statuses

The following list describes the different sample statuses. Note that protein samples only display the Ready status.

  • Ingested: the sample has been ingested into the Study but may not have been uploaded to Analysis yet.

  • Uploading: the sample is in the process of uploading to Analysis. Samples cannot be used for analyses while they are uploading.

  • Ready: the sample is ready to be used in analysis.

  • Failed: the sample ingestion failed.

    Enter a Study Description

    Enter an optional description between 1 and 500 characters in length (including special characters). The study description can be updated later.

    3

    Select an Existing Project

    Select an ICA project to ingest data from. The project list will only display the projects accessible to your workgroup. When adding data to your study, the available data will come from the ICA project you select. Make sure to choose the correct project, as this cannot be updated later.

    4

    Auto-Import Data

    You can optionally choose to auto-import data into your study. If you select an omic type for auto-import, then whenever a pipeline corresponding to that omic type is completed, the output files will automatically be imported into your study. Note that it will not ingest any existing omic data already in the ICA project; it will only ingest output files from pipelines that complete after the option is selected. The omic types you choose to ingest can be changed later.

    Additionally, any new single-cell and spatial data that is added to a study will automatically create sample records and initiate a Default analysis workflow for each sample. For more information, please refer to the assay-specific walkthroughs.

    When you enable auto-ingestion, a notification will be created in ICA to subscribe to the pipeline events for the omic type you’ve selected. Do not delete this notification.

    Click Create Study to create your study. You will be redirected to the new study. For information on how to navigate within your study, see Enter Study.


    Use all samples to create baseline

To normalize data to all the samples, choose whether to calculate the baseline using the mean or the median of all samples for each feature, choose Subtract baseline or Ratio to baseline as the normalization method, and click Finish.

    Use a group to create baseline

When a subset of samples serves as the baseline in the experiment, select use group for Choose baseline samples. The specific group is specified using sample attributes.

Choose use group and select the attribute containing the baseline group information (e.g., Phenotype in this example; the samples with the value Normal for the Phenotype attribute are used as the baseline). The baseline samples can be filtered out after normalization by selecting the Remove baseline samples after normalization check box if you don't want to include them in the downstream analysis.

    Use matched pairs

    When using matched pairs, one sample from each pair serves as the control. An attribute specifying the pairs must be selected in addition to an attribute designating which sample in each pair is the baseline sample.

After normalization, all values for the control sample will be either 0 or 1, depending on whether Subtract baseline or Ratio to baseline was chosen as the normalization method, so we recommend removing baseline samples when using matched pairs.
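A numeric sketch of the two normalization methods, with a per-feature baseline computed from a control group (hypothetical names; the matched-pairs variant would compute the baseline per pair instead):

```python
import pandas as pd

expr = pd.DataFrame(
    {"Ctrl1": [5.0, 2.0], "Ctrl2": [7.0, 4.0], "Trt1": [9.0, 5.0]},
    index=["geneA", "geneB"],
)
baseline_samples = ["Ctrl1", "Ctrl2"]

# Per-feature baseline from the control group (mean; median also possible)
baseline = expr[baseline_samples].mean(axis=1)

subtracted = expr.sub(baseline, axis=0)   # "Subtract baseline"
ratio = expr.div(baseline, axis=0)        # "Ratio to baseline"

# Optionally remove the baseline samples after normalization
result = subtracted.drop(columns=baseline_samples)
```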

  • For methylation array: select a Methylation beta data node.

  • For Illumina 5-base DNA Prep: select a 5-base methylation data node or a Get Regional Methylation data node.

  • Select Methylation analysis from the context-sensitive menu, then select Detect Differential Methylation.

  • Select a factor for analysis and click Next.

  • Define and add a comparison, then click Finish.

    Detect Differential Methylation output

When the task completes, double-click the output data node to open the task report. Each row in the report table is a probe or methylation feature being tested. The table is sorted by p-value by default. Significance measures such as the p-value and FDR step up are computed from the M-values; the LSMeans of the groups and the Difference are computed from the Beta-values.

    Click the Optional columns button to add more column data. For array data, annotation from the Illumina manifest file can be selected from the Optional columns.

Click the Volcano icon next to the comparison name, or the icon buttons under View columns, to open visualizations and additional details on the results.

The table can be exported by clicking the Download button in the top left corner of the table.

    Annotate cells

You will be prompted to specify annotation input options (an example single-file annotation is sketched below):

    • Single file (all samples): requires one .txt file for all cells in all samples. Each row in the file represents a barcode, and at least one barcode column must match the barcodes in your data. It also requires a column containing the Sample ID, which must match the Sample name in the Metadata tab of your project.

    • File per sample: requires all of the annotation files to have the same format. Each file has barcodes on rows and requires one barcode column that matches the barcodes in your data for that sample. All files should have the same set of columns; column headers are case sensitive.
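For illustration, a hypothetical single-file annotation could be generated like this (column names and values are examples only; the barcodes and Sample IDs must match your data and the Metadata tab, respectively):

```python
import pandas as pd

# Hypothetical single-file annotation: one row per cell barcode, plus a
# Sample ID column matching the sample names in the project's Metadata tab.
annot = pd.DataFrame({
    "Barcode":   ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"],
    "Sample ID": ["Sample_01", "Sample_01"],
    "Cell type": ["T cell", "B cell"],
})
annot.to_csv("cell_annotations.txt", sep="\t", index=False)
```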

    Browse to each sample file on the server to specify annotation files for all of the samples in the dialog.

If you would like to annotate your matrix features with a gene annotation file, you can choose an annotation file at the bottom of the dialog. You can choose any gene/feature annotation available on the server. If a feature annotation is selected, the percentage of mitochondrial reads will be calculated using the selected annotation file.

    • Click Next to continue

    The next dialog page previews the attributes found in the annotations text file.

    You can choose which attributes to import using the check-boxes, change the names of attributes using the text fields, and indicate whether an attribute with numbers is categorical or numeric.

    • Click Finish to import the attributes

A new data node, Annotated single cell counts, will be generated. The annotations will be available in downstream analysis tasks.

To see the results for as many features as possible, regardless of how reliable they are, set Use only reliable estimation results to No; results will then be reported unless there is an estimation failure. For example, DESeq2 uses Cook's distances to flag features with outlying expression values; if Use only reliable estimation results is set to Yes, the p-values for such features are not reported, which may lead to missing values in the report (set it to No to avoid this).

  • Click the Exploratory analysis section in the toolbox

  • Click SVD

  • Features to include in calculation

    You don't have to use all the features in the computation, especially when the input matrix is very large. This option allows you to choose a subset of features based on a selected statistic; the default is the top 2,000 features with the highest variance.

  • Number of singular values to calculate

    When the matrix is large, as with single cell data, you don't have to compute all the values; how many you need depends on what you would like to do downstream with the SVD output. Choosing fewer values reduces the running time of the task. By default, the top 100 are computed.

Click the Finish button to run the task with the default settings.

The task report for SVD is a scatterplot in which each dot represents an observation in the input data. The output can be used for downstream analysis and visualization, including Harmony.

    References

    1. Cusanovich, D., Reddington, J., Garfield, D. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018). https://doi.org/10.1038/nature25981

    Merge Cells/Samples

To use the Merge cells option, the data matrices (one or more) to be merged with the currently selected one should have the same features (e.g., genes) but distinct cells. Push the Select data nodes button to display a preview of the pipeline; the data nodes that can be merged are shown in the color of their branch, while other data nodes are disabled (greyed out). Left-click the data node that you want to merge with the current one and click the Select button; you can select multiple data nodes to merge. The selected node(s) will be shown under the Select data nodes button. If you made a mistake, use the Clear selection icon. Push Finish to proceed.

    Merge Features

To use the Merge features option, the data matrices (one or more) to be merged with the currently selected one should have the same cells (or samples) but distinct features (e.g., gene and protein expression). Push the Select data nodes button to display a preview of the pipeline; the data nodes that can be merged are shown in the color of their branch, while others are disabled (greyed out). Left-click the data node that you want to merge with the current one and push the Select button. The selected node will be shown under the Select data nodes button. Repeat the procedure to merge additional nodes. If you made a mistake, use the Clear selection icon. Push Finish to proceed.

The output of the Merge matrices task is a Merged counts data node.
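Both scenarios amount to concatenating matrices along one axis. A minimal pandas sketch, assuming a features × cells orientation and hypothetical names:

```python
import pandas as pd

rna = pd.DataFrame([[1, 2], [3, 4]], index=["geneA", "geneB"],
                   columns=["cell1", "cell2"])
more_cells = pd.DataFrame([[5, 6], [7, 8]], index=["geneA", "geneB"],
                          columns=["cell3", "cell4"])
protein = pd.DataFrame([[9, 8]], index=["CD3_protein"],
                       columns=["cell1", "cell2"])

# Merge cells/samples: same features, distinct cells -> stack columns
merged_cells = pd.concat([rna, more_cells], axis=1)

# Merge features: same cells, distinct features -> stack rows
merged_features = pd.concat([rna, protein], axis=0)
```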

  • Click Split by feature type

    The Split matrix task will run and generate output data nodes for each of the feature types. For example, if there are Antibody Capture and Gene Expression feature types in the input, Split matrix will generate two data nodes. Every sample is included in both matrices.

Sample Metadata

    Sample metadata allows you to add important additional information to your study that can be used to create sample groups or make comparisons as part of your study design. Sample metadata can be uploaded from local storage or added to an ICA project.

    Metadata mapping

Metadata columns are populated based on the information in your metadata TSV file. Metadata files are required per study. Upon importing a metadata file, the samples are automatically updated to reflect the metadata attributes. The sample names in the metadata file must match those in the data files. For example, in the figure below, the Sample ID in the ADAT file matches the SampleID in the metadata file. As a result, the attribute columns in the metadata file correspond to the metadata columns available for selection in the Samples table.

The column order does not matter, but you must have a column headed "SampleID" (no space), and the column values must match the Sample IDs in the software.
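A hypothetical metadata file written with pandas (sample names and attributes are examples only):

```python
import pandas as pd

# First column must be headed exactly "SampleID" (no space) and its values
# must match the sample names shown in the software.
metadata = pd.DataFrame({
    "SampleID":  ["Sample_01", "Sample_02", "Sample_03"],
    "Sex":       ["Female", "Male", "Female"],
    "Phenotype": ["Normal", "Tumor", "Tumor"],
})
metadata.to_csv("metadata.tsv", sep="\t", index=False)
```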

    You can add multiple metadata files. In this case, any samples that do not overlap with existing ones will be added as new entries in the Samples table.

    Metadata cannot be overwritten. If an erroneous metadata file is added, a new study and analysis should be created.

    Download Sample Metadata file

Download a formatted example Metadata file for your study to review the metadata information. Make sure the SampleID column matches the exact names of the samples.

    • Navigate to Sample groups > Info > Download Metadata: click the Sample groups tab, click the Info button under Action, then use the Download Metadata button to download an example metadata file to your machine.

Metadata files should be saved with a .tsv or .csv extension.

    Upload sample metadata file to Connected Multiomics

    Upload the sample metadata file to Connected Multiomics to update the sample records of the study.

    • Within a study, click + Add Data > Upload metadata

    • Upload data from your local or server based drives. Click Browse to select the files or drag and drop the files to the purple box.

This will not add the metadata file to the ICA project as a stored file. Metadata uploaded to a Connected Multiomics study is used only in that study; it will need to be added again to other studies if you want to use it for other samples. To add metadata to the ICA project, you will need to do so within the ICA interface.

    Using a sample metadata file contained in the Illumina Connected Analytics project

To add data, click the + Add Data button in the top right and choose your data type and format. TSV metadata files can always be selected. For example, if you choose Bulk > Proteomics as the data type, you will only see ADAT files and TSV files as options for upload.

    Sync metadata from Study

    After the new metadata file has been added to the Study, users will be able to update their metadata within an existing analysis by clicking the "Sync metadata from study" button in the Metadata tab.

    A new confirmation dialog will appear to make sure users understand what will happen.

Once the Sync metadata button is pushed, the metadata will be updated accordingly.

    No metadata will be applied to the duplicate samples if the metadata file does not contain the sample_uuid.

    Data summary report

    The Data summary report in Connected Multiomics provides an overview of all tasks performed as part of a pipeline. This is particularly useful for report writing, record keeping and revisiting projects after a long period of time.

    Viewing the Data Summary Report

Click an output data node under the Analyses tab of a project and choose Data summary report from the context-sensitive menu on the right. The report will include details from all of the tasks upstream of the selected node. Tasks performed downstream of the selected data node will not be included in the report.

    Each task will appear as a separate section on the Data summary report.

Click the arrow to expand and collapse each section. When expanded, the task name, the user who performed the task, the start date and time, the duration, and the output file size are displayed. To view or hide a table of task settings, click Show/hide details.

    Saving the Data Summary Report

    The Data summary report can be saved in different formats via the web browser. The instructions below are for Google Chrome. If you are using a different browser, consult your browser's help for equivalent instructions.

    Save as a PDF

    On the Data summary report, expand all sections and show all task details. Right-click anywhere on the page and choose Print... from the menu or use Ctrl+P (Command+P on Mac).

    In the print dialog, set the destination to Save as PDF. Select the Background graphics checkbox (optional), click the blue Save button and choose a file location on your local machine.

    The PDF can be attached to an email and/or opened in a PDF viewer of your choice.

    Save as HTML

    On the Data summary report, right-click anywhere on the page and choose Save as… from the menu or use Ctrl+S (Command+S on Mac). Choose a file location on your local machine and set the file type to Web Page, Complete.

    The HTML file can be opened in a browser of your choice.

    Harmony

It is challenging to analyze scRNA-seq data, particularly when the data are assayed with different technologies, because biological and technical differences are interspersed. Harmony [1] is an algorithm that projects cells into a shared embedding where cells group by cell type rather than by dataset-specific conditions. Harmony can simultaneously account for multiple experimental and biological factors while integrating different datasets.

Harmony can be invoked from the Batch removal section only when:

1. the data has categorical attributes (only categorical attributes can be included in the model), and

2. a PCA data node is selected.

    To run Harmony,

    • Click a PCA data node

    • Click the Batch removal section in the toolbox

    • Click Harmony

You will be prompted to pick attribute(s) for the analysis. To set up the model, choose which attributes should be considered. For example, when one dataset contains different cell types from multiple batches, the batch may have divergent impacts on the different cell types.

    To remove batch effects with default settings,

    • Click Batch

    • Click Add factors

    • Click Finish

Diversity clustering penalty (theta): default theta=2. A higher penalty gives stronger correction, which results in better mixing; zero penalty means no correction. The range of this value is from 0 to positive infinity.

The output of Harmony is a new data node containing the Harmony-corrected values; it can be used as the input for downstream tasks such as Graph-based clustering, UMAP, and t-SNE.

Users can click Configure to change the default settings in Advanced options.

Number of clusters (nclust): the number of clusters in the model. Set this to the distinct count of cell types; nclust=1 is equivalent to simple linear regression. Use 0 to enable Seurat's RunHarmony() default setting.

Width of soft kmeans clusters (sigma): the range of this value is from 0 to positive infinity. When set to 0, an observation is assigned to exactly one cluster (hard clustering); when greater than 0, an observation can belong to multiple clusters (soft, or fuzzy, clustering). Default sigma=0.1. Sigma scales the distance from a cell to the cluster centroids: larger values of sigma assign observations to more clusters, while smaller values make soft kmeans clustering approach hard clustering.

Ridge regression penalty (lambda): default lambda=1. Lambda must be strictly positive. Smaller values result in more aggressive correction.

Random seed: use the same random seed to reproduce the results.
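For reference, the open-source harmonypy package exposes the same parameters; the usage sketch below runs it on toy PCA scores. This illustrates the algorithm's interface, not necessarily how ICM invokes it.

```python
import numpy as np
import pandas as pd
import harmonypy  # pip install harmonypy

rng = np.random.default_rng(42)
pcs = rng.normal(size=(1000, 20))     # cells x PCs (PCA scores)
meta = pd.DataFrame({"Batch": rng.choice(["A", "B"], size=1000)})

# theta: diversity clustering penalty (default 2); sigma: soft kmeans
# width; lamb: ridge penalty; random_state: reproducibility seed.
ho = harmonypy.run_harmony(pcs, meta, vars_use=["Batch"],
                           theta=2, sigma=0.1, lamb=1, random_state=42)
corrected = ho.Z_corr.T               # back to cells x PCs
```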

    References

    1. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh P-r, Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 2019.

    Find multimodal neighbors

Multiomics single cell analysis is based on the simultaneous detection of different types of biological molecules in the same cells. Common multiomics techniques include feature barcoding or CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) technologies, which enable parallel assessment of gene and protein expression. Specific bioinformatics tools have been developed to enable scientists to integrate the results of multiple assays and learn the relative importance of each data type (or each biological molecule) in the identification of cell types. Connected Multiomics supports weighted nearest neighbor (WNN) analysis (1), which can help combine the output of two molecular assays.

Invoking Find Multimodal Neighbors

This task can only be performed on data nodes containing PCA scores: PCA output nodes and graph-based clustering output nodes generated from PCA nodes. To start, select the PCA data node of one of the assays (e.g., gene expression) and go to Exploratory analysis > Find multimodal neighbors in the toolbox. On the task setup page, use the Select data node button to point to the PCA data node of the other assay (e.g., protein expression); a node is selected by default.

When you click the Select data node button, another dialog opens showing your current pipeline. Data nodes that can be used for WNN are shown in the color of their branch; other nodes are disabled (greyed out). To pick a node, left-click on it and then push the Select button.

    The selected data node is shown under the Select data node button. If you made a mistake, use the Clear selection link.

If a graph-based clustering task was performed on a PCA data node, the clustering output node retains the PCA scores from its input, so graph-based clustering data nodes can also serve as candidates for the WNN task.

To customize the Advanced options, select the Configure link. At present, you can only change the number of nearest neighbors for each modality (the k.nn option of the Seurat package); the default value is 20. An illustration of how to use this option to assess the robustness of WNN analysis can be found in Hao et al. (1). The nearest neighbor search method is K-NN and the distance metric is Euclidean.

    To launch the Find multimodal neighbors task, click the Finish button on the task setup page. For each cell, the WNN algorithm calculates its closest neighbors based on a weighted combination of RNA and protein similarities. The output of the Find multimodal neighbors task is a WNN data node.

For downstream analysis, you can launch UMAP or graph-based clustering tasks on a WNN node. The example below shows a snippet of an analysis of a feature barcoding data set: gene expression and protein expression data were processed separately, Find multimodal neighbors was invoked on the two respective PCA data nodes, and UMAP and graph-based clustering tasks were performed on the WNN node.

    References

    1. Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573-3587.e29. doi:10.1016/j.cell.2021.04.048

    UMAP

    What is UMAP?

    Uniform Manifold Approximation and Projection (UMAP) is a dimensional reduction technique [1]. UMAP aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. UMAP is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.

    Running UMAP

The UMAP task can run on any counts data node; however, it is very computationally intensive, so we recommend running PCA first and running UMAP on the PCA output data node using the top few PCs.

    • Click the counts data node or PCA data node (recommended)

    • Click the Exploratory analysis section of the toolbox

    • Click UMAP

    • Click Finish to run

    Initialize output values

    Sets the initialization mode. Options are Spectral and Random.

    Spectral - good initial points are chosen using spectral embedding (more accurate)

    Random - random initial points are chosen (faster)

    PCs to use

Choose how many PCs to use; the default is the top 20 PCs. Using fewer PCs will reduce the run time.

    Split cells by sample

Choose whether to run UMAP on all samples together or on each sample individually.

    Checking the box will run UMAP on each sample individually.

    UMAP produces a UMAP task node. Opening the task report launches a scatter plot showing the UMAP results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.

    UMAP vs. t-SNE

    Both t-SNE and UMAP are dimensional reduction techniques that are useful for identifying groups of similar samples in large high-dimensional data sets. A comparison of the techniques for visualizing single cell RNA-Seq data by the authors of UMAP suggests that UMAP runs faster, is more reproducible, gives a more meaningful organization of clusters, and preserves more information about the global structure of the data than t-SNE [2].

    We find UMAP to be more informative than t-SNE for many data sets. For example, the similarities and differences between clusters are clearly visible with UMAP, but more difficult to judge with t-SNE.

    Advanced UMAP parameters

    Local neighborhood size

    UMAP preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Local neighborhood size is the number of nearest neighbors to consider.

    You can adjust this value to prioritize global or local relationships. Smaller values will give a more local view, while larger values will give a more global view. Default is 30.

    Minimal distance

    The effective minimum distance between embedded points. Smaller values will create a more clustered embedding, while larger values will create a more evenly dispersed embedding.

    You can decrease this value to make clusters more tightly packed or increase it to make them looser. Default is 0.3.

    Distance metric

    The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Cosine.

    Number of iterations

    UMAP uses an iterative algorithm to optimize the low-dimensional representation. The value 0 corresponds to the default, which chooses the number of iterations based on the size of the input data. More iterations will result in a more accurate embedding, but will take longer to run. Default is 0.

    Random generator seed

Several parts of UMAP utilize a random number generator to provide initial values. Default is 42. To reproduce the results, use the same random seed in all runs.
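The parameters above map directly onto the options of the open-source umap-learn package, which can serve as a reference implementation; this sketch is illustrative and is not ICM's internal call.

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(42)
pcs = rng.normal(size=(2000, 20))   # cells x top-20 PCs

# n_neighbors = local neighborhood size, min_dist = minimal distance,
# metric = distance metric, init = output initialization,
# random_state = random generator seed.
reducer = umap.UMAP(n_components=2, n_neighbors=30, min_dist=0.3,
                    metric="cosine", init="spectral", random_state=42)
embedding = reducer.fit_transform(pcs)   # cells x 2, ready to plot
```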

    References

    [1] McInnes L and Healy J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, arXiv, 2018, e-prints 1802.03426.

    [2] Becht E, McInnes L, Healy J, Dutertre A-C, Kwok I, Guan Ng L, Ginhoux F, and Newell E, Dimensionality reduction for visualizing single-cell data using UMAP, Nature Biotechnology, 2019, 37, 38-44.

    Annotate features

This task allows you to add feature annotations to the matrix data node.

    • Click a Single cell counts data node

    • Click the Annotation/Metadata section in the toolbox

    • Click Annotate features

    The assembly and annotation files should be consistent with the files used in upstream analysis.

    • Click Finish

A new data node, Annotated counts, will be generated. The annotations will be available in downstream analysis tasks.

    Merge differential expression results

    Why Merge differential expression results

Plotting fold changes of different data types (mRNA and protein in this example) together is a common integrative approach in multiomics studies. By visually examining the scatter plot, and specifically the outlier genes, researchers can gain crucial insights into the hierarchy and kinetics of regulation within the biological system being studied. In ICM, this analysis is performed by running the Merge differential expression results task.

    Running Merge differential expression results

    The task can be invoked on any differential analysis results data node.

    • Click a data node of differential analysis such as DESeq2

    • Click the Exploratory analysis section of the toolbox

    • Click Merge differential expression results

• In the task setup page, click the Select data node button to pick the appropriate data node, then click the Select button

• Click the Next button to proceed

    • Select the matching features for each data node

    • Click Finish to run

    The result is stored under a Merge results node. To open it, double click on the node or select the respective Task report from the context sensitive menu.

    Task report

For this task, the report directly opens a 2D scatter plot in the Data Viewer. Fold changes of the two data types (mRNA and Protein in this example) in log2 scale are shown on the two axes. In the example, the mRNA fold change reflects transcriptional activity (gene expression at the RNA level), while the Protein fold change reflects translational and post-translational processes. By plotting them together, one can see how changes in mRNA abundance correlate with protein abundance.
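
Outside of ICM, the same integrative view can be sketched in a few lines: merge two differential expression tables on a shared feature key and plot the log2 fold changes against each other. The table and column names below are illustrative assumptions, not ICM output formats.

```python
# Hedged sketch of a fold-change vs. fold-change scatter plot (toy data).
import pandas as pd
import matplotlib.pyplot as plt

rna = pd.DataFrame({"gene": ["A", "B", "C"], "log2fc_rna": [1.2, -0.5, 2.0]})
prot = pd.DataFrame({"gene": ["A", "B", "C"], "log2fc_prot": [0.8, -1.1, 0.2]})

merged = rna.merge(prot, on="gene")   # match features across data types
plt.scatter(merged["log2fc_rna"], merged["log2fc_prot"])
plt.xlabel("mRNA log2 fold change")
plt.ylabel("Protein log2 fold change")
plt.show()
```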

    Pre-analysis tools

This section contains tools that are useful in preparing single cell data for downstream analysis, such as multi-sample comparison or multiomics analysis. To invoke Pre-analysis tools, click on any Single cell counts data node. These include the following tasks:

    • Pseudobulk

    • Split by feature type

    • Generate group cell counts

    Spatially Variable Genes

    Spatially variable genes (SVGs) are genes whose expression patterns vary significantly across different spatial locations within a tissue. Identifying SVGs is a key step in spatial transcriptomics analysis and can reveal biologically meaningful spatial expression patterns, tissue architecture, cell-type niches, or gradients of signaling molecules.

    We have implemented PROST, a highly scalable algorithm for SVG detection [1].

    Performing SVG analysis

The task can be invoked on any non-normalised node containing spatial data; we recommend filtering cells and genes before running the task. The analysis is species-agnostic.

    • Select the appropriate node and click on 'Statistics>Spatially Variable Genes'

    • Edit the task settings as necessary:

• Note that the percentage parameter can be adjusted depending on dataset size; this may affect running times.

    • Adjust the advanced settings as needed:

• We currently recommend using 2,000 HVGs for the PROST calculation in order to make the task more scalable on large datasets. This parameter can be increased to include all genes in the data, but doing so may severely affect performance.

    • Click Finish to run the task.

Once the task has completed you will see a new 'Spatially variable genes' node on the task graph:

Double-clicking the node will open the task report: a table of the most significant SVGs identified, ordered by PI score:

The table contains the genes identified as SVGs and the PROST Index (PI) per feature, an indicator of spatial variability [1]. The results can be downloaded as a table or visualised in the Data viewer, for example by plotting individual SVGs on the spatial map of the tissue alongside spatial domains (see the BANKSY task).

    References

[1] Liang, Yuchen, et al. "PROST: quantitative identification of spatially variable genes and domain detection in spatial transcriptomics." Nature Communications 15.1 (2024): 600. https://www.nature.com/articles/s41467-024-44835-w

    Annotate Variants (SnpEff)

    An important aspect of variant analysis is the ability to prioritize specific variants for further investigation. As variant detection can often identify a large number of variants, it may be difficult to determine which variants may impact phenotypes. SnpEff (version 4.1k) provides a means to annotate and predict the effects of variants on genes, allowing for prioritization of variants within the project. In addition, the SnpEff databases utilized for prediction support a large number of genome assemblies. Information regarding the implementation of the predictions is detailed by Cingolani et al. [1] The predicted effect of the variant is categorized by impact:

    • HIGH - frame shifts, addition/deletion of stop codons, etc;

    • MODERATE – codon change/deletion/insertion, etc;

    • LOW – synonymous changes, etc;

    • MODIFIER – changes outside coding regions, etc.

Further details about output metrics can be found in the SnpEff documentation. The Annotate variants (SnpEff) task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the vcf files. Annotation information will also be visible in the View variants Variant report and the Cohort mutation summary report.

    Annotate variants (SnpEff) dialog

The task dialog for Annotate variants (SnpEff) contains two sections: Select SnpEff database and Advanced options. Select SnpEff database will specify the reference assembly to utilize for variant detection. If the variant detection was performed in Connected Multiomics, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Connected Multiomics, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added will be available for selection or New assembly… can be utilized to import the reference sequence within the task. Select SnpEff database will allow selection of databases utilized for prediction, and Connected Multiomics provides automated download of a limited number of these databases. Databases previously added will be available for selection or Add SnpEff variant database in the menu can be utilized to import the reference sequence within the task. Additional information on SnpEff databases can be found in the SnpEff documentation.

    Advanced options provides a means to tune parameters for annotation generated from the SnpEff database. Upon invoking the task dialog, Option set is set to Default, and these parameters are prescribed by the developers of SnpEff. Clicking Configure will open a window to tune advanced options. SnpEff has Advanced options for Results filter options, Annotation options, and Database options. Moving the mouse cursor over the info button will provide details for each parameter.

    References

    1. Cingolani P, Platts A, Wang LL, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80-92.

    Combine Variants

Due to fundamental differences between the statistical models employed by different variant detection tools, as well as varying parameter optimizations within tools selected for discrete discovery goals, there can be a large number of unique and common variants identified between each instance of variant detection. In some cases the goal of the study is to provide a list of all possible variants, whereas in other studies the goal is to generate a list of variants with increased confidence of true polymorphic sites. To facilitate both possible goals, the Combine variants tool generates either the union or intersection of two variants data nodes. This task provides a means to identify common and unique variant calls in samples that have undergone two discrete variant calling tasks. The Combine variants task can be invoked from any Variants or Annotated variants data node, assuming at least two discrete Variants data nodes exist in the analysis. The task will generate two new variant data nodes and underlying vcf files: one for the union and one for the intersection of the variant data.

To run the task, select a Variants data node and then click Combine variants in the task menu. The task dialog will allow you to select a second Variants data node to be combined with the first. The selection allows for only one other data node to be used in the task. If there are no other valid variant tasks available within the project, a message stating "No connections to upstream task found" will be displayed.

    Currently, this is the only task in Connected Multiomics that requires two input data nodes and then generates two output data nodes.
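
Conceptually, the union and intersection work as in the sketch below, assuming each variant is keyed by chromosome, position, reference allele, and alternate allele; ICM operates on the underlying vcf files rather than Python sets.

```python
# Union vs. intersection of two variant call sets (toy example).
caller_a = {("chr1", 12345, "A", "G"), ("chr2", 500, "T", "C")}
caller_b = {("chr1", 12345, "A", "G"), ("chr3", 999, "G", "T")}

union = caller_a | caller_b         # all possible variants from both callers
intersection = caller_a & caller_b  # higher-confidence calls found by both
```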

    Imported count matrix report

The imported count matrix report is a summary report on the sample distribution of imported count matrix data, e.g. ILMN miRNA count matrix data.

    • Click an imported miRNA data node

    • Click the QA/QC section of the toolbox

    • Click Imported count matrix report

    A new task node is generated with the Imported count matrix report. Double click on the report to open it.

The Feature distribution table title displays the size of the matrix: the number of samples and the number of features. Each row of the table is a sample, and the columns contain descriptive statistics of the features in that sample.

If there are fewer than 30 samples in the data node, a bar chart is presented. Each bar is a sample. The X-axis is the read count range; the Y-axis is the number of features within the range. Hovering your mouse over a bar displays the following information:

    • Sample name

• Range of read counts, where “[” indicates inclusive and “)” indicates exclusive, e.g. [0,0] means 0 read counts; (0,10] means greater than 0 counts and less than or equal to 10 counts.

    • Number of features within the read count range

    • Percentage of the features within the read count range

A box-whisker plot is displayed below the bar chart. Each box is a sample on the X-axis; the box represents the 25th and 75th percentiles, the whiskers represent the 10th and 90th percentiles, and the Y-axis represents the feature counts. When you hover over a box, detailed sample information is displayed:

    • Sample name

• Range of read counts, where “[” indicates inclusive and “)” indicates exclusive

    • Number of features within the read count range in the sample

    TF-IDF normalization

Latent semantic indexing (LSI) was first introduced for the analysis of scATAC-seq data by Cusanovich et al. 2018 [1]. LSI combines a step of term frequency-inverse document frequency (TF-IDF) normalization followed by singular value decomposition (SVD). Connected Multiomics wrapped Signac's TF-IDF normalization for single cell ATAC-seq datasets. It is a two-step normalization procedure that both normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to rarer peaks [2].

    To run TF-IDF normalization,

    • Click a single cell ATAC counts data node

    • Click the Normalization and scaling section in the toolbox

    • Click TF-IDF normalization

    The output of TF-IDF normalization is a new data node that has been normalized by log(TF x IDF). We can then use this new normalized matrix for downstream differential analysis and/or visualization.
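
The two steps can be sketched in numpy as below. Signac's implementation adds a scaling constant and uses log1p, so treat this as an illustration of the idea rather than a byte-for-byte reproduction.

```python
# Toy TF-IDF normalization of a peaks-by-cells count matrix.
import numpy as np

counts = np.array([[2, 0, 4],
                   [0, 1, 0],
                   [3, 3, 1]], dtype=float)  # peaks x cells

tf = counts / counts.sum(axis=0, keepdims=True)           # per-cell depth correction
idf = counts.shape[1] / np.count_nonzero(counts, axis=1)  # rarer peaks get larger IDF
normalized = np.log1p(tf * idf[:, None])                  # log(TF x IDF)-style values
```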

    References

1. Cusanovich, D., Reddington, J., Garfield, D. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018). https://doi.org/10.1038/nature25981

2. Signac documentation: https://satijalab.org/signac/index.html

    Compare clusters

    What is Compare clusters?

    Compare clusters is a tool to identify the optimal number of clusters for K-means Clustering using the Davies-Bouldin index. The Davies-Bouldin index is a measure of cluster quality where a lower value indicates better clustering, i.e., the separation between points within the clusters is low (tight clusters) and separation between clusters is high (distinct clusters).
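
As an illustration of the index itself, here is a minimal scikit-learn sketch that runs K-means over a range of cluster numbers and scores each partition; the toy data is an assumption and ICM's implementation may differ.

```python
# Davies-Bouldin index across candidate cluster numbers (lower is better).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.rand(200, 10)  # toy normalized counts, 200 samples x 10 features

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))
```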

    Running Compare clusters

    We recommend normalizing your data prior to running Compare clusters, but the task will run on any counts data node.

    • Click the counts data node

    • Click the Exploratory analysis section of the toolbox

    • Click Compare clusters

    • Configure the parameters

The parameters for Compare clusters are the same as for K-means clustering.

    Compare clusters task report

The Compare clusters task report is a line chart with the number of clusters on the x-axis and the Davies-Bouldin index on the y-axis.

    The Compare clusters task report can be used to run K-means clustering.

    • Click a point on the plot to select it or type the number of clusters in the text box Partition data into clusters

    Selecting a point sets it as the number of clusters to partition the data into. The number of clusters with the lowest Davies-Bouldin index value is chosen by default.

    • Click Generate clusters to run K-means clustering with the selected number of clusters

A K-means clustering task node and a Clustering result data node are produced. Please see our documentation on K-means Clustering for more details.

    Seurat3 integration

Seurat v3 [1] introduced new methods for the integration of multiple single-cell datasets, regardless of whether they were collected from different individuals, experimental conditions, technologies, etc. The Seurat 3 integration method uses a subset of the data as a reference for the integration analysis and integrates all other data with that reference subset. The subset can be one sample or a subgroup of samples defined by a factor attribute.

Seurat3 integration in Connected Multiomics can be invoked from the Batch removal section when a Normalized counts data node is selected.

    To run Seurat3 integration,

    • Click a Normalized counts data node

    • Click the Batch removal section in the toolbox

    • Click Seurat3 Integration

You will be prompted to pick attribute(s) for the analysis. The first Seurat3 integration dialog is a drop-down list that includes the factors for data integration. To set up the model, choose the attribute that represents the batch factor and click Finish.

    The output of Seurat3 integration is a new data node - Integrated counts. We can then use this new integrated matrix for downstream analysis and visualization.

Users can click Configure under Advanced options to change the default settings.

Use reference to find anchors: when this box is checked, the first group of the selected attribute is used as the reference to find anchors. To use a different group as the reference, change the order of subgroups of the attribute on the attribute management page of the Metadata tab. When the box is unchecked, anchors will be identified by comparing all pairs of subgroups; this option is very computationally intensive.

    Perform L2 normalization: Perform L2 normalization on the CCA cell embeddings after dimensional reduction.

    Pick anchors: How many neighbors to use when picking anchors.

    Filter anchors: How many neighbors to use when filtering anchors.

    Score anchors: How many neighbors to use when scoring anchors.

    Nearest neighbor finding methods: Method for nearest neighbor finding. Options include: rann, annoy.

    References

1. Stuart T, Butler A, Hoffman P, et al. Comprehensive integration of single-cell data. Cell, 2019. DOI: https://doi.org/10.1016/j.cell.2019.05.031

    Variant Analysis

    Variations in nucleotide sequence, in the form of single nucleotide variants (SNVs) and insertion and deletion events (INDELs), can either be neutral in nature or can have functional effects. Connected Multiomics provides all the tools necessary to interrogate and prioritize variants for further analysis. Variants stored in Variant Call Format (vcf) files can be analyzed to filter, annotate, summarize, visualize, and validate your panel of identified variants. Multiple vcf processing tools are available under the Variant analysis section of the context sensitive menu.

    • Annotate Variants

• Annotate Variants (SnpEff)

• Annotate Variants (VEP)

• Filter Variants

• Summarize Cohort Mutations

• Combine Variants

• Remove Background Mutations

    Pseudobulk

The Pseudobulk task combines expression values from all cells of a particular cell type classification for each sample. In essence, it creates virtual bulk-level data from single cell-level data. Because it is virtual bulk data, all the tasks that can be performed on bulk-level gene counts data can be performed on the output of the Pseudobulk task.

    Pseudobulk makes it easy to compare gene or protein expression for a cell type of interest between experimental groups.

Before running Pseudobulk, you must classify the cells. To run Pseudobulk, select the data node with your classified cells and invoke this task.

Select how you would like to group the cells. By default, sample name is selected; this option pools all the cells in a sample to generate sample-level expression. You can add other attributes from the drop-down list, e.g. cell type, in which case cells of each cell type within a sample are pooled to produce expression values for all features.

Aggregation method options are Sum, Maximum, Mean, and Median. Expression values for cells from the same sample with the same cell type classification will be merged using the chosen pooling method. If the input data node contains raw counts, Sum is recommended; if the input data node contains normalized counts, Mean or Median makes more sense.

    After clicking Finish, a Pseudobulk data node will be generated which contains bulk level expression data.
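
The aggregation logic can be sketched with pandas as below; the toy table and column names are assumptions for illustration, and ICM performs this internally.

```python
# Pseudobulk aggregation: pool cells by sample and cell type, then sum.
import pandas as pd

cells = pd.DataFrame({
    "sample": ["s1", "s1", "s2"],
    "cell_type": ["T", "T", "B"],
    "GeneA": [3, 5, 2],
    "GeneB": [0, 1, 4],
})

# Sum is appropriate for raw counts; use mean()/median() for normalized counts.
pseudobulk = cells.groupby(["sample", "cell_type"]).sum()
```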


Attribute report

This task is only available on single cell matrix data nodes. It summarizes the cell-level attributes of a data node and displays the result in two tables: one containing the categorical attributes and the other containing the numerical attributes.

    Running attribute report

    Click on any single cell count data node and select Attribute report from the Annotation/Metadata task menu

    Double click on the result node to view the tables.

The attribute report summarizes the categorical and numerical attributes in two tables.

To download a text-file version of one of the tables, click Download in the upper-left corner of the table.

    Cell barcode QA/QC

    The Cell barcode QA/QC task lets you determine whether a given cell barcode is associated with a cell. This is an important QC step in all droplet-based single cell RNA-seq experiments, where all barcodes are sequenced.

    To invoke Cell barcode QA/QC:

    • Click a Single cell counts data node

    • Click the QA/QC section of the task menu

• Click Cell barcode QA/QC

    Annotate Regions

ATAC-seq and 5-base DNA methylation analyses generate regions in the genome. To better understand the roles of these regions in regulating gene expression, we can use the Annotate regions task to add information about overlapping or nearby genomic features to the regions.

    Running Annotate regions

    The input for Annotate regions task is a region data node.

    • Click to select a regions data node.

    Differential Analysis

Powerful Connected Multiomics statistical analysis tools help identify differential expression patterns in the dataset. These can take into account a wide variety of data types and experimental designs. The available tools include:

• Welch's ANOVA

• Kruskal-Wallis / Wilcoxon

• Poisson / Negative binomial / GSA

• DESeq2

• Hurdle model

• ANOVA/LIMMA-trend/LIMMA-voom

• Troubleshooting

    General linear model

This method is based on a general linear model. Much like ANOVA in reverse, it calculates the variation attributed to the factor(s) being removed and then adjusts the original values to remove that variation.

    By including batch in the differential analysis model, the variability due to the batch effect is accounted for when calculating p-values. In this sense, batch effects are best handled as part of the differential analysis model. However, clustering data or visualizing biological effects can be very difficult if batch effects are present in the original data. We transform the original values to remove the batch effect using this tool.

    We recommend normalizing your data to log normal distribution prior to removing batch effects using this method, but the task will run on any counts data node.

    • Click the counts data node

• Click the Batch removal section of the toolbox

    Welch's ANOVA

Welch's ANOVA is similar to one-way ANOVA, but it does not have the equal variance assumption. It applies to data with an attribute that has more than two groups where the variances among the groups are not equal. It is based on weighted means. When the attribute has only two groups, it is equivalent to the unequal variance t-test (also known as Welch's t-test).

    Running the task

To invoke Welch's ANOVA, select any count-based data node; these include:

    • Gene counts

    PCA

Principal components analysis (PCA) is an exploratory technique that is used to describe the structure of high-dimensional data by reducing its dimensionality. It is a linear transformation that converts n original variables (e.g. genes/transcripts/proteins) into n new variables called principal components (PCs), which have three important properties:

    • PCs are ordered by the amount of variance explained

    • PCs are uncorrelated

    • All PCs explain all variation in the data

    PCA is a principal axis rotation of the original variables that preserves the variation in the data. Therefore, the total variance of the original variables is equal to the total variance of the PCs.

    Kruskal-Wallis / Wilcoxon

Both Kruskal-Wallis and Wilcoxon tests are rank tests; such rank-based tests are generally advised for use with larger sample sizes. Both can only take one factor into account at a time. Kruskal-Wallis can be performed on an attribute with two or more subgroups.

The Wilcoxon test is a close alternative to the Kruskal-Wallis task, matching the results of the scanpy Wilcoxon method. This test is also called the "Wilcoxon Rank-Sum Test" or "Mann-Whitney U Test". When you perform comparisons on two groups, it first filters the data to include only the two groups and then performs the differential analysis.

    Running the task

To invoke the Kruskal-Wallis test, select any count-based data node; these include:

    Feature distribution

    The Feature distribution plot visualizes the distribution of features in a counts matrix data node.

    Running Feature distribution

    To run Feature distribution:

    • Click a counts data node

    SCTransform

The SCTransform task performs the variance stabilizing normalization proposed in [1]. The task's interface follows that of the SCTransform() function in R [2]. SCTransform v2 [3] provides the ability to perform downstream differential expression analyses, in addition to improvements in running speed and memory consumption. v2 is the default method.

We recommend performing the normalization on a single cell raw count data node. Select the SCTransform task in the Normalization and scaling section of the pop-up menu to invoke the dialog.

By default, the task reports all the input features. By unchecking Report all features, you can limit the results to a certain number of features with the highest variance.

In Advanced options, users can click Configure to change the default settings.

    Scale results: Whether to scale residuals to have unit variance; default is FALSE

    Center results: When set to Yes, center all the transformed features to have zero mean expression. Default is TRUE.

    DESeq2

The algorithm details for DESeq2 can be found on the external DESeq2 documentation page.

    If the value of the raw count includes a decimal fraction, the value will be rounded to an integer before DESeq2 is performed. Before you run this task, we recommend that you first remove (filter out) features expressed at a low level and then perform normalization using Median ratio (DESeq2 only).

Note: DESeq2 differential analysis can only be performed on the output data nodes of the following normalization methods, which produce library sizes:

    TMM, CPM, Upper Quartile, Median ratio, Postcounts

    Configuring DESeq2

    AUCell

    What is AUCell?

AUCell is a tool to identify cells that are actively expressing genes within a gene list [1]. For each input gene list, AUCell calculates a value for each cell by ranking all genes by their expression level in the cell and identifying what proportion of the genes from the gene list fall within the top 5% (default cutoff) of genes. This method allows the AUCell value to represent the proportion of genes from the gene list that are expressed in the cell and their relative expression compared to other genes within the cell. Because this is a rank-based method and is calculated for each cell individually, AUCell can be run on raw or normalized data. As an AUCell value is the proportion of genes from the list that are within the top percentile of expressed genes, AUCell values can range from 0 to 1, but may have a more restricted range.

AUCell values can be used directly as input for downstream analysis, such as clustering. Another common use is to set an AUCell value cutoff for expressing vs. not expressing and use this to classify cells. AUCell values will separate cells most effectively when the genes in the list are highly and specifically expressed in a population of cells. If the genes are specifically expressed, but not highly expressed, the AUCell value will not be as useful.

    Exploratory analysis

    Connected Multiomics offers a wide variety of tools to help you explore your data. Which tools are available depends on the type of data node selected.

    t-SNE

    What is t-SNE?

    t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensional reduction technique [1]. t-SNE aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. t-SNE is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.

    Running t-SNE

    Scran deconvolution

Library size normalization is the simplest strategy for performing scaling normalization. But composition biases will be present when any unbalanced differential expression exists between samples. The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts [1]. To overcome this, Connected Multiomics wrapped the calculateSumFactors() function from the R package scran. It pools counts from many cells to increase the size of the counts for accurate size factor estimation. Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile [1].

    To run Scran deconvolution,

    • Click a single cell counts data node

    • Click the Normalization and scaling section in the toolbox

    Filter Variants

    Variant detection can identify large numbers of variants, dependent both on the size of the regions being interrogated and the parameters utilized during detection. As such, filtering variants is often a necessary task to aid in identifying variants that may be relevant for downstream investigation. The Filter variants task enables users to filter variant data both in regards to quality metrics generated during detection and annotation information. The task can be invoked from any Variants or Annotated variants data node.

    Filter variants dialog

    The Filter variants task dialog can contain two to five sections, dependent on the variant caller used for detection and the level of annotation. All instances of the task will include the following: Include region overlapping variants and a section for Quality.

    Get Regional Methylation

    The Get Regional Methylation task uses a region annotation (.bed) file to generate regions from CpG methylation count data. For each region specified in the annotation file, methylation counts of all CpG sites within the region are averaged to produce a count value in a regions-by-samples count matrix.

    Using a built-in promoter regions file

    A built-in annotation model based on human (hg38) promoter regions from ENSEMBL version 114 is provided. To invoke Get Regional Methylation task,

    • Click on 5-base Methylation

    Annotate Variants

    The Annotate variants task provides a means to add information with regards to genomic features, such as transcript models, and existing variant databases to the variants contained in the projects. This information can be useful for filtering, interpreting, and prioritizing variants for downstream investigation. The Annotate variants task can be invoked from any Variants or Annotated variants data node, and the task will be added to and supplement any existing annotation in the underlying vcf files. Annotation information will also be visible in the downstream View variants Variant report.

    Annotate variants dialog

The task dialog for Annotate variants contains three sections: Assembly, Annotate with genomic features, and Annotate with known variants.

    Annotate Variants (VEP)

An important aspect of variant analysis is the ability to prioritize variants for downstream analysis. As variant detection can often identify a large number of variants, it may be difficult to determine which variants may impact phenotypes. As implemented in Connected Multiomics, the Ensembl Variant Effect Predictor (VEP, version 84) [1] provides a means to add detailed annotation to variants in the analysis, such as discrete aspects of transcript models and variant databases not available in the task. For variants identified in human data, information from popular tools that predict the impact of variants that cause amino acid changes, SIFT [2] and PROVEAN [3] (available for the hg19 genome assembly), will be included. VEP databases can be obtained for multiple species, and content will be dependent on available transcript and variant information for that organism. The Annotate variants (VEP) task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the vcf files. Annotation information will also be visible in the downstream View variants Variant report.

    Annotate variants (VEP) dialog

    Filter samples/cells

Filter samples or cells, depending on the input data, to perform downstream analysis on a subset of the data.

Click a count matrix or single cell counts data node, then in the Filtering section of the pop-up menu choose Filter samples (bulk data) or Filter cells (single cell data).

    The dialog lets you build a series of filters based on sample or cell attributes.

    Click Finish to apply the filter. If no sample or cell will pass the filter criteria, a warning message will appear and the task will not run.

    Filter by metadata

    The first drop-down menu allows you to choose to

    Normalize to housekeeping genes

    This normalization is performed on observations (samples) using internal control features (genes). The internal control features, usually housekeeping genes, should not vary among samples[1].

    Note: The input data node must contain all positive values to compute geometric mean.

Select the Normalize to housekeeping genes task in the Normalization and scaling section of the pop-up menu when a count matrix data node is selected. The dialog will list all the features included in the data node on the left panel.

    Select control genes on the left panel and move them to the right panel. You can also use search box to find the feature and click the plus button to add it to the right panel.

    Click Finish.

The implementation details are as follows:

Let S represent a sample, F a feature (gene), G a geometric mean, and n the number of samples.


    1. Compute geometric mean of all the control genes (features) in each sample S individually, represented by GS1, GS2, GS3... to GSn.

    2. Compute geometric mean of the geometric means across all samples (GS1 to GSn), represented by GS

    3. Compute the scaling factor for each sample, S1=GS1/GS, S2=GS2/GS ... Sn=GSn/GS

4. Normalize all gene expression values by dividing them by the sample's scaling factor
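
A minimal numpy sketch of these four steps follows; the toy matrix and the chosen control rows are assumptions for illustration.

```python
# Geometric-mean scaling to housekeeping genes (toy genes x samples matrix).
import numpy as np

expr = np.array([[10.0, 20.0, 15.0],
                 [5.0, 8.0, 6.0],
                 [100.0, 210.0, 160.0]])  # all values must be positive
control_idx = [0, 1]                      # rows of the housekeeping genes

gs = np.exp(np.log(expr[control_idx]).mean(axis=0))  # step 1: GS1..GSn per sample
grand = np.exp(np.log(gs).mean())                    # step 2: grand geometric mean GS
factors = gs / grand                                 # step 3: per-sample scaling factors
normalized = expr / factors                          # step 4: divide each sample's column
```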

    References

1. Frank Speleman. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biology. 2002.


    The task can be performed with or without the EmptyDrops method enabled.

    Cell Barcode QA/QC without EmptyDrops

    To perform the task without the EmptyDrops method enabled, leave the checkbox unchecked and click Finish.

Note: this option is recommended for data imported from DRAGEN results, since barcodes with 0 counts have already been filtered out.

The Cell barcode QA/QC task report is a plot: the X-axis is the barcodes ranked by their UMI counts, and the Y-axis is the UMI counts in the barcode. This type of plot is often referred to as a knee plot.

The knee plot is used to choose a cutoff point between barcodes that correspond to cells and barcodes that do not, when raw count data was imported without any upstream barcode filtering. Connected Multiomics automatically calculates an inflection point, shown by the vertical line on the graph. Barcodes designated as cells are shown in blue, while barcodes designated as without cells (background) are shown in grey.

    The cutoff can be adjusted by dragging the vertical line across the graph or by using the text fields in the Filter panel on the left-hand side of the plot. Using the Filter panel, you can specify the number of cells or the percentage of reads in cells and the cutoff point will be adjusted to match your criteria. The number of cells and the percentage of counts in cells is adjusted as the cutoff point is changed. To return to the automatically calculated cutoff, click Reset sample filter.

    The percentage of counts in cells and median counts per cell are useful technical quality metrics that can be consulted when optimizing sample handling, cell isolation techniques, and library preparation.

    One knee plot is generated for each sample. In projects with multiple samples, Next and Back buttons will appear at the top left of the plot, to enable navigation between sample knee plots. Manual filters must be set separately for each sample. This is typically used when the user expects a certain number of cells to be processed, like in experiments where droplets were loaded with a predefined number of cells.

    To return to the knee plot view, click Back to filter. To apply the filter and run the Filter barcodes task, click Apply filter. A Filtered counts data node will be generated.
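
The knee plot itself is straightforward to reproduce, as in the hedged sketch below: rank barcodes by total UMI count and plot on log-log axes. The simulated counts are an assumption, and ICM's inflection-point detection is automatic and not reproduced here.

```python
# Knee plot: barcodes ranked by total UMI count on log-log axes.
import numpy as np
import matplotlib.pyplot as plt

umi_per_barcode = np.random.negative_binomial(2, 0.001, size=5000)  # simulated
ranked = np.sort(umi_per_barcode)[::-1]
ranked = ranked[ranked > 0]  # drop zero-count barcodes before log scaling

plt.loglog(np.arange(1, len(ranked) + 1), ranked)
plt.xlabel("Barcode rank")
plt.ylabel("UMI counts")
plt.show()
```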

    Cell Barcode QA/QC with EmptyDrops

    If your data has already been filtered to remove barcodes with low total counts, this method will not be suitable. This method requires empty barcodes to be present in the single cell count matrix, in order to estimate the ambient RNA profile.

    The EmptyDrops method (1) uses a statistical test to identify which barcodes correspond to real cells and empty droplets. An ambient RNA expression profile is estimated from barcodes below a specified total UMI count threshold, using the Good-Turing algorithm. The expression profile of each barcode above the low-count threshold is then tested for deviations from the ambient profile. Real cells are expected to have a low p-value, indicating a significant deviation from the expected background noise level. False discovery rate (FDR) correction is applied to all the p-values and those falling equal to or below the specified FDR level are detected as real cells. This can allow for the detection of additional cells that would otherwise be discarded due to a low total UMI count.

    In addition, a knee point threshold will be calculated to identify cells with a very high total UMI count. It's possible that some barcodes with a high total UMI count will not pass the EmptyDrops significance test. This could be due to biases in the ambient RNA profile, leading to a non-significant difference between a barcode's expression profile vs the ambient profile. To protect against this issue, it is advisable to use the EmptyDrops results in conjunction with the knee point filter, on the assumption that barcodes with a very high total UMI count will always correspond to real cells. Note, the knee point will be more conservative than the inflection point calculated by Connected Multiomics when the EmptyDrops method is not enabled.

    To perform the task with the EmptyDrops method, check the checkbox, configure the additional options, and click Finish.

    Ambient count threshold

    Barcodes with a total UMI count equal to or below this threshold will be used to create the ambient RNA expression profile to estimate background noise. The default is set to 100, which is reasonable for most data.

    FDR threshold

    Barcodes equal to or below this FDR threshold show a significant deviation from the ambient profile and can therefore be considered real cells. Increasing this value will result in more cells, but will also increase the number of potential false positives.

    Random generator seed

    This is used for performing Monte Carlo simulations to determine p-values. To reproduce results, use the same random seed for all runs.

    There are additional metrics on the left of the plot in the report.

The numbers of cells detected by the EmptyDrops test and the knee point filter are shown above the Venn diagram on the left. In the above example plot, 3,189 barcodes are above the knee point filter (represented by the vertical blue line on the plot) and 2,657 barcodes passed the significance test in EmptyDrops. The overlap between these sets of barcodes is represented by the Venn diagram. 1,583 barcodes pass the significance test in EmptyDrops and have a high total UMI count above the knee point filter; 1,606 barcodes have a very high total UMI count with no significant difference from the ambient profile in EmptyDrops; 1,074 barcodes fall below the knee point but are still significantly different from the ambient profile.

The number of cells included by the knee point filter can be adjusted either by clicking on the plot to change the position of the vertical blue line or by typing a different number of cells into the text box on the left.

    The total number of cells is shown in the text box on the left. By default, this will be all of the cells detected by the knee point filter plus the extra cells detected by EmptyDrops. In the example, there are 3,189 cells with a high total UMI count plus the additional 1,074 cells from EmptyDrops (total = 4,263).

    Different sections of the Venn diagram can be selected/deselected to include/exclude barcodes. For example, clicking the '1,606' section of the Venn diagram will deselect those barcodes. Now, the only cells that will pass the filter will be the significant ones from EmptyDrops.

    References

    1. Lun, A., Riesenfeld, S., Andrews, T. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 2019; 20: 63.

• Select Region analysis from the context-sensitive menu, then select Annotate regions.

  • Select an Assembly and Annotation model.

  • Select an option for Genomics overlaps:

    • Report one gene region per genomic feature (precedence applies): chooses one gene section for each region using the precedence order when more than one gene section overlaps a region. The order of precedence is TSS, TTS, CDS Exon, 5' UTR Exon, 3' UTR Exon, Intron, Intergenic.

• Report all gene regions per feature: creates a row for each gene section that overlaps a region in the task report.

• Define the transcription start site (TSS) and transcription termination site (TTS) limits in units of bp.

  • Click Finish to run.

Annotate regions output

When completed, an Annotate regions task node and an Annotated regions data node are generated. Double-click on the Annotated regions data node to open the task report. The report consists of a pie chart and a table. The pie chart shows the breakdown of gene sections among the regions. The table shows annotation information: if run with the option to report all gene sections per region, each region will have a row for each gene section it overlaps; if run with the option to report one gene section per region, each region will have one row, with the gene section chosen using the order of precedence. The table can be sorted by any of its columns.

Click Optional columns in the upper-left corner of the table to add more information on each region.

    Gene sections

    TSS

    Transcription start site (TSS) is -1000bp and +1000bp (default setting) from the TSS for a transcript

    TTS

    Transcription termination site (TTS) is -1000bp and +1000bp (default setting) from the TTS for a transcript

    CDS Exon

    Coding sequence (CDS) Exon is overlapping a coding exon in a transcript

    5' UTR Exon

    5' Untranslated Region (UTR) Exon is overlapping an exon in the 5' UTR of a transcript

    3' UTR Exon

    3' Untranslated Region (UTR) Exon is overlapping an exon in the 3' UTR of a transcript

    Intron

    Intron is overlapping an intron in a transcript

    Intergenic

    Intergenic is not located within 1000bp of a transcript
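
When one annotation per region is requested, the precedence rule behaves like the sketch below, assuming each region comes with the set of gene sections it overlaps; the helper function is hypothetical.

```python
# Pick one gene section per region using the documented precedence order.
PRECEDENCE = ["TSS", "TTS", "CDS Exon", "5' UTR Exon",
              "3' UTR Exon", "Intron", "Intergenic"]

def pick_section(overlapping_sections):
    """Return the highest-precedence gene section for a region."""
    for section in PRECEDENCE:
        if section in overlapping_sections:
            return section
    return "Intergenic"

print(pick_section({"Intron", "CDS Exon"}))  # -> "CDS Exon"
```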

• Click General linear model

• The batch effect removal dialog is similar to the dialog for ANOVA. To set up the model, you need to choose factors and interactions of factors based on your experimental design, in addition to the attribute(s) representing the batch you would like to remove.

For example, if you have samples from different cell types and treatments that come from different batches, the batch may have a different effect on different cell types and/or treatments. In that case you would need to include treatment, cell type, and batch, and possibly the interaction between treatment and cell type, in the model.

    • Select the Remove checkbox for batch

    • Click Finish to run

The output of General linear model is a new data node containing the batch effect corrected values, while the treatment and cell type variation is retained. It can be used as the input for downstream tasks.
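
For a single feature, the idea can be sketched as below: fit a linear model with treatment and batch terms, then subtract only the fitted batch contribution so the treatment signal is preserved. This is a toy illustration; ICM's implementation handles full factorial designs and interactions.

```python
# Remove batch variation for one gene while keeping the treatment effect.
import numpy as np

y = np.array([5.0, 6.0, 9.0, 10.0])              # expression of one gene
treatment = np.array([0, 0, 1, 1], dtype=float)  # encoded treatment factor
batch = np.array([0, 1, 0, 1], dtype=float)      # encoded batch factor

X = np.column_stack([np.ones_like(y), treatment, batch])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # intercept, treatment, batch

corrected = y - batch * coef[2]                  # subtract only the batch term
```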

• Transcript counts

  • Normalized counts

  • After clicking on the chosen node:

    • Select Statistics > Differential analysis in the context-sensitive menu.

    • Select Welch's ANOVA.

    • Select a specific factor for analysis and click the Next button to setup the comparisons.

    • Define the comparisons by dragging and dropping each group in the Denominator and Numerator boxes and click Add comparison

If the data has not been filtered upstream, the Low value filter box will be checked by default. Similarly, the Default normalisation will be selected if the software detects that the data has not been previously normalised.

    Advanced option

The Advanced options configuration allows you to choose different multiple test correction and report options.

The results of the analysis will appear similar to other ANOVA/LIMMA-trend/LIMMA-voom results.
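
For reference, open-source equivalents exist outside ICM: SciPy's ttest_ind with equal_var=False is Welch's t-test for two groups, and the pingouin package provides a welch_anova function for three or more. The toy data is an assumption, and results may differ from ICM in detail.

```python
# Welch's t-test (two groups) and Welch's ANOVA (three groups) on toy data.
import pandas as pd
from scipy import stats
import pingouin as pg

a = [5.1, 4.9, 5.5]
b = [6.2, 6.8, 6.0]
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's t-test

df = pd.DataFrame({"value": a + b + [7.0, 7.3, 6.9],
                   "group": ["A"] * 3 + ["B"] * 3 + ["C"] * 3})
print(pg.welch_anova(data=df, dv="value", between="group"))
```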

The PCA task can be performed on an observations × features matrix data node, e.g. a raw counts or normalized counts data node. Select a node and click PCA in the Exploratory analysis section of the context-sensitive menu.

    Features to include in calculation

You don't have to use all the features to compute PCs for the observations, especially when the input matrix is very large, e.g. scRNA-seq data. This option allows you to choose a subset of features based on a selected statistic; the default is to use the top 2,000 features with the highest variance to compute PCs.

    Number of PCs to calculate

When the matrix is large, as with single cell data, you don't have to compute all the PCs; it depends on what you would like to do downstream with the PCA output. Choosing a smaller number of PCs will reduce the running time of this task. By default, the task outputs the top 100 PCs.

    Features contribute

equally: all the features are standardized to a mean of 0 and standard deviation of 1. This option gives all the features equal weight in the analysis; this is the default option for e.g. bulk RNA-seq data.

by variance: the analysis gives more emphasis to the features with higher variances. This is the default option for e.g. single cell RNA-seq data.
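
The two weighting options correspond to whether features are standardized before PCA, as in this minimal scikit-learn sketch on an assumed toy matrix:

```python
# "Contribute equally" (standardize) vs. "contribute by variance" (raw).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 2000)  # observations x features (toy)

equal = PCA(n_components=100).fit(StandardScaler().fit_transform(X))
by_var = PCA(n_components=100).fit(X)  # high-variance features dominate

print(equal.explained_variance_ratio_[:3])  # variance explained per PC
```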

The PCA task creates a new task node. To open it and see the result, either select the PCA task node and go to Task result in the context-sensitive menu, or double-click on the PCA task node.

When the PCA node is opened in Data viewer, by default it contains a scatter plot, a Scree plot with eigenvalues, and a Component loadings table. Each dot on the scatter plot represents an observation, with the first three PCs shown on the X-, Y-, and Z-axes respectively and the information content of each PC given in parentheses.

As an exploratory tool, the PCA scatter plot is used to view clustering patterns in the data, to generate hypotheses based on the outcome, or to spot possible outliers.

To rotate the 3D scatter plot, left-click and drag. To zoom in or out, use the mouse wheel. Click and drag the legend to move it to a different location on the viewer.

Detailed configuration of the PCA plot can be found in the Data viewer section.

In the Data viewer, when a PCA data node is selected from Get Data under Setup (left panel), the node can be dragged and dropped onto the screen; you will then have the option to select a scree plot and tables.

When you choose the Scree plot icon, a 2D viewer is plotted: the X-axis represents PCs and the Y-axis represents eigenvalues. When you mouse over a point on the line, detailed information about that PC is displayed. The scree plot shows how much variation each PC represents, so it is often used to determine the number of principal components to keep for downstream analysis (e.g. t-SNE, UMAP, graph-based clustering). The "elbow" point of the graph, where the eigenvalues seem to level off, should be considered as a cutoff point for downstream analysis.

A PCA data node can also be drawn as tables. When you choose the Table icon, the component loadings matrix is displayed in the viewer. The content can be modified using the Content configuration option; the table can be paged through here or from the lower-right corner.

In the table, each row is a feature and each column represents a PC; the values are correlation coefficients. Under Content, there is a PCA projections option; change to this option to display the projection table. In this table, each row is an observation, each column is a PC, and the values are the PC scores.

• Gene counts

  • Transcript counts

  • Normalized counts

  • After clicking on the chosen node:

    • Select Statistics > Differential analysis in the context-sensitive menu.

    • Select Kruskal-Wallis or Wilcoxon.

    • Select a specific factor for analysis and click the Next button to setup the comparisons.

    Note: Wilcoxon test will filter the data to include the observations in the two comparison groups to generate p-value, while Kruskal-Wallis will use all the samples in the input data to generate p-value on the selected attribute.

    • Define the comparisons by dragging and dropping each group in the Denominator and Numerator boxes and click Add comparison

If the data has not been filtered upstream, the Low value filter box will be checked by default. Similarly, the Default normalisation will be selected if the software detects that the data has not been previously normalised.

    Advanced option

If there are tied ranks of feature expression values, the default is to not use tie correction, which corresponds to scanpy.tl.rank_genes_groups(tie_correct=False).

    The results of the analysis will appear similar to other ANOVA/LIMMA-trend/LIMMA-voom. However, the column to indicate mean expression levels for each group will display the median instead for Kruskal-Wallis.
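
For comparison outside ICM, a minimal scanpy sketch of the equivalent call is shown below; the bundled demo dataset and its grouping column are used purely for illustration.

```python
# scanpy's Wilcoxon rank-sum test without tie correction (the ICM default).
import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()  # small demo dataset shipped with scanpy
sc.tl.rank_genes_groups(adata, groupby="bulk_labels",
                        method="wilcoxon", tie_correct=False)
print(adata.uns["rank_genes_groups"]["names"][:5])  # top-ranked genes per group
```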


• Click the QA/QC section of the toolbox

  • Click Feature distribution

A new task node is generated with the Feature distribution report.

    Feature distribution plot configuration

    The Feature distribution task report plots the distribution of all features (genes or proteins) in the input data node with one feature per row. Features are ordered by average value in descending order.

The plot can be configured using the panel on the left-hand side of the page.

    Filter

    Using the filter, you can choose which features are shown in the task report.

    The Manual filter lets you type a feature ID (such as a protein ID) and filter to matching features by clicking + . You can add multiple feature IDs to filter to multiple features.

    The List filter lets you filter to the features included in a feature list. To learn more about feature lists, please see List management.

    Plot type

    Distributions can be plotted as histograms, which is the default setting, with the x-axis being the expression value and the y-axis the frequency, or as a strip plot, where the x-axis is the expression value and the position of each cell/sample is shown as a thin vertical line, or strip, on the plot.

    To switch between plot types, use the Plot type radio buttons.

    Mousing over a dot in the histogram plot gives the range of feature values that are being binned to generate the dot and the number of cells/samples for that bin in a pop-up.

    Mousing over a strip shows the sample ID and feature value in a pop-up. If there are multiple cells/samples with the same value, only one strip will be visible for those cells/samples and the mouse-over will indicate how many cells/samples are represented by that one strip.

    Clicking a strip will highlight that cell/sample in all of the plots on the page. The grey dot in each strip plot shows the median value for that feature. To view the median value, mouse over the dot.

    Page

    To navigate between pages, use the Previous and Next buttons or type the page number in the text field and click Enter on your keyboard.

    The number of features that appear in the plot on each page is set by the Items per page drop-down menu. You can choose to show 10, 25, or 50 features per page.

    Color by

    You can add attribute information to the plots using the Color by drop-down menu.

    For histogram plots, the histograms will be split and colored by the levels of the selected attribute. You can choose any categorical attribute.

    For strip plots, the sample/cell strips will be colored by the levels or values of the selected attribute. You can choose any categorical or numeric attribute.

VST v2: Default is TRUE. When set to v2, it sets method = glmGamPoi_offset, n_cells = 2000, and exclude_poisson = TRUE, which causes the model to learn theta and intercept only, in addition to excluding Poisson genes from learning and regularization. If this option is unchecked, the original sctransform model (v1) is used and only the SC scaled data node is generated.

    There are two data nodes generated from this task (if VST v2 option is checked as default):

SC scaled data: a matrix of normalized values (residuals) that by default has the same size as the input data set. This data node is used to perform downstream exploratory analysis, e.g. PCA, Seurat3 integration, etc.; it is not recommended for differential analysis.

SC corrected data: equivalent to the ‘corrected counts’ in the data slot generated after the PrepSCTFindMarkers task in the SCT assay of a Seurat object. It is used for downstream differential expression (DE) analyses.

    References

    1. Christoph Hafemeister, Rahul Satija. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. https://doi.org/10.1101/576827

    2. SCTransform() documentation https://www.rdocumentation.org/packages/Seurat/versions/3.1.4/topics/SCTransform

    3. Saket Choudhary, Rahul Satija. Comparison and evaluation of statistical error models for scRNA-seq. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02584-9

    Categorical and numeric attributes, as well as interaction terms can be added to the DESeq2 model. The DESeq2 configuration dialog for adding attributes and interactions to the model is very similar to the ANOVA configuration dialog.

In the DESeq2 Advanced options Configure dialog, there is a reference selection option:

A reference level is specified for each categorical factor in the model, and the result may depend on the choice. In R, the reference level is typically chosen by default whenever a categorical factor is present in the model. This Connected Multiomics option was created to allow the user to specify exactly the same reference level as in an R script, if need be, e.g. to compare the results with R.

    DESeq2

    The report produced by DESeq2 is similar to the ANOVA report; each row is a feature and columns include p-value, FDR p-value, and fold change in linear scale for each contrast.

    Fold change shrinkage in DESeq2

In R, shrinkage of log2 fold changes is a separate step performed by the lfcShrink() function. Connected Multiomics implements the shrinkage method corresponding to the “ashr” option in lfcShrink(). The default shrinkage option in lfcShrink is “apeglm”, but that method is unable to produce results for some comparisons, whereas “ashr” has no restrictions. The fold change shrinkage results are reported in the “Shrunken Log2(Ratio)” and “s-value” columns of the DESeq2 task report.

    Troubleshooting

    In addition to the issues addressed in Differential Analysis, DESeq2 may generate missing values in the multiplicity adjustment columns (such as FDR) if "independent filtering" is enabled in Advanced Options:

    "Independent filtering" tries removing some features with low expression in order to increase the statistical power. For such removed features, the p-value is reported but FDR and similar multiplicity adjustment measures are set to "?". In order to avoid the missing values in the report, set the option to "No".

    References

    Love MI, Huber W, and Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 2014;15(12): 550.


    Running AUCell

    AUCell can be run on any single cell counts data node.

    • Click the single cell counts data node

    • Click the Exploratory analysis section in the toolbox

    • Click AUCell

    • Choose gene lists by clicking and dragging them to the panel on the right or clicking the plus (+) that appears after mousing over a gene list

    • Click Finish to run

    AUCell produces an AUCell result data node. The AUCell result data node includes the input counts data and adds the AUCell scores to the original data as a new data type, AUCell Values. AUCell values for each input feature list are included as features in the AUCell result data node. These features created by AUCell are named after the feature list (e.g., B cells).

    Because the AUCell values are added as features, they can be used as input for clustering, differential analysis, and visualization tasks.

To produce a data node containing only the AUCell values, choose Pre-analysis tools > Split by feature type from the pop-up menu to split the AUCell result data node into separate data nodes for each of its data types. This can be helpful if you intend to perform downstream analysis on the AUCell values only. To perform differential analysis, it is advisable to normalize the values by adding a small offset (e.g. 1E-9) and applying a Logit transformation with base Log2 using the Normalization task. This will make the values continuous and suitable for differential analysis with methods such as ANOVA/LIMMA-trend/LIMMA-voom, Non-parametric ANOVA, or Welch's ANOVA. For differential analysis, please check that the Low-value filter is set to None and that the values are correctly recognized as Log2 transformed in the Advanced settings.

    If an AUCell result data node or other downstream data node containing AUCell Values is used as the input for AUCell, the additional AUCell values will be added as additional features of the AUCell values data type in the new AUCell result data node.

    Advanced AUCell parameters

    Minimum gene set size

    For each gene set, AUCell computes the intersection between the gene list and the input data set. If the intersection size is below the specified threshold, the gene set is ignored and no AUCell score is calculated for it. Default is 5.

    Top features in percentiles

    To calculate the AUCell value, genes are ranked and the fraction of genes from the gene list that are above the percentile cutoff is used in the computation. This parameter sets the percentile cutoff. Default is 5.
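For orientation, the same two parameters appear in the underlying AUCell Bioconductor package; a minimal sketch, assuming a genes-by-cells matrix exprMat and a named list of gene sets geneSets:

```r
library(AUCell)

# Rank genes within each cell
rankings <- AUCell_buildRankings(exprMat, plotStats = FALSE)

# aucMaxRank corresponds to the "Top features in percentiles" cutoff;
# here the top 5% of each cell's ranking, matching the default above.
# Gene sets smaller than the minimum size can be dropped beforehand.
auc <- AUCell_calcAUC(geneSets, rankings,
                      aucMaxRank = ceiling(0.05 * nrow(rankings)))
```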

    References

    [1] Aibar, S., González-Blas, C. B., Moerman, T., Imrichova, H., Hulselmans, G., Rambow, F., ... & Atak, Z. K. (2017). SCENIC: single-cell regulatory network inference and clustering. Nature methods, 14(11), 1083.

    AUCell

• PCA

  • t-SNE

  • UMAP

  • Hierarchical Clustering

  • AUCell

  • Find multimodal neighbors

  • SVD

  • CellPhoneDB

  • Graph-based Clustering

  • K-means Clustering

  • Compare Clusters

The t-SNE task can run on any counts data node; however, it is very computationally intensive, so we recommend running PCA first and running this task on the PCA output data node using the top few PCs.

    • Click the counts data node or PCA data node (recommended)

    • Click the Exploratory analysis section of the toolbox

    • Click t-SNE

    • Click Finish to run

    PCs to use

Choose how many PCs to use; the default is the top 20 PCs. Using a lower number of PCs will reduce the run time.

    Split cells by sample

Choose whether to run t-SNE on all samples together or on each sample individually.

    Checking the box will run t-SNE on each sample individually.

    Advanced options configure

    Perplexity

    t-SNE preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Perplexity can be thought of as the number of nearest neighbors being considered. The optimal perplexity depends on the size and density of the data. Generally, a larger and/or more dense data set will benefit from a higher perplexity. Default is 30. The range of possible values is 3 to 100.

Number of iterations

t-SNE uses an iterative algorithm to minimize the discrepancy between the high- and low-dimensional distances among the points and optimize the low-dimensional representation. More iterations will result in a more accurate embedding, up to a point, but will take longer to run. Default is 1000.

    Random generator seed

Several parts of t-SNE utilize a random number generator to provide an initial value. Default is 1. To reproduce results, use the same random seed across all runs.

    Initialize output values at random

    If selected, t-SNE initializes from random initial positions for each point. If disabled, the initial values for each point are assigned using the largest principal components extracted from the raw data. Default is enabled.

    Distance metric

    The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.

    Learning rate

Changing the learning rate may improve the visualization. For instance, if the optimization procedure is stuck, one can try increasing the rate. However, a rate that is too high or too low will ultimately deteriorate the result.
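For comparison, the parameters above map onto the Rtsne package in R roughly as follows (a sketch, assuming pcs is a cells-by-PCs matrix from a prior PCA):

```r
library(Rtsne)

set.seed(1)                       # random generator seed
fit <- Rtsne(pcs,
             dims       = 2,      # 2D embedding
             perplexity = 30,     # effective number of nearest neighbors
             max_iter   = 1000,   # number of iterations
             eta        = 200,    # learning rate
             pca        = FALSE)  # input is already PCA-reduced
embedding <- fit$Y                # cells x 2 coordinates
```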

    t-SNE produces a t-SNE task node. Opening the task report launches a scatter plot showing the t-SNE results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.

    References

    [1] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.

  • Click Scran deconvolution

  • Select the cluster name from a drop-down list that includes all the attributes for this dataset. The selected cluster is an optional factor specifying which cells belong to which cluster, for deconvolution within clusters.

    If Cluster name is not checked, all the cells in the input data will be used as one cluster for size factor estimation.

    Click the Finish button to run the task.

    Other parameters in this task that you can adjust include:

    Pool size: A numeric vector of pool sizes, i.e., number of cells per pool.

    Max cluster size: An integer scalar specifying the maximum number of cells in each cluster.

    Enforce positive estimates: A logical scalar indicating whether linear inverse models should be used to enforce positive estimates.

    Scaling factor: A numeric scalar containing scaling factors to adjust the counts prior to computing size factors.

    The output of Scran deconvolution is a new data node that has been normalized by the pool-based size factors of each cell and log2 transformed. We can then use this new normalized matrix for downstream analysis and visualization.
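The equivalent steps in R with the scran and scater packages look roughly like this sketch (sce is a hypothetical SingleCellExperiment of raw counts, and clust an optional vector of per-cell cluster labels):

```r
library(scran)
library(scater)

# Pool-based size factors, optionally computed within clusters
sce <- computeSumFactors(sce,
                         clusters         = clust,
                         max.cluster.size = 3000,  # max cells per cluster
                         positive         = TRUE)  # enforce positive estimates

# Divide counts by the size factors and log-transform,
# mirroring the normalized output data node described above
sce <- logNormCounts(sce)
```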

    References

    1. Lun, A. T., K. Bach, and J. C. Marioni. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016.

    https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7

    scran

Selecting Include region overlapping variants will bring up a dialog to include variants located within genomic regions of interest, such as transcript models or amplicons. If variant detection was performed in Connected Multiomics, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Connected Multiomics, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added will be available for selection or New assembly… can be utilized to import the reference sequence from within the task. The Annotation model section allows the use of any annotation model in the drop-down menu, or one can be imported from within the task by selecting Add annotation model. If an annotation that contains gene-level information is selected, this filter will include both intronic and exonic regions.

    If the filter is invoked from an Annotated variants data node, the Variant Novelty section can be utilized to filter known variants as identified in a variant database used for annotation. Selecting Known only, Novel only, or All will include only these types of variants in the resulting filtered variants.

    Variants annotated with a transcript model will include a filter for Variant Type. For variants in coding regions, Mutation type allows for the inclusion of Synonymous, Missense, and/or Nonsense variants when selecting the appropriate type. For variants located outside of coding regions, Feature section allows for the inclusion of 5-prime splice site (Splice-5), 3-prime splice site (Splice-3), Non-coding RNA, 5-prime UTR, 3-prime UTR, Intron, Promoter, and/or Intergenic variants by selecting the appropriate type.

When the filter by field option is checked, all of the available fields are displayed in the drop-down list. The list of fields will differ between data nodes, depending on the variant detection algorithm, annotation database, etc.

For instance, the VarQual field is a metric generated during variant detection, and its values will be dependent upon the method utilized for variant detection.

Fields can be searched from the drop-down list; when you mouse over a field, a description of the field will be displayed.

Decisions on quality filtering parameters should be based upon the sequencing assay design as well as the goal of the study, whether that is identification of all potential variants or identification of high-confidence variants. At the very least, the use of Minimum read depth should be considered for filtering to ensure sufficient read evidence was available to call a variant. In instances where paired variant detection was performed in SAMtools, Minimum genotype log ratio may be employed to ensure sufficient evidence of genotype differences in case and control sample pairs. Please refer to the Samtools, FreeBayes, and LoFreq documentation for further details on any of these parameters.

• Click on 5-base Methylation data node.
  • Select Region analysis from context-sensitive menu, select Get Regional Methylation.

  • Select an annotation model from Assembly and Annotation model drop-down, then click Finish.

Using a custom regions file

    When you have a list of regions of interest, you may upload a custom region bed file for the Get Regional Methylation task. The custom regions file should:

• Consist of 3 columns: chromosome, start position, and stop position. The positions should be 0-based coordinates.

    • Have no column names (no header row).

    • Be tab-delimited.

    • Be saved as a .bed file.
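For example, a minimal custom regions file with three hypothetical regions would look like this (columns separated by tabs):

```
chr1	10468	11240
chr1	28735	29810
chr2	45120	46000
```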

After the custom region file is prepared, follow these steps to apply it in the Get Regional Methylation task:

    • Click on 5-base Methylation data node.

    • Select Region analysis from context-sensitive menu, select Get Regional Methylation.

    • Select an Assembly, and select Add Annotation model from Annotation model drop-down.

• In the Add annotation file dialog, select Add Annotation model from the Annotation model drop-down, type in a name for the custom annotation, select Import annotation file, then click Create.

    • At Select file page, select My Computer.

    • Click + Choose button to choose a .bed file from your computer.

    • The selected file name will be displayed, click Next.

    • Select Other for Annotation type, click Next.

    • Click Finish to run Get Regional Methylation based on the selected custom regions file.

If variant detection was performed in Connected Multiomics, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Connected Multiomics, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added will be available for selection or New assembly… can be utilized to import the reference sequence from within the task.

Selecting Annotate with genomic feature provides the means to add gene/feature information to the variants. This typically takes the form of overlaying a transcript model (such as Ensembl). Annotation models previously added will be available for selection or Add annotation model in the drop-down list can be utilized to import an annotation model within the task. Promoter upstream limit and Promoter downstream limit provide a means to set the number of bases flanking the transcription start site; this region will be considered the promoter of a feature.

    Selecting Annotate with known variants will provide the ability to specify a Variant annotation database. Known variant databases in vcf format, such as dbSNP [1] and 1000 Genomes [2] for human variants, can be used in the task. Additional databases not provided for automated download in Connected Multiomics, such as the Catalogue of Somatic Mutations in Cancer (COSMIC) [3], can be obtained and employed by the user. Variant databases previously added will be available for selection or Add variant database in the menu can be utilized to import the variant database to library files from within the task.

    References

    1. Sherry ST. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29(1):308-311. doi:10.1093/nar/29.1.308

    2. Auton A, Abecasis GR, Altshuler DM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68-74. doi:10.1038/nature15393.

    3. Forbes SA, Bhamra G, Bamford S, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). In: Haines JL, Korf BR, Morton CC, Seidman CE, Seidman JG, Smith DR, eds. Current Protocols in Human Genetics. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2008. http://doi.wiley.com/10.1002/0471142905.hg1011s57.

The task dialog for Annotate variants (VEP) contains two sections: Select Variant Effect Predictor database and Advanced options. The Select Variant Effect Predictor database section specifies the reference assembly that was utilized for variant detection. If the variant detection was performed in Connected Multiomics, the Assembly will be displayed as text in the section. Upon initial task usage, click the Create variant effect predictor database button to import a database. The VEP database for hg19 is available for automated download in Connected Multiomics, and information regarding obtaining additional databases for other species and genome assemblies can be found in the VEP documentation.

    Advanced options provides a means to specify aspects of the annotation generated from the VEP annotation task. Upon invoking the task dialog, Option set is set to Default. Clicking Configure will open a window to specify additional components of annotation. VEP has Advanced options for Identifiers, Output options, and Co-located variants. Moving the mouse cursor over the info button will provide details for each parameter.

The report contains variant impact information, which is a subjective classification of the severity of the variant consequence:

    • Low: a variant that is assumed to be mostly harmless or unlikely to change protein behavior.

    • Moderate: a non-disruptive variant that might change protein effectiveness.

    • Modifier: usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact.

    • High: a variant is assumed to have high disruptive impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay.

    References

    1. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biology Jun 6;17(1):122. (2016) doi:10.1186/s13059-016-0974-4

    2. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003 Jul 1;31(13):3812-4. doi: 10.1093/nar/gkg509.

    3. Choi Y, Sims GE, Murphy S, Miller JR, Chan AP. Predicting the functional effect of amino acid substitutions and indels. PLoS One. 2012;7(10):e46688. doi: 10.1371/journal.pone.0046688.

Annotate Variants

    The first drop-down menu allows you to choose whether to include or exclude samples based on the specified criteria.

    The second drop-down menu allows you to choose any categorical or numeric attribute to use for the filter criteria.

    If the attribute is categorical, the third drop-down menu includes in and not in as options. A fourth drop-down menu allows you to search and choose from the levels of the selected attribute.

If the attribute is numeric, the third drop-down menu includes:

    • <: less than

    • <=: less than or equal to

• ==: equal to

    • >: greater than

    • >=: greater than or equal to

    The threshold is set using the text box. The input must be a number; it can be an integer or decimal, positive or negative.

    Using the OR and AND options, you can combine multiple filters.

    When combining multiple filters all set to Include:

With AND, all statements must be true for the sample to meet the filter criteria.

    With OR, if any statement is true, the sample will meet the filter criteria.

    When combining multiple filters all set to Exclude:

    With AND, if any statement is true, the sample will meet the filter criteria.

    With OR, all statements must be true for the sample to meet the filter criteria.

    Filter by features

You can use a feature's expression values to generate a subset of the data. For instance, you may want to include all the samples/cells whose GAPDH gene expression value is greater than or equal to 5:

    You can search for a feature by typing in the search box in the 2nd drop-down list.

The output of the task will contain a data node with the same features as the input data but only the observations that meet the filter criteria.

    Data Inputs

Data to be used as input for Connected Multiomics can be:

    • Generated from a DRAGEN pipeline

    • Added from ICA project

    • Uploaded from local drive

    Use data from DRAGEN pipeline

    Connected Multiomics uses data from DRAGEN pipelines within the Illumina Connected ecosystem.

    When kicking off DRAGEN pipelines in BaseSpace, ensure the same workgroup is used in BaseSpace as Connected Multiomics so that the DRAGEN outputs are visible to Connected Multiomics. To change the workgroup on BaseSpace, select the desired workgroup in the top right-hand corner.

If DRAGEN pipelines are kicked off in ICA or in BaseSpace using Personal, these results can be shared with the workgroup by adjusting the permission settings as indicated in the section below.

    Add data from ICA project

Connected Multiomics inputs are the results of secondary analysis pipelines that are stored in projects in Illumina Connected Analytics (ICA). Secondary analysis results may be generated from DRAGEN pipelines, run manually, or come from other sources such as legacy pipelines or commercial applications.

To access ICA in your software, select the Illumina Connected Analytics application tile from your Product Dashboard. For guidance on uploading and managing data in ICA, please refer to the ICA instructions. For the data types and file formats supported for analysis in Connected Multiomics, please refer to the documentation.

To ensure users can proceed smoothly with their data exploration in Connected Multiomics, the workgroup that is in use in Connected Multiomics needs to be added to the ICA Project(s) in the project settings, with the following permissions set:

    • Contributor or higher

    • permissions

    • permissions

    After data files are uploaded and workgroup is added, you may click on the Product Dashboard icon to navigate to Illumina Connected Multiomics to start working on your data.

    Add data from local drive

    In a Connected Multiomics study, click on Add Data and choose Upload local files

    There are two options:

• Upload files and add to study: the wizard will guide you through creating samples in a study after the files are uploaded

    • Upload files only: files will be stored in an ICA project. If you want to create samples from those files later, you need to choose "Select from ICA project" when you create samples

• When choosing the first option, the data type needs to be specified

    • From the data type drop-down list, specify the technology and assay from the menu; the supported file types will be displayed

    Select file format and click Continue

The files will be uploaded to ICA. Specify an ICA project folder, then drag & drop files from your local device or click on the Browse Files button to select files, and click Upload to ICA.

    A progress bar will be displayed during the upload; once it is done, the status is displayed as Uploaded.

    Files will be uploaded to the ICA project's folder, under an "uploads" subfolder and a subfolder named after the ICM project.

    Click on Ingest to Study to generate samples in the study.

    Transfer data from public BaseSpace

If the data is in BaseSpace in the public domain, see the instructions on how to transfer data to an enterprise domain.

    5-base Methylation QC

The 5-base methylation QC task in Connected Multiomics enables you to visualize sample-level QC metrics that describe read mapping quality and CpG methylation calling. The QC metrics are extracted from the DRAGEN analysis metric files that were ingested into the study as required files for 5-base DNA Prep data analysis in Connected Multiomics. To invoke the 5-base methylation QC task:

    • At Analyses page, click on the 5-base Methylation node.

    • Click QA/QC section in the context-sensitive task menu on the right.

    • Click 5-base methylation QC.

No parameter settings are required for the 5-base methylation QC task. After clicking on the 5-base methylation QC task in the context-sensitive task menu, a task node called 5-base methylation QC report is initiated. When completed, double-click on the 5-base methylation QC report task node to open the QC report in a data viewer. The QC report consists of plots and tables organized in 2 sheets. Click on a sheet name at the bottom of the data viewer to navigate from one sheet to another.

    Metrics

The Metrics sheet shows sample-level QC metrics plots. Each sample is a data point; samples are randomly spread out on the x-axis, and the QC metric is represented by the y-axis. Each plot is overlaid with a violin plot to show the distribution of the QC metric.

    • Percent methylation in samples: Percentages of CpG methylation in samples.

    • Percent methylation in unmethylated control: Percentage of CpG methylation in the unmethylated control (lambda). Low value indicates good quality.

    • Percent methylation in methylated control: Percentage of CpG methylation in the methylated control (pUC19). High value indicates good quality.

• Percent duplicate reads: Percentage of duplicate marked reads, as a result of PCR amplification.

    • Percent mapped reads: Percentage of mapped reads, indicating the alignment rate.

    • Average autosomal coverage: Mean autosomal coverage across the whole genome. Higher coverage indicates the counts of methylated/unmethylated reads more accurately reflect the true methylation level at any particular site.

    • QC metrics table: Text representation of the QC metrics plots.

In the Metrics sheet, samples can be selected using Selection > Select & Filter. The Select & Filter dialog is pre-loaded with the selection criteria, one for each QC metric.

    Hovering the mouse over one of the selection criteria reveals a histogram showing you the frequency distribution of the respective QC metric. The minimum and maximum thresholds can be adjusted by clicking and dragging the sliders or by typing directly into the text boxes for each selection criteria.

Adjusting the selection criteria will select and deselect samples in all 6 plots simultaneously. Depending on your settings, the deselected points will either be dimmed or gray. The filters are additive; combining multiple filters will include the intersection of the filters. The number of samples selected is shown in the figure legend of each plot.

To filter the dataset to the selected samples, click the include selected points icon in Filter in the top right of Select & Filter, and click Apply observation filter...

    Select the input data node for the filtering task and click Select.

    A new data node, Filtered samples, will be generated under the Analyses tab.

    M-bias

Sheet M-bias shows M-bias plots for methylation level and coverage across positions on read 1 and read 2. The M-bias should be consistent across all positions. It is common for the first/last 10 bases to have uneven methylation due to end-repair and sequencing artifacts.

All plots in one data viewer screen can be downloaded to your local computer as a single image by clicking the Export button at the top of the screen. To download an individual plot, select the plot, click the Plot button from the left panel within the plot, then click Export and follow the wizard to set the image file format, size, and resolution.

    K-means clustering

    What is K-means clustering?

    K-means clustering is a method for identifying groups of similar observations, i.e. cells or samples. K-means clustering aims to group observations into a pre-determined number of clusters (k) so that each observation belongs to the cluster with the nearest mean. An important aspect of K-means clustering is that it expects clusters to be of similar size (equal variance) and shape (distribution of variance is spherical). The Compare Clusters task can also be used to help determine the optimal number of K-means clusters.

    Running K-means clustering

    We recommend normalizing your data prior to running K-means clustering, but the task will run on any counts data node.

    • Click the counts data node

    • Click the Exploratory analysis section of the toolbox

    • Click K-means clustering

    • Configure the parameters

    Distance metric

    Choose which distance metric to use for cluster distance calculations. Options include Euclidean, Absolute Value, Euclidean Squared, Kendall Correlation, Max Value, Min Value, Pearson Correlation, Rank Correlation, Average Euclidean, Shape, Cosine, Canberra, Bray Curtis, Tanimoto, Pearson Correlation Absolute, Rank Correlation Absolute, and Kendall Correlation Absolute. The default is Euclidean.

    Number of clusters

Choose between specifying a set number of clusters or a range to test for the best fit number of clusters. The best fit is determined by the number of clusters with the lowest Davies–Bouldin index. The default is set to 10 for a fixed number of clusters. The initial values for the range option are 3 to 20 clusters. When this option is selected, compare clusters will be performed.

    Compute biomarkers

The Compute biomarkers task will be performed after this task runs if the checkbox is selected. The attribute used in compute biomarkers is the k-means cluster annotation.

    Split cells by sample

    This option is present in single cell data. If enabled, K-means clustering will be run separately for each sample. If disabled, K-means clustering will be run on all cells from the input data. Default is set by the Split single cell by sample option in the user preference page.

    K-means clustering produces a K-means Clusters result data node; double-click to open the task report which lists the cluster statistics. If clustering was run with Split by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.

    The total number of clusters is listed along with the number and percentage of cells in each cluster.

    The K-means Clustering result data node includes the input values and adds cluster assignment as a new attribute, K-means, for each observation.

    Advanced K-means clustering parameters

    Random cluster initialization

    If enabled, the initial cluster centroids will be selected randomly from among the data points. If disabled, the initial cluster centroids will be selected to optimize distance between clusters. Default is Disabled.

    Random seed

    This sets the random seed used if Random cluster initialization is enabled. Use the same random seed to reproduce results.

    Batch centroid computations

    If enabled, all cluster centroids will be recomputed at the end of each iteration. If disabled, each cluster centroid will be recomputed as the members of the cluster change. Default is Enabled.

    Max iterations

The maximum number of iterations to perform before settling on a set of clusters. Default is 1000.
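As a rough illustration of these settings, base R's kmeans() exposes similar controls (note that kmeans() supports only Euclidean distance, unlike the task above; mat is a hypothetical observations-by-features matrix):

```r
set.seed(1)                      # reproducible initialization
fit <- kmeans(mat,
              centers  = 10,     # number of clusters (k)
              iter.max = 1000)   # maximum iterations
table(fit$cluster)               # cluster sizes
```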

    Monocle 2

    In Connected Multiomics, we use tools from Monocle 2 [1] to build trajectories, identify states and branch points, and calculate pseudotime values. The output of Trajectory analysis includes an interactive scatter plot visualization for viewing the trajectory and setting the root state (starting point of the trajectory) and adds a categorical cell level attribute, State. From the Trajectory analysis task report, you can run a second task, Calculate pseudotime, which adds a numeric cell-level attribute, Pseudotime, calculated using the chosen root state. Using the state and pseudotime attributes, you can perform downstream analysis to identify genes that change over pseudotime and characterize branch points.

    Prerequisites for trajectory analysis

    Note that trajectory analysis will only work on data with <600,000,000 elements in the matrix (number of cells × number of features). If your data set exceeds this limit, the Trajectory analysis task will not appear in the toolbox. Prior to performing trajectory analysis, you should:

    1) Normalize the data

Trajectory analysis requires normalized counts as the input data. We recommend our default "CPM, Add 1, Log 2" normalization for most scRNA-Seq data. For alternative normalization methods, see our Normalization documentation.

    2) Filter to cells that belong in the same trajectory

Trajectory analysis will build a single branching trajectory for all input cells. Consequently, only cells that share the biological process being studied should be included. For example, a trajectory describing progression through T cell activation should not include monocytes that do not undergo T cell activation. To learn more about filtering, please see our Filter groups (samples or cells) documentation.

    3) Filter to genes that characterize the trajectory

The trajectory should be built using a set of genes that increase or decrease as a function of progression through the biological process being modeled. One example is using differentially expressed genes between cells collected at the beginning of the process and cells collected at the end of the process. If you have no prior knowledge about the process being studied, you can try identifying genes that are differentially expressed between clusters of cells or genes that are highly variable within the data set. Generally, you should try to filter to 1,000 to 3,000 informative genes prior to performing trajectory analysis. The list manager functionality is useful for creating a list of genes to use in the filter. To learn more, please see our documentation on Lists.

    Parameters

    Dimensionality of the reduced space

    While the trajectory is always visualized in a 2D scatter plot, the underlying structure of the trajectory may be more complex and better represented by more than two dimensions.

    Scaling

    You can choose to scale the genes prior to building the trajectory. Scaling removes any differences in variability between genes, while not scaling allows more variable genes to have a greater weight in building the trajectory.

    Task report

Click on the task report; a 2D scatter plot will open in the Data Viewer.

    The trajectory is shown as a black line. Branch points are indicated by numbers in black circles. By default, cells are colored by state. You can use the control panel on the left to color, size, and shape by genes and attributes to help identify which state is the root of the trajectory.

    Calculating pseudotime

    To calculate pseudotime, you must choose a root state. The tip of the root state branch will have a value of 0 for pseudotime. Click any cell belonging to that state to select the state. The selected state will be highlighted while unselected cells are dimmed. Choose Calculate pseudotime in the Additional actions on the left control panel.

The Calculate pseudotime task will be performed; it generates a new Pseudotime result data node, which contains a Pseudotime annotation for each cell. Open the Pseudotime result report and a 2D scatter plot will be displayed in the data viewer, colored by Pseudotime by default.

    References

    [1] Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah Pliner, and Cole Trapnell. Reversed graph embedding resolves complex single-cell developmental trajectories. Nature methods, 2017.

    CellPhoneDB

    CellPhoneDB addresses the challenges of studying cell-cell communication in scRNA-seq and spatial data. It allows researchers to move beyond just measuring gene expression and delve into the complicated cellular communication world. By analyzing the scRNA-seq or spatial data through the lens of CellPhoneDB, researchers can identify potential signaling pathways and communication networks between different cell types within the sample. Connected Multiomics wrapped the statistical analysis pipeline (method 2) from CellPhoneDB v5 [1][2] for this purpose.

    How to use CellPhoneDB

    Invoke the CellPhoneDB task from a normalized counts data node using the Exploratory analysis section. We recommend running CellPhoneDB on the log normalized data directly.

    To run CellPhoneDB task,

    • Click a Normalized counts data node

    • Click the Exploratory analysis section in the toolbox

    • Click CellPhoneDB

    Species

    Currently only human, mouse and rat are supported. Select the species of the data from the drop-down list.

    Cell type

    Select the attribute from the drop down. Any categorical attribute associated with the data can be selected, though typically the task is performed on cell typing results.

    Micro environment

The micro environment file is typically used when analysing spatial data (see the tool's documentation), and it is optional. The micro environment information can be added to the task using the text box in the task. This can simply be copy-pasted from a micro environment file, a .txt with two columns indicating which cell type (1st column) is in which spatial microenvironment (2nd column).
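For illustration, a hypothetical micro environment file could look like the following (cell type and environment names are placeholders; consult the CellPhoneDB documentation for exact requirements):

```
cell_type	microenvironment
Tcell	Env1
Macrophage	Env1
Fibroblast	Env2
```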

    P-value

    Specify the p-value cutoff to employ for significance

    Threshold

By default, the value of 0.10 will be used as the threshold to select which cells are used for the analysis in each cluster. The number can be adjusted manually or typed in directly.

Click the Finish button if you want to run the task with default settings.

Double clicking the CellPhoneDB result data node will open the task report in the Data Viewer. It is a heatmap that summarizes how many significant interactions were identified in each cell type pair.

To explore further, run the Explore CellPhoneDB results task under Exploratory analysis in the pop-up menu. It allows you to focus on specific cell type pairs and genes of interest. Genes of interest are data dependent and usually come from the published results of similar studies or from differential gene analysis between different conditions (e.g., cancer patients vs healthy controls). Once set up, click the Finish button to submit the job.

Double clicking the Output matrix data node will open the task report in the Data Viewer. It is another variant of heatmap that displays how the genes of your interest interact in the defined cell type pairs.

The example plot also indicates the data are from two microenvironments; see the Micro environment section above for instructions on setting up the micro environment file for your spatial study. CellPhoneDB analysis classifies signaling pathways for genes of interest. These classifications are then used to annotate the heatmap within the task report.

    Why are the values of clusterA-clusterB different to the values of clusterB-clusterA?

    It is important to note that the interactions are not symmetric. The authors state that, "Partner A expression is considered for the first cluster/cell type (clusterA), and partner B expression is considered on the second cluster/cell type (clusterB). Thus, IL12-IL12 receptor for clusterA-clusterB (i.e. the receptor is in clusterB) is not the same as IL-12-IL-12 receptor for clusterB-clusterA (i.e. the receptor is in clusterA), and will have different values." [3][4]

    Where do the interactions come from?

The interactions come from the CellPhoneDB database, a manually curated repository of reviewed molecular interactions with demonstrated evidence for a role in cellular communication. [5]

    References

1. Troule, et al. (2023). CellPhoneDB v5: Inferring cell-cell communication from single cell multiomics data. https://arxiv.org/pdf/2311.04567.pdf

    2. CellPhoneDB repository: https://github.com/ventolab/CellphoneDB

    3. CellPhoneDB results documentation: https://github.com/ventolab/CellphoneDB/blob/master/docs/RESULTS-DOCUMENTATION.md

    4. https://cellphonedb.readthedocs.io/en/latest/RESULTS-DOCUMENTATION.html#why-values-of-clustera-clusterb-are-different-to-the-values-of-clusterb-clustera

    5. CellPhoneDB database: https://github.com/ventolab/cellphonedb-data

    Filter features

A common task in bulk and single-cell analysis is to filter the data to include only informative features before downstream analysis. Feature here means a measurement per observation, such as gene expression or protein expression.

    There is no gold standard for what makes a feature informative, and ideal feature filtering criteria depend on your experimental design and research question, so Connected Multiomics offers a wide variety of flexible filtering options.

    The Filter features task can be invoked from any counts or single cell data node. The filter is applied to the values in the selected data node and the output is a filtered version of the input data node.

    In the task dialog, select the filter option to activate the filter type and configure the filter, then click Finish to run.

    Noise reduction filter

    The Noise reduction filter lets you exclude features that meet specified descriptive statistics criteria.

Descriptive statistics you can choose are:

    • Geometric mean

    • Maximum

    • Mean

• Median

    • Minimum

    • Sum

    • Variance

    • Value

    For each of these you can choose to exclude features that are:

    • <: less than

    • <=: less than or equal to

• ==: equal to

    • >: greater than

    • >=: greater than or equal to

    The threshold is set using the text box. The input must be a number; it can be an integer or decimal, positive or negative.

If you select Value, you can also choose a percentage of samples or cells that must meet the criteria for the feature to be excluded.

    Statistics based filter

The Statistics based filter lets you include a number or percentile of features based on descriptive statistics.

    Select Counts to specify a number of top features to include or select Percentiles to specify the top percentile of features to include.

    Descriptive statistics you can choose are:

    • Coefficient of variance

    • Geometric mean

    • Maximum

• Mean

    • Median

    • Minimum

    • Range

    • Standard deviation (std dev)

    • Sum

    • Variance

    • Dispersion

    Feature metadata filter

If the data is linked to feature (gene) annotation, different fields in the annotation can be used for filtering, e.g. genomic location, gene biotype, etc.

    You can combine different annotation fields using "AND" and "OR" logical operations.

    Saved list filter

You can filter features based on saved feature lists.

    You should have a saved list in the List management page when you use this option. Choose a feature list name from the drop-down menu, which displays all the feature lists added in List management.

    You can choose to include or exclude features in any list that you have added.

    Use the Feature identifier option to choose which identifier from your annotation matches the values in the feature list.

    Manual list filter

This option allows you to type in or paste a list of features; the delimiter should be commas or new lines between the feature names.

This task will generate a filtered counts data node which contains all the samples/cells from the input data but only the features that meet the criteria.

    Publish cell attributes to project

This task is only available on single cell matrix data nodes. It publishes one or more cell-level attributes to the project level, so the attributes can be edited and seen from all single cell count data nodes within an analysis. This function can be used on the annotate cells task output data node, the graph-based clusters data node, etc.

    Running publish cell attributes to project

• Click on a graph-based clusters data node which contains the graph-based clustering result.

    • Choose Publish cell attributes to project in the Annotation/Metadata section of the toolbox

This invokes the configuration dialog.

    From the drop-down list, select one or more attributes to publish. Only numeric attributes and categorical attributes with fewer than 1000 levels will be available in the list.

    After selecting an attribute, e.g. Graph-based, from the list, click the plus button to add it; you can rename the attribute by typing in the New name box.

    Click Finish at the bottom of the page. All of the attributes will be available to edit in the Data tab > Cell attributes > Manage. All data nodes in the project will be able to use those attributes.

    Single-cell QA/QC

    The Single-cell QA/QC task in Connected Multiomics enables you to visualize several useful metrics that will help you include only high-quality cells. To invoke the Single-cell QA/QC task:

    • Click a Single cell counts data node

    • Click the QA/QC section of the task menu

    • Click Single cell QA/QC

    Hurdle model

    Hurdle model is a statistical test for differential analysis that utilizes a two-part model, a discrete (logistic) part for modeling zero vs. non-zero counts and a continuous (log-normal) part for modeling the distribution of non-zero counts. In RNA-Seq data, this can be thought of as the discrete part modeling whether or not the gene is expressed and the continuous part modeling how much it is expressed if it is expressed. Hurdle model is well suited to data sets where features have very many zero values, such as single cell RNA-Seq data.

    On default settings, Hurdle model is equivalent to MAST, a published differential analysis tool designed for single cell RNA-Seq data that uses a hurdle model [1].
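Schematically, for a single gene the two parts can be written as follows (a sketch of the MAST-style formulation, where Z_i indicates whether the gene is detected in cell i, Y_i is its log2 expression, and X_i is the design row for cell i):

```latex
\operatorname{logit}\Pr(Z_i = 1) = X_i \beta_D \quad \text{(discrete part)}
\qquad
Y_i \mid Z_i = 1 \;\sim\; \mathcal{N}(X_i \beta_C,\ \sigma^2) \quad \text{(continuous part)}
```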

    Running Hurdle model

We recommend normalizing your data prior to running Hurdle model, but it can be invoked on any counts data node.

    Differential methylation

Differential methylation is used to detect differentially methylated CpG loci (DML) or regions (DMR) between two conditions. The method is based on the Bioconductor package DSS (Dispersion Shrinkage for Sequencing data) and is a count-based test. Detailed implementation can be found here.

The current DSS implementation in Connected Multiomics supports DML/DMR detection from two-group comparisons. DML/DMR detection for general experimental designs and for experiments without replicates is not supported.

    This task can be invoked from the imported 5-base Methylation data node, which contains total read count and methylated read count for each CpG site.

    Click on 5-base Methylation data node, choose Statistics > Differential Methylation

    Graph-based clustering

    What is Graph-based clustering?

    Graph-based clustering is a method for identifying groups of similar cells or samples. It makes no prior assumptions about the clusters in the data. This means the number, size, density, and shape of clusters does not need to be known or assumed prior to clustering. Consequently, graph-based clustering is useful for identifying clustering in complex data sets such as scRNA-seq.

    Summarize Cohort Mutations

    Variant information is stored on a per sample basis, but it can be informative to view variants in the context of recurrent variants identified within the analysis's sample cohort to identify both the frequency of variants and the samples that share a particular variant. The Summarize cohort mutations task can be invoked from any Variants or Annotated variants data node to generate a report of shared variants identified from detection against a reference sequence or among paired samples.

    Summarize cohort mutations dialog

In the Summarize cohort mutations task, the user needs to specify the Minimum coverage for genotype calls. In general, it is likely that if a variant is not called in a sample at a particular locus then the sample has a homozygous reference genotype. Yet this may not always be the case, as factors such as insufficient depth or low-quality bases at that locus may prevent the variant caller from identifying any genotype there. As such, setting a minimum coverage will make the assumption that the sample contains a homozygous reference genotype if the depth requirement is met. This is done for the purpose of generating genotype calls for all samples (even reference homozygotes) at all variant loci within the project.

    Click Finish to run

By default, all samples are used to perform QA/QC. You can choose Split by sample in the Grouping option to perform QA/QC separately for each sample.

    You will be prompted to choose the genome assembly and annotation file by the Single cell QA/QC configuration dialog.

    Note, it is still possible to run the task without specifying an annotation file. If you choose not to specify an annotation file, the detection of mitochondrial counts will not be possible. The annotation file should match the same annotation file used in the upstream analysis.

The Single cell QA/QC task report opens in a new data viewer session with four dot-and-violin plots showing the value of every cell on the canvas: counts per cell, detected features per cell, the percentage of mitochondrial counts per cell (when the annotation file contains genes on the MT chromosome), and the percentage of ribosomal counts per cell (human and mouse only).

    If your cells do not express any mitochondrial genes or an appropriate annotation file was not specified, the plot for the percentage of mitochondrial counts per cell will be non-informative.

    Mitochondrial genes are defined as genes located on a mitochondrial chromosome in the gene annotation file. The mitochondrial chromosome is identified in the gene annotation file by having "M" or "MT" in its chromosome name. If the gene annotation file does not follow this naming convention for the mitochondrial chromosome, Connected Multiomics will not be able to identify any mitochondrial genes.

    Ribosomal genes are defined as genes that code for proteins in the large and small ribosomal subunits. Ribosomal genes are identified by searching their gene symbol against a list of 89 L & S ribosomal genes taken from HGNC. The search is case-insensitive and includes all known gene name aliases from HGNC. Identifying ribosomal genes is performed independent of the gene annotation file specified.

    Total counts are calculated as the sum of the counts for all features in each cell from the input data node. The number of detected features is calculated as the number of features in each cell with greater than zero counts. The percentage of mitochondrial counts is calculated as the sum of counts for known mitochondrial genes divided by the sum of counts for all features and multiplied by 100. The percentage of ribosomal counts are calculated as the sum of counts for known ribosomal genes divided by the sum of counts for all features and multiplied by 100.
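These calculations are straightforward to express; a minimal sketch in R, assuming a features-by-cells matrix counts and hypothetical character vectors mito_genes and ribo_genes:

```r
total_counts      <- colSums(counts)       # counts per cell
detected_features <- colSums(counts > 0)   # features with nonzero counts

# Percentages of mitochondrial and ribosomal counts per cell
pct_mito <- 100 * colSums(counts[rownames(counts) %in% mito_genes, ,
                                 drop = FALSE]) / total_counts
pct_ribo <- 100 * colSums(counts[rownames(counts) %in% ribo_genes, ,
                                 drop = FALSE]) / total_counts
```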

    Each point on the plots is a cell. All cells from all samples are shown on the plots. The overlaid violins illustrate the distribution of cell values for the y-axis metric.

    The appearance of a plot can be configured by selecting a plot and adjusting the Configure settings in the panel on the left. Here are some suggestions, but feel free to explore the other options available:

    • Open Axes and change the Y-axis scale to Logarithmic. This can be helpful to view the range of values better, although it is usually better to keep the Ribosomal counts plot in linear scale.

    • Within Style switch on Summary Box & Whiskers. Inspecting the median, Q1, Q3, upper 90%, and lower 10% quantiles of the distributions can be helpful in deciding appropriate thresholds.

    High-quality cells can be selected using Select & Filter, which is pre-loaded with the selection criteria, one for each quality metric.

    Hovering the mouse over one of the selection criteria reveals a histogram showing you the frequency distribution of the respective quality metric. The minimum and maximum thresholds can be adjusted by clicking and dragging the sliders or by typing directly into the text boxes for each selection criteria.

    Alternatively, Pin histogram to view all of the distributions at one time to determine thresholds with ease.

Adjusting the selection criteria will select and deselect cells in all plots simultaneously. Depending on your settings, the deselected points will either be dimmed or gray. The filters are additive; combining multiple filters will include the intersection of the filters. The number of cells selected is shown in the figure legend of each plot.

    To filter the high-quality cells, click the include selected cells icon in Filter in the top right of Select & Filter, and click Apply observation filter...

    Select the input data node for the filtering task and click Select.

    A new data node, Filtered counts, will be generated under the Analyses tab.

    Double click the Filtered counts data node to view the task report. The report includes a summary of the count distribution across all features for each sample; a detailed breakdown of the number of cells included in the filter for each sample; and the minimum and maximum values for each quality metric (expressed genes, total counts, etc) across the included cells for each sample.

  • Click the counts data node

  • Click the Differential analysis section in the toolbox

  • Click Hurdle model

  • Select the factors and interactions to include in the statistical test

  • Numeric and categorical attributes can be added as factors. To add attributes as factors, check the attribute check boxes and click Add factors. To add interactions between attributes, select at least two attributes by clicking check boxes and click Add interaction.

    • Click Next

    • Define comparisons between factor or interaction levels

    Adding comparisons in Hurdle model uses the same interface as ANOVA/LIMMA-trend/LIMMA-voom. Start by choosing a factor or interaction from the Factor drop-down list. The levels of the factor or interaction will appear in the left-hand panel. Select levels in the panel on the left and click the > arrow buttons to add them to the top or bottom panels on the right. The control level(s) should be added to the bottom box and the experimental level(s) should be added to the top box. Click Add comparison to add the comparison to the Comparisons table. Only comparisons in the Comparisons table will be included in the statistical test.

    • Click Finish to run the statistical test

Hurdle model produces a Feature list task node. The results table and options are the same as ANOVA/LIMMA-trend/LIMMA-voom. The percentage of cells where the feature is detected (value is above the background threshold) in different groups (Pct(group1), Pct(group2)) is calculated and included in the Hurdle model report.

    Hurdle model advanced options

    Multiple test correction

    Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default. If you check the Storey q-value, an extra column with q-values will be added to the report.

    Use only reliable estimation results

There are situations when a model estimation procedure does not fail outright, but still encounters some difficulties. In this case, it can even generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.

    Data has been transformed with log base

    Shows the current scale of the input data for this task

    Background expression level

    Set the threshold for a feature to be considered expressed for the two-part hurdle model. If the feature value is greater than the specified value, it is considered expressed. If the upstream data node contains log-transformed values, be sure to specify the value on the same log scale. Default value is 0.

    Shrinkage of error term variance

Applies shrinkage to the error variance in the continuous (log-normal) part of the hurdle model. The error term variance will be shrunk towards a common value and a shrinkage plot will be produced on the task report page if enabled. Default is Enabled.

    Shrinkage of regression coefficients

    Applies shrinkage to the regression coefficients in the discrete (logistic) part of the hurdle model. The initial versions of MAST contained a bug that was fixed in its R source in March 2020. However, for the sake of reproducibility the fix was released only on a topic branch in MAST Github [2] and the default version of MAST remained as is. To install the fixed version of MAST in R, run the following R script.
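The original script is not reproduced here; a minimal equivalent using the remotes package and the topic branch referenced below would be:

```r
install.packages("remotes")  # if not already installed
remotes::install_github("RGLab/MAST", ref = "fix/bayesglm")
```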

    In Connected Multiomics, the user can switch between the fixed and default version by selecting Fixed version or Default version, respectively. To disable the shrinkage altogether, choose Disabled.

    References

    [1] Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., ... & Linsley, P. S. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome biology, 16(1), 278.

    [2] MAST topic branch that contains the regression coefficient shrinkage fix:

    https://github.com/RGLab/MAST/tree/fix/bayesglm

Click Next. Select a categorical factor that has the two groups to compare and click Next.

    Setup the comparison(s) based on the factor selected:

The subgroups of the factor are displayed in the left panel; click to select one and move it to one of the boxes on the right. The difference calculation for the comparison will use the group in the top box minus the group in the bottom box. The dialog setup is similar to ANOVA/LIMMA-trend/LIMMA-voom.

Click on Configure in Advanced options; the smoothing span can be customized (the default value is 500). The p-value settings for DML and DMR will be used to filter the results.

    After applying the advanced options, click Finish to run the task.

    The task will generate two data nodes: DML and DMR which represent the differential methylation at loci level and region level respectively.
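For orientation, the corresponding DSS workflow in R is roughly the following sketch (bs is a hypothetical BSseq object of methylated/total counts; s1 and s2 are the sample names of the two groups; thresholds are illustrative):

```r
library(DSS)

# Wald test per CpG, with smoothing (span matches the default above)
dml_test <- DMLtest(bs, group1 = s1, group2 = s2,
                    smoothing = TRUE, smoothing.span = 500)

dmls <- callDML(dml_test, p.threshold = 0.001)  # differentially methylated loci
dmrs <- callDMR(dml_test, p.threshold = 0.01)   # differentially methylated regions
```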

    Double click on DML node to open the report:

    In this report, each row is a locus which passed the p-value cutoff set in the advanced dialog:

    • chr: Chromosome where the CpG site is located

    • pos: Genomic base pair location of the CpG site

    • pval: Raw p-value from the Wald test for the differential methylation at this site

    • fdr: Adjusted p-value based on Benjamini-Hochberg method

• diff: Difference in methylation level between the groups. Positive values indicate higher methylation in group 1; negative values indicate higher methylation in group 2.

    • mu1 and mu2: Average methylation level in group 1 and group 2

• diff.se: Standard error of the estimated methylation difference between the two groups

    • stat: Wald test statistics used to assess significance of methylation difference

• phi1 and phi2: Dispersion parameters estimated for group 1 and group 2. They represent the biological variability in methylation level within each group; higher values indicate more variability within the group.

• postprob.overThreshold: Posterior probability that the methylation difference between the two groups exceeds a specified threshold (delta). By default, delta is 0.

The left filter panel usage is the same as in the GSA report.

    Double click to open the DMR report. This result is based on the DML results.

    In this report, each row is a region of a cluster of CpG loci that show consistent differential methylation between the two groups.

    • chr: Chromosome where the region is located

    • start: Start position of the region in base pairs

    • end: Stop position of the region in base pairs

    • length: Length of the region in base pairs

• nCG: Number of CpG sites within the region

• abs(areaStat): Absolute value of areaStat. A large value indicates strong evidence of differential methylation

    • diff.Methy: Difference in average methylation between the two groups

• meanMethy1 and meanMethy2: Average methylation level across the region in group 1 and group 2, respectively.

    • areaStat: Sum of the test statistics (stat in DML) across all the CpG sites in the region.
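
The columns above match the output of the DSS R package, whose methods are cited in the references below. For orientation only, and not the platform's exact invocation, an equivalent analysis in R might look like the following sketch; the BSobj object and the sample names are placeholders:

# Hypothetical DSS sketch; BSobj is assumed to be a BSseq object holding the
# methylation calls for the two groups of samples.
library(DSS)
dml_test <- DMLtest(BSobj, group1 = c("C1", "C2"), group2 = c("T1", "T2"),
                    smoothing = TRUE)                      # smoothed two-group test
dmls <- callDML(dml_test, delta = 0, p.threshold = 0.001)  # locus-level calls (DML)
dmrs <- callDMR(dml_test, p.threshold = 0.01)              # region-level calls (DMR)
head(dmls); head(dmrs)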

    References

    Feng, Hao, Karen N Conneely, and Hao Wu. 2014. “A Bayesian Hierarchical Model to Detect Differentially Methylated Loci from Single Nucleotide Resolution Sequencing Data.” Nucleic Acids Research 42 (8): e69–e69.

Park, Yongseok, and Hao Wu. 2016. “Differential Methylation Analysis for BS-Seq Data Under General Experimental Design.” Bioinformatics 32 (10): 1446–53.

Wu, Hao, Chi Wang, and Zhijin Wu. 2012. “A New Shrinkage Estimator for Dispersion Improves Differential Expression Detection in RNA-Seq Data.” Biostatistics 14 (2): 232–43.

    Wu, Hao, Tianlei Xu, Hao Feng, Li Chen, Ben Li, Bing Yao, Zhaohui Qin, Peng Jin, and Karen N Conneely. 2015. “Detection of Differentially Methylated Regions from Whole-Genome Bisulfite Sequencing Data Without Replicates.” Nucleic Acids Research 43 (21): e141–e141.

    Running Graph-based clustering

Graph-based clustering can run on any counts data node. However, it is very computationally intensive, so we recommend running PCA first and running this task on the PCA output data node using the top few PCs.

    • Click the counts data node or PCA output data node (recommended)

    • Click the Exploratory analysis section of the toolbox

    • Click Graph-based clustering

    • Configure the parameters

    • Click Finish to run

Choose which version of the Louvain clustering algorithm to use: options are Louvain [1], Louvain with refinement [2], SLM [3], and Leiden [4]; the default is Louvain. Choose the number of PCs to use; the default is the top 15 PCs. If the checkbox is selected, a Compute biomarkers task will be performed after this task runs, using the graph-based cluster annotation as its attribute.

    Cluster results

    Graph-based clustering produces a Clustering result data node. The task report lists the cluster results and cluster statistics. If clustering was run with Split cells by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.

    The Clustering result data node includes cluster assignment as a new attribute, Graph-based, for each observation. If the Clustering result data node is visualized by Scatter plot, PCA, t-SNE, or UMAP, the plot will be colored by the Graph-based attribute.
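
For intuition, the computation loosely resembles the following igraph sketch in R (a simplified mutual k-NN graph followed by Louvain clustering). This is illustrative only, not the platform's implementation, and expr_matrix is a placeholder for a features-by-cells matrix:

# Simplified graph-based (Louvain) clustering on top PCs.
library(igraph)
pcs <- prcomp(t(expr_matrix))$x[, 1:15]    # cells x top 15 PCs
k <- 30                                    # number of nearest neighbors
d <- as.matrix(dist(pcs))
nn <- t(apply(d, 1, function(row) rank(row, ties.method = "first") <= k + 1))
adj <- 1 * (nn & t(nn))                    # mutual k-NN adjacency, symmetric
g <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)
cl <- cluster_louvain(g)                   # maximize standard modularity
table(membership(cl))                      # cluster sizes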

    Advanced Graph-based clustering parameters

Derive neighbor and resolution parameters occurs automatically for single cell data.

    Derive neighbor and resolution parameters

Check this box to automatically derive the neighbor and resolution parameters. Those parameters are derived automatically from the number of cells, C.

    Resolution

To increase the number of clusters, increase the resolution. To decrease the number of clusters, decrease the resolution.

A larger resolution value may be more appropriate for larger numbers of cells.

    Prune parameter

    Removes links between pairs of points if their similarity is below the threshold. Larger values lead to a shorter run time, but can result in many singleton clusters.

    Number of nearest neighbors

Clustering preserves the local structure of the data by focusing on the distances between each point and its k nearest neighbors. The optimal number of nearest neighbors depends on the size and density of the data. Generally, a larger and/or more dense data set will benefit from a larger number of nearest neighbors. Increasing the number of nearest neighbors will increase the size of clusters and vice versa. The range of possible values is 3 to 100.

Scale

This parameter can be used to speed up clustering at the expense of accuracy. A larger scale implies greater accuracy and helps avoid singletons, but takes more time to run. To maximize accuracy, the total count of observations being clustered should be below the product of nearest neighbors and scale. The range of possible values is 1 to 100,000.

    Modularity function

    The modularity function measures the overall quality of clustering. Graph-based clustering amounts to finding a local maximum of the modularity function. Possibilities are Standard [5] and Alternative [6].

    Number of random starts

    The clustering result depends on the order observations are considered. Each random start corresponds to a different order and result. A larger number of random starts can deliver a better result because the result with the highest quality (modularity) out of all of the random starts is chosen. Increasing the number of random starts will increase the run time. The range of possible values is 2 to 1,000.

    Random seed

    The random seed is used in the random starts portion of the algorithm. Using a different seed might give a better result. Use the same random seed to reproduce results.

    Number of iterations per random start

    To maximize modularity, clustering proceeds iteratively by moving individual points, clusters, or subsets of points within clusters. A larger number of iterations can give better results, but will take longer to run.

    Minimal cluster size

    Clusters smaller than the minimal cluster size value will be merged with a nearby cluster unless they are completely isolated. To avoid isolation, set the prune parameter to zero (default) and the scale parameter to the maximum (default).

    Sequential random starts

    Enable this option to use the slower sequential ordering of random starts.

    Nearest Neighbor Type

    Different methods for determining nearest neighbors. The K nearest neighbors (K-NN) algorithm is the standard. The NN-Descent algorithm is used by UMAP and is an alternative.

    Distance metric

    If NN-Descent is chosen for Nearest Neighbor Type, the metric to use when determining distance between data points in high dimensional space can be set. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.

    References

    [1] Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008.

    [2] Rotta, R., & Noack, A. (2011). Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 16, 2-3.

    [3] Waltman, L., & Van Eck, N. J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. The European Physical Journal B, 86(11), 471.

[4] Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9, 5233. https://doi.org/10.1038/s41598-019-41695-z

    [5] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.

    [6] Traag, V. A., Van Dooren, P., & Nesterov, Y. (2011). Narrow scope for resolution-limit-free community detection. Physical Review E, 84(1), 016114.

For the paired variant caller report, if the Merge pairs checkbox is unselected, pairs will be analyzed separately. If it is selected, all samples will be analyzed together.

    Cohort mutation summary report

    The Cohort mutation summary report provides a row in the table for all variant sites, either SNVs or INDELs, identified in the analysis. Hovering over a column header will provide a brief description of the column data.

    Columns presented in the table include the following information:

• View provides a link to Chromosome View by selecting the chromosome icon.

    • Chr represents chromosome from the reference assembly.

    • Position represents the base position in the chromosome.

    • Mutation type is the category of variant (Substitution for SNVs and Insertion or Deletion for INDELs).

    • Reference allele is the base(s) in the reference assembly sequence.

    • Case genotypes are the genotypes of the samples with a variant at the locus.

    • Variant frequency represents the frequency of the variant site in the sample cohort.

    • Sample count is the fraction of samples in the cohort with the variant.

    • Samples are the names of the samples that contain the variant.

    The Summarize cohort mutations task is not available for variants detected by LoFreq as no genotypes are produced from the caller.

    If variant detection was performed on paired samples in Samtools, the Genotype column will be replaced with four columns: GT Change presents the possible change in zygosity between cases and controls at the variant locus, Control Genotypes are the genotypes of the designated control samples in the pairs, and Case Genotypes are the genotypes of the cases in the pairs.

    Additional columns can be added to the Cohort mutation summary report table by selecting Optional columns. The optional columns are dependent upon the information present in the underlying vcf file and include variant and sample metrics from variant detection and information from the annotation. Hovering over a term in the list will provide a brief description of the data contained in that column. Optional columns can also be used to exclude default columns in the table.

Below each data column header in the Cohort mutation summary report, a text box allows for filtering of the table. Filtering can be useful for limiting the list of variants to those of interest when large numbers of variants are present in the table. For columns with numbers, exact values or ranges using either ">" or "<" can be used in the search. For columns with letters or words, an exact string of characters must be entered in order to obtain a match. In the case of table cells with multiple entries, there must be an exact match between the query and one entry to retain the table row.

    If the Summarize cohort mutations task is performed upon an Annotated variants data node, additional information can be presented in the Cohort mutation summary report table. Click on Optional columns to select more fields to add to the table.

At any point, information in the Cohort mutation summary report table can be saved in text or vcf format by selecting Download at the top left corner of the table. If the table is exported in text format, the visible table will be appended with additional columns for all samples in the analysis. These columns specify the genotype call for each variant locus in the analysis. In instances where no variant was detected within a sample, the coverage at the locus is compared to the Minimum coverage for genotype calls value specified in the task dialog: a homozygous reference genotype is called if coverage is above the specified threshold, or no genotype if it is below.


    Analyses and Data Management

    Analyses

    In the Analyses tab, you'll see all the analyses in your study. The analyses will be displayed in card view by default. Each card will display the following information.

    Information
    Description

    Analysis Name

    Name of analysis.

    You can switch to list view by clicking the list icon in the top right. In the list view, you can customize the columns and filter your analyses. Click the info icon to get more detail about the analysis including error details.

    Create Analysis

    To create a new analysis, click .

    Give your analysis a name and choose an analysis type from the dropdown menu.

    Analysis Options

    Analysis Option
    Description

    Next, select the sample, sample group, or all samples for the analysis.

    If all samples are chosen for the analysis, a sample group will be automatically added to the Sample Groups menu containing all of the samples in the analysis.

Current spatial multisample analysis does not yet support importing a mix of samples that use pipeline-manifest.json and samples that do not. When running an analysis on a sample group that has both types of samples, only those samples with a pipeline-manifest.json file will be imported.

    Run Analysis

    Click to run the analysis. You will receive a notification that the analysis was successfully created.

    Wait for the analysis status to change from "Pending" to "Complete".

    Analysis Statuses

    The following table describes the different analysis statuses.

    Status
    Description

Once the analysis is complete, you can click into it to view the results. Refer to the "Enter Analysis" section for more information on how to view your analysis.

    Data

    In the Data tab, you'll see a list of all the data and metadata files within your study from which your samples are derived.

    Note that deleting data or metadata files will not remove any samples that were added to Connected Multiomics from those files. The samples will remain in Connected Multiomics even if the original files are deleted, ensuring that your sample data stays intact.

    Descriptive statistics

The Descriptive statistics task can be invoked on a matrix data node, e.g. a Gene counts or Normalized counts data node in the bulk RNA-Seq analysis pipeline, or a Single cell counts data node. It calculates measures of central tendency and variability on the observations or features of the matrix data.

    Running Descriptive statistics

    • Click on a matrix data node

    • Choose Descriptive Statistics in Statistics section of the toolbox

This will invoke the configuration dialog; use it to specify which calculation(s) will be performed on observations or features.

When the calculation is for observations (samples or cells), a drop-down option lets you use all the features in the input data node or a list of features. If you use a saved feature list, a checkbox controls whether matching of the saved list to your data is case sensitive.

When the calculation is for features, the Group by drop-down list allows the statistics to be computed in each group separately.

Click the button to add more than one attribute; the result will be computed on the groups formed by the interaction terms of the selected attributes.

The available statistics are listed in the left panel. Suppose x1, x2, ..., xn represents an array of numbers:

• Coefficient of variation (CV): CV = s / mean(x), where s represents the standard deviation

• Geometric mean: (x1 × x2 × ... × xn)^(1/n)

• Max: max(x1, ..., xn)

• Mean: mean(x) = (x1 + x2 + ... + xn) / n

Left-click a measurement and drag it to the right panel one at a time, or mouse over a measurement and click the + button to move it to the right panel. Then click Finish.

    The output data node can be downloaded or visualized in Data Viewer:
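
As a rough illustration (not the platform's implementation), the feature-level statistics above could be reproduced in R along these lines; the toy matrix is an assumption:

# Toy per-feature descriptive statistics; rows are features, columns are samples.
counts <- matrix(rpois(200, lambda = 5), nrow = 20)
stats <- data.frame(
  mean    = apply(counts, 1, mean),
  median  = apply(counts, 1, median),
  sd      = apply(counts, 1, sd),
  cv      = apply(counts, 1, function(x) sd(x) / mean(x)),
  n_cells = apply(counts, 1, function(x) sum(x > 0))  # "Number of cells whose value > 0"
)
head(stats)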

    Overview

    The Study overview screen will display the following Study Information:

    Information
    Description

    Date Created

    The date and time the study was created.

    Created By

    The user who created the study.

    Date Modified

    The last date and time the study was modified. This date updates whenever a user performs an action within the study, such as adding data, creating a sample group, starting a new analysis, etc.

    Modified By

    The user who last modified the study.

    You will also see the following Data Metrics:

    Information
    Description

If there are analyses in your study, you will see the Recent Analyses section at the bottom of the Overview screen. For more information on analyses, see the Analyses section.

    Update Study

    To update any details of your study, click the gear icon in the top right. This will open a popup where you can edit the study name, description, ICA project, automatic single sample default analysis and automatic data import options.

    If you remove any data types for automatic data import, output files from the pipelines associated with the data types you've removed will no longer be automatically imported into the study upon pipeline completion. On the other hand, if you add a new data type, output files from the corresponding pipelines will now be automatically imported into the study upon pipeline completion.

    Click to save the changes.

    Add Data

To add data, click the button in the top right. You can select data from an ICA project, upload local files, or add data from other sources. Selecting data from an ICA project will open a new screen where you can select your data. At the top of the screen, you'll see the name of your study. Here you can select the data type to import.

For each data type, you will only be able to choose data formats from Illumina or supported third-party formats. Refer to the Supported Data Types page to see which file extensions are accepted for each data format. TSV metadata files can always be selected. For example, if you choose Bulk > Proteomics as the data type, you will only see ADAT files and TSV files as options for upload.

Each column has a filter icon to filter data and a three-dot menu with more options. Select an entire folder or click into the folder to select files individually. The demo data used to generate the Tutorial Study is also available to add to your study.

Once you've selected your data, click to add the data to your study. Upon submitting, you'll be redirected to the Samples tab, and a message will indicate your data ingestion was successful. Here you can upload metadata from local storage, add additional data from an ICA project, create sample groups, or create a new analysis.

    Now, if you click back to the Overview tab, you'll see that the Date Modified was updated and the Data Metrics reflect the newly added files and samples.

    Repeat the process to add data for additional data types.

    Monocle 3

    In Connected Multiomics, we use tools from Monocle 3 (1) to build trajectories, identify states and branch points, and calculate pseudotime values. The output of Trajectory analysis task includes an interactive 2D/3D visualization for viewing the trajectory trees and setting the root states (starting points of the trajectories). From the Trajectory analysis report, you can run a second task, Calculate pseudotime, which adds a numeric cell-level attribute, Pseudotime, calculated using the chosen root states.

    Prerequisites for the Analysis

Trajectory analysis by Monocle 3 requires normalized counts data. According to the Monocle 3 authors, you may want to filter to the top 5,000 genes with the highest variance (2,000 genes for datasets with fewer than 5,000 cells, and 300 genes for datasets with fewer than 1,000 cells) (1). Those numbers should be used as guidance for a first-pass analysis and may need to be optimized depending on the project at hand and the biological question.

    Setting up Trajectory Analysis

To run the Trajectory analysis tool, select the Normalized counts data node (or equivalent) and go to the toolbox: Exploratory analysis > Trajectory analysis

    The configuration dialog presents four options

1. Dimensionality of reduced space. This option specifies the number of UMAP dimensions the original data are reduced to in order to learn the trajectory tree (the dimensionality of the original data equals the number of genes). The default is two, meaning the trajectory plot will be drawn in two dimensions. To get a 3D trajectory plot, increase this option to 3.

    2. Scaling. Normalized expression values can be further transformed by scaling to unit variance and zero mean (i.e. converting to Z score). The use of this option is recommended (1).

3. Data is logged. Select this option if the data have already been log-transformed upstream. When selected, Monocle 3 will skip the log2 step on the input data (see below).

4. Programmatically calculate default root nodes. If not selected, the user has to specify the root nodes of the trajectory tree manually (default). Depending on the available meta-data, Monocle 3 may be able to pick the root nodes programmatically.

    Under the hood, Monocle 3 will perform log2 transformation of the gene count matrix (if Data is logged was unselected), scale the matrix (if Scaling was selected), and project the gene count matrix into the top 50 principal components. Next, the dimensionality reduction will be implemented by UMAP (using default settings of the reduce dimension command).
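
For readers who use Monocle 3 directly in R, those steps loosely correspond to the calls below. This is an illustrative sketch, not the platform's exact invocation, and cds is assumed to be a cell_data_set built from the normalized counts:

# Rough monocle3 equivalent of the trajectory workflow (illustrative only).
library(monocle3)
cds <- preprocess_cds(cds, num_dim = 50, norm_method = "log")   # log2 + top 50 PCs
cds <- reduce_dimension(cds, max_components = 2,
                        reduction_method = "UMAP")              # 2D trajectory space
cds <- cluster_cells(cds)
cds <- learn_graph(cds)                                         # fit the trajectory tree
cds <- order_cells(cds)            # pick root node(s) interactively, then pseudotime
plot_cells(cds, color_cells_by = "pseudotime")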

    Trajectory Analysis Result

The result of running Trajectory analysis is the Trajectory result data node. Double clicking on the node opens a Data Viewer window with the trajectory plot. The cell trajectory graph shows the position of each cell (blue dot) with respect to the UMAP coordinates (axes). Cell trajectories (one or more, depending on the data set) are depicted as black lines. Gray circles are trajectory nodes (i.e. cell communities).

To show or hide the cell trajectory tree and trajectory nodes, select Axes in the Configure section and, at the bottom of the dialog, use the Extra data drop-down options.

    Pseudotime Analysis

To perform pseudotime analysis, you need to point to the cells at the beginning of the biological process you are interested in, for example, cells at the earliest stage of a differentiation sequence. There are two ways to perform pseudotime analysis in Connected Multiomics, depending on the way the root nodes (the cells at the beginning of pseudotime) are defined.

    1. Manual selection of root node. The user has to specify the root nodes (one or more).

    2. Automatic selection of the root node. The root node is picked by the algorithm.

    Manual Selection of the Root Node

    If you want to manually pick the root nodes, leave the option Programmatically calculate default root nodes unselected when setting up the Trajectory analysis.

    To start, select the root cell nodes (gray circles in trajectory tree) by left-clicking. If the trajectory result consists of more than one trajectory tree, you can specify more than one root node, e.g. one root node per trajectory tree (ctrl & click). If no root node is specified for a tree, that tree will not be included in the pseudotime calculation. The following dialog shows an example where two root nodes were identified.

Click the Additional button in the Tools section on the left panel, then push the Calculate pseudotime button in the dialog.

As a result, the cells will be annotated by pseudotime. If, for a particular tree, no root node has been identified, those cells will be omitted from the pseudotime calculation and will be colored in black.

The pseudotime calculation displays the structure of the graph using black lines. The circles with numbers (cell nodes) on the black lines represent special points. There are three types of cell nodes:

    1. Root node (white). Root nodes are start points of the pseudotime and were defined by the user in the previous step.

    2. Branch node (black). Branch nodes indicate where the trajectory tree forks out; i.e. each branch represents a different cell fate or different trajectory.

3. Leaf (light gray). Leaves correspond to different cell fates, i.e. different trajectory outcomes.

    The numbers within the circles are provided for reference purposes only. The intermediate nodes from the previous step have been removed.

    Automatic Selection of the Root Node

If suitable meta-data are available, it is possible to automatically select the root node. For example, you may know which cells were harvested from the earliest time points. The cells need to be annotated by that information (Annotate Cells task) before running Trajectory analysis. The annotation will, in turn, be available in the Trajectory analysis setup dialog, upon selecting the Programmatically calculate default root nodes option.

    1. Attribute for root nodes. The drop down list will show the available cell-level attributes. Specify the one which should be used to identify the root nodes.

2. Attribute value for root nodes. The drop-down list will show the content of the attribute selected under Attribute for root nodes. Specify the entry that corresponds to the earliest time point.

    Once the options have been set, Monocle 3 will first group the cells according to which trajectory node they are nearest to. It then calculates the fraction of the cells from the earliest time point at each trajectory node. Finally, it picks the node with the highest prevalence of the early cells and treats it as the root node.

    References

    1. Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, Zhang F, Mundlos S, Christiansen L, Steemers FJ, Trapnell C, Shendure J. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019 Feb;566(7745):496-502. doi: 10.1038/s41586-019-0969-x. Epub 2019 Feb 20. PMID: 30787437; PMCID: PMC6434952.

    ANOVA/LIMMA-trend/LIMMA-voom

The ANOVA method applies a specified lognormal model to all features.

    • ANOVA dialog

    • ANOVA advanced options

    • ANOVA report

    ANOVA dialog

To set up an ANOVA model or the alternative Welch's ANOVA (used on normally distributed data that violates the assumption of homogeneity of variance), select factors from the sample attributes. The factors can be categorical or numeric attributes. Click a checkbox to select a factor and click the Add factors button to add it to the model.

    LIMMA-trend and LIMMA-voom setup dialogs are identical to ANOVA's setup.

Note: The LIMMA-voom method can only be invoked on the output data nodes of the following normalization methods, which can produce library sizes:

    TMM, CPM, Upper Quartile, Median ratio, Postcounts

    When more than one factor is selected, click Add interaction button to add interaction term of the selected factors.

    Once a factor is added to the model, you can specify whether the factor is a random effect (check Random check box) or not.

Most factors in an analysis of variance are fixed factors, i.e. the levels of that factor represent all the levels of interest. Examples of fixed factors include gender, treatment, genotype, etc. However, in experiments that are more complex, a factor can be a random effect, meaning the levels of the factor only represent a random subset of all of the levels of interest. Examples of random effects include subject and batch. Consider the example where one factor is type (with levels normal and diseased), and another factor is subject (the subjects selected for the experiment). In this example, “Type” is a fixed factor since the levels diseased and normal represent all conditions of interest. “Subject”, on the other hand, is a random effect since the subjects are only a random sample of all the levels of that factor. When a model has both fixed and random effects, it is called a mixed model.
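
To make the distinction concrete, the example above can be written as R model formulas (lme4 syntax); expr, type, subject, and df are illustrative names for a single feature's expression and the study design:

# Fixed effect only: type (normal vs. diseased) exhausts the levels of interest.
fit_fixed <- lm(expr ~ type, data = df)

# Mixed model: subject is a random sample of possible subjects, so it enters
# as a random intercept.
library(lme4)
fit_mixed <- lmer(expr ~ type + (1 | subject), data = df)
summary(fit_mixed)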

    When more than one factor is added to the model, click on the Cross tabulation link at the bottom to view the relationship between the factors in a different browser tab.

Once the model is set, click the Next button to set up comparisons (contrasts).

Start by choosing a factor or interaction from the Factor drop-down list. The subgroups of the factor or interaction will be displayed in the left panel; click to select one or more levels or subgroup names and move them to one of the boxes on the right. The ratio/fold change calculation for the comparison will use the group in the top box as the numerator and the group in the bottom box as the denominator. When multiple levels (groups) are in the numerator and/or denominator boxes: in Combine mode, clicking the Add comparison button combines all numerator levels and all denominator levels into a single comparison in the Comparison table below; in Pairwise mode, clicking the Add comparison button splits the numerator and denominator levels into a factorial set of comparisons, in other words, it adds a comparison for every numerator level paired with every denominator level. Multiple comparisons from different factors can be added from the specified model.

    ANOVA advanced options

Click Configure to customize the Advanced options.

The multiple test correction sections are the same as the matching GSA advanced options; see the GSA advanced options section.

    Report option

• Use only reliable estimation results: There are situations when a model estimation procedure does not fail outright but still encounters some difficulties. In this case, it can even generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is Yes.

• Report p-value for effects: If set to No, only the p-value of the comparison will be displayed in the report; the p-values of the factors and interaction terms are not shown in the report table. When you choose Yes, in addition to the comparison's p-value, type III p-values are displayed for all the non-random terms in the model.

• Shrinkage to error term variance: by default, None is selected, which is the lognormal model. The Limma-trend and Limma-voom options are lognormal with shrinkage (Limma-trend is the same as the GSA default option, lognormal with shrinkage). Shrinkage options are recommended for small sample size designs; no random effects can be included when performing shrinkage. If there are numeric factors in the model, partial correlations cannot be reported on the numeric factors when shrinkage is performed. Limma-trend works well if the ratio of the largest library size to the smallest is not more than 3-fold; it is simple and robust for any type of data. Limma-voom is recommended for sequencing data when library sizes vary substantially, but it can only be invoked on data nodes normalized using the TMM, CPM, or Upper quartile methods, while Limma-trend can be applied to data normalized using any method.
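
For orientation, the Limma-trend and Limma-voom options correspond roughly to the following limma/edgeR calls in R; counts and group are assumed inputs, and this sketch is not the platform's exact pipeline:

# Sketch of limma-trend vs. limma-voom on a gene x sample count matrix.
library(limma); library(edgeR)
dge <- calcNormFactors(DGEList(counts = counts, group = group), method = "TMM")
design <- model.matrix(~ group)

# limma-trend: fit on logCPM and model the mean-variance trend in eBayes.
logCPM <- cpm(dge, log = TRUE, prior.count = 3)
fit_trend <- eBayes(lmFit(logCPM, design), trend = TRUE)

# limma-voom: estimate observation-level precision weights, then fit.
v <- voom(dge, design)
fit_voom <- eBayes(lmFit(v, design))
topTable(fit_voom, coef = 2)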

    ANOVA report

Since there is only one model for all features, there are no pie charts for design models or response distribution information. The gene list table format is the same as in the GSA report.

    References

    1. Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS, B, 57, 289-300.

    2. Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31: 2013-2035.

3. Auer, P. L., Doerge, R. W. (2011). A two-stage Poisson model for testing RNA-Seq data.

4. Burnham, K. P., Anderson, D. R. (2010). Model selection and multimodel inference.

    Getting Started

During the software registration process, you will create a domain and workgroup. Domains and workgroups are used by Illumina Software to control access to different customers' data and assets. Make sure all users are added to the workgroup and that the workgroup has the necessary permissions to access the Connected Multiomics application; users also need a subscription to the software.

    Here are the details of the steps to register in video format and as a step-by-step walkthrough:

    Step 1: Domain Registration

    To register your Illumina Connected Software, click the registration link in the e-mail provided by Illumina.

This will bring you to the Illumina Software Registration Portal. If you do not have an Illumina Connected Software account, click Sign Up to create a new account.

    Normalization

    Raw read counts are generated after quantification for each feature on all samples. These read counts need to be normalized prior to differential expression detection to ensure that samples are comparable.

This chapter covers the implementation of each normalization method. The Normalize counts option is available on the context-sensitive menu upon selection of any data node that contains a sample/cell-by-feature matrix.

The format of the output is the same as the input data format; the output node is called Normalized counts. This data node can be selected and normalized further using the same task.

    Selecting Methods

Select whether you want your data normalized on samples/cells or on features. Some transformations are performed on each value independently of the others, e.g. log transformation, and will give an identical result regardless of this choice.

    Correlation

Correlation analysis is used to test the relationship between two numeric variables. It determines the strength and direction of the association between them. The methods included in Connected Multiomics are linear correlation (Pearson's correlation) and rank correlations (Spearman's rank correlation and Kendall's tau).
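
As a quick illustration of the three methods in standard R (not platform-specific):

# Pearson measures linear association; Spearman and Kendall use ranks,
# so they are robust to monotone but non-linear relationships.
x <- c(1, 2, 3, 5, 8)
y <- c(2, 1, 4, 6, 9)
cor(x, y, method = "pearson")
cor(x, y, method = "spearman")
cor(x, y, method = "kendall")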

There are four types of format used to compute correlation:

    Survival Analysis with Cox regression and Kaplan-Meier analysis

    Introducing Survival Analysis

Survival analysis is a branch of statistics that deals with modeling time-to-event data. In the context of “survival,” the most common event studied is death; however, any other important biological event can be analyzed in a similar fashion (e.g., spreading of the primary tumor or occurrence/relapse of disease). Survival analysis tries to answer questions such as: What proportion of a population will survive past a certain time (i.e., what is the 5-year survival rate)? At what rate does the event occur? Do particular characteristics have an impact on survival rates (e.g., are certain genes associated with survival)? Is the 5-year survival rate improved in patients treated with a new drug? Cox regression and Kaplan-Meier analysis are two techniques commonly used in survival analysis.

In survival analysis, the event should be well-defined with two levels and occur at a specific time. Because the primary outcome of the event is typically unfavorable (e.g., death, metastasis, relapse, etc.), the event is called a “hazard.” The hazard ratio is used to assess the likelihood of the event occurring while controlling for other co-predictors (co-variables/co-factors) if added to the model. In other words, the hazard ratio describes how rapidly the event is experienced by comparing the hazard between groups. A hazard ratio greater than 1 indicates a shorter time-to-event (increase in the hazard), a hazard ratio less than 1 is associated with a greater time-to-event (reduction in the hazard), and a hazard ratio of 1 indicates no effect on time-to-event. For example, if the hazard ratio is 2, one group experiences the event at twice the rate of the other group. In cancer studies, a hazard ratio greater than 1 is considered a bad prognostic factor, while a hazard ratio less than 1 is a good prognostic factor.
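
For orientation, Kaplan-Meier curves and a Cox model with a co-predictor can be fit with the survival R package as sketched below; the data frame df and its columns time, status, group, and age are assumptions for illustration:

# Kaplan-Meier estimate per group and a Cox proportional hazards model.
library(survival)
fit_km <- survfit(Surv(time, status) ~ group, data = df)
summary(fit_km, times = 60)   # e.g. survival probability at 60 months (5 years)

fit_cox <- coxph(Surv(time, status) ~ group + age, data = df)
summary(fit_cox)              # exp(coef) is the hazard ratio for each predictor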

# Uninstall the default version of MAST, if it is installed.
remove.packages("MAST")
# Install devtools, if it is not installed yet.
install.packages("devtools")
library(devtools)
# Install the fixed version of MAST from the topic branch.
install_github("RGLab/MAST", ref = "fix/bayesglm")
library(MAST)


  • Report partial correlations: If the model has a numeric factor(s), when choosing Yes, partial correlation coefficient(s) of the numeric factor(s) will be displayed in the result table. When choosing No, partial correlation coefficients are not shown.

  • Data has been log transformed with base: showing the current scale of the input data on this task.

• Law, C. W., Chen, Y., Shi, W., Smyth, G. K. (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15:R29.

• http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html#cuffdiff-output-files

• Anders, S., Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11:R106.


    ICA Project

    The ICA project the study imports data from.

    Description

    The description of the study, provided by the user during study creation.

    Sample Files

The number of sample files within the study, including both manually added and auto-ingested sample files. Please see the Supported Data Types page for a list of possible sample file extensions.

    Metadata Files

    The number of metadata files within the study, identified by the .tsv file extension.

    Samples

The number of samples within the study. Note that multiple protein samples can come from a single ADAT file, whereas a single spatial or scRNA sample is created for each set of spatial/scRNA files. Please see the Supported Data Types page for more information.

    Analyses

    The number of analyses within your study, including both in-progress and completed analyses.

    Attributed Sample Count

    The number of samples in the study with linked metadata.

    Metadata Entry Count

    The number of metadata attributes displayed in the study. This corresponds with the number of metadata columns shown next to your samples.


    Custom: Multiomics

    Available inputs include Illumina proteomics, single cell, spatial, and bulk transcriptomics, miRNA, 5-base DNA and Infinium methylation as well as third-party data types including Seurat (RNA), Seurat (ATAC), Somalogic ADAT, Gene counts in sf Format, and VCF. Two or more data types can be analyzed together in any combination.

    Custom: Illumina Proteomics

    Starts with the quantified samples which have undergone prior normalization and offers flexible analysis options. The maximum samples for analysis is 9,000.

    Custom: Illumina Single Cell Transcriptomics

Starts with single cell counts and offers flexibility in the analysis pipeline steps. For each sample, all features are reported; if feature IDs are not unique, Mean is used for deduplication. Raw counts are used as the count value format, and cells with a total read count of at least 400 are reported. The maximum samples for analysis is 1,000.

    Custom: Illumina Spatial Transcriptomics

Starts from two nodes, cell-binned or grid-binned data, including the spatial image outputs, and offers flexibility with the analysis parameters. The maximum samples for analysis is 100.

    Custom: Illumina Bulk Transcriptomics

Starts with salmon-format sample counts that have not been normalized and offers flexible analysis options. The assembly and annotation model used in secondary analysis is required to annotate the features. The maximum samples for analysis is 500.

    Custom: Illumina miRNA

Starts with an imported count matrix that has not been normalized and offers flexible analysis options. The maximum samples for analysis is 500.

    Custom: Illumina 5-base DNA Methylation

    Starts with importing selected 5-base methylation data and offers flexible analysis options. The maximum samples for analysis is 200.

    Custom: Illumina Infinium Methylation

    Starts with importing selected idat files and offers flexible analysis options.

    Custom: Third-party analysis

    Starts with data import. This option appears only when third-party assay data has been uploaded.

    Status

    Status of analysis. See Analysis Statuses for a table of statuses and descriptions.

    Analysis Type

    Type of omic data used in the default or custom analysis.

    Modified By

    Last user who modified the analysis.

    Date Modified

    Date and time when the last update was made.

    Default: Illumina Proteomics

    General analysis pipeline for Illumina protein prep samples, with PCA and hierarchical clustering heatmap results. The maximum samples for analysis is 9,000.

    Default: Illumina Single Cell Transcriptomics

    General analysis pipeline for Illumina single cell prep samples, including the processing steps to output PCA and UMAP plots, plus graph-based clustering visualizations.

For each sample, all features are reported; if feature IDs are not unique, Mean is used for deduplication. Raw counts are used as the count value format, and cells with a total read count of at least 400 are reported. The maximum samples for analysis is 1,000.

    Default: Illumina Spatial Transcriptomics

General analysis pipeline for Illumina spatial samples, including a spatial map with transcripts overlaid on the tissue image where each point is a grid. Graph-based clusters are also plotted on a UMAP and pie chart. The maximum samples for analysis is 100.

    Default: Illumina Bulk Transcriptomics

    General analysis pipeline for Illumina mRNA and Total RNA prep samples including PCA and hierarchical clustering heatmap results. The maximum samples for analysis is 500.

    Default: Illumina miRNA

    General analysis pipeline for Illumina miRNA prep samples including PCA and hierarchical clustering heatmap results. The maximum samples for analysis is 500.

    Default: Illumina 5-base DNA Methylation

    General analysis pipeline for 5-base DNA prep samples including regional methylation PCA plots, k-means clustering, and QC. The maximum samples for analysis is 200.

    Pending

    The analysis has been initiated but has not yet started. It is waiting in the queue to be processed.

    In Progress

    The analysis is currently being executed. The system is processing the data and generating results.

    Complete

    The analysis has finished, and the results are ready to be viewed. Click into the analysis to view it.

    Error

    The analysis has ended in error. Using the Analysis tab list view, click the info icon to get more information about the error type. The error can be further diagnosed using the Inspect analysis link.


• Median: on the sorted values, when n is odd the median is x((n+1)/2); when n is even the median is (x(n/2) + x(n/2+1)) / 2

• Median absolute deviation: MAD = median(|xi − m|), where m = median(x1, ..., xn)

• Min: min(x1, ..., xn)

• Number of cells: Available when Calculate for is set to Features. Reports the number of cells whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box. The cutoff is applied to the values present in the input data node, i.e. if invoked on a non-normalized data node, the values are raw counts. For instance, use this option if you want to know the number of cells in which each feature was detected; possible filter: Number of cells whose value > 0.0

• Percent of cells: Available when Calculate for is set to Features. Reports the percentage of cells whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box.

• Number of features: Available when Calculate for is set to Cells. Reports the number of features whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box. The cutoff is applied to the values present in the input data node, i.e. if invoked on a non-normalized data node, the values are raw counts. For example, use this option if you want to know the number of detected genes per cell; filter: Number of features whose value > 0.0

• Percent of features: Available when Calculate for is set to Cells. Reports the fraction of features whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box.

• Q1: 25th percentile

• Q3: 75th percentile

• Range: xmax − xmin

• Standard deviation: s = sqrt( Σ(xi − mean(x))² / (n − 1) ), where mean(x) is the arithmetic mean

• Sum: x1 + x2 + ... + xn

• Variance: s² = Σ(xi − mean(x))² / (n − 1)


    If you already have a user account for Illumina Connected Software, enter those credentials into the login screen.

    Once logged in, you can review the order details for each order. Select your order for Illumina Connected Multiomics and click Setup. Multiple Connected Multiomics subscriptions can be selected and registered at one time.

    You will need to select a region and create a domain to which the software will be registered. Choose an existing domain or create a new domain for your account.

    When creating a new domain, you will be considered the administrator for that domain. Type a display name for the domain, as well as the domain URL. Also enter specific e-mail addresses or e-mail extensions that are permitted in this domain. The list of allowed emails and e-mail extensions can be updated later in the Admin Console.

    Click Setup to complete order registration.

    A confirmation e-mail will be sent once the domain URL is activated. In the confirmation e-mail, access your domain by clicking on the domain URL link provided.

    Adding the link to your browser's bookmark is recommended for easier access in the future.

    Sign in to your domain by entering your credentials into the login screen. Once signed in, the product dashboard will display all Illumina Connected software platforms you have access to. The Admin Console is a platform designed for administrators to manage domain access and control workgroup permissions.

    Step 2: Workgroup Creation

    Workgroups are groups of users that can share projects and data. Some Illumina Connected software platforms require workgroups to function.

In the product dashboard, click Admin on the left panel to go to the Admin Console.

Click the WORKGROUPS tab at the top of the dialog. Workgroups can only be created by domain administrators.

Click the add button to create a new workgroup; specify a workgroup name as well as a description. Type in the Administrator Email and click Create.

Note: You can allow users outside this domain to be added to a workgroup if you select the Enable collaborators outside of this domain option.

    Any domain user can be assigned to be the workgroup administrator. The workgroup administrator can manage workgroup membership by adding or removing domain users.

    If you wish to add a collaborator to a workgroup, that collaborator must first be added as a user in that domain.

    Step 3: Add Users (Optional)

To invite new users to a domain, click the DOMAIN tab at the top of the Admin Console dialog. Then click USER MANAGEMENT on the left panel, followed by the Domain Invitation tab.

    Type in the e-mail address of the user you wish to invite to the domain, then click Invite. An invitation e-mail will be sent to the invited user.

    The user should follow the instructions in the email to register an account with the domain.

The domain administrator can verify the successful registration of the invited user in the Admin Console. Click the DOMAIN tab at the top of the dialog, select USER MANAGEMENT on the left panel, and click the Users tab on the User Management page. Click the Manage button. The user state information will be displayed at the bottom of the page.

    The newly registered user can now be added as a collaborator in a workgroup.

To add the registered user to a workgroup, navigate to the WORKGROUPS tab at the top of the Admin Console dialog.

Select the workgroup you want to add users to, click USERS on the left panel, and then click the Invite button.

    Type in the user's email and specify product access. Click Grant Access.

Note: The options to invite via public Illumina account or invite via collaborative enterprise only appear if the workgroup was set up as collaborative. These can be used to invite Illumina tech support or other outside email addresses to your domain.

    The users added to this workgroup will be listed on the Users page.

    Step 4: Subscription Assignment

For certain subscriptions, such as Illumina Connected Multiomics, users need to be assigned to the subscription in order to access the associated application.

To give a user access to Illumina Connected Multiomics, for example, go to the product dashboard and click Subscriptions on the left side panel.

    At the Subscriptions panel, look for Illumina Connected Multiomics and click Assign.

Only domain administrators can assign subscriptions. Enter the e-mail address of a registered domain user that you wish to assign the subscription to and click Assign.

Once a user is assigned to a domain and workgroup with the necessary permissions, the user will be able to perform data analysis in Connected Multiomics.

    Log In To Connected Multiomics

    Once you have successfully set up your account and created your domain and workgroup, you can log in to Connected Multiomics. When logging into Connected Multiomics, you will be required to select a domain and workgroup to establish the context for your work. Follow the steps below.

    1

    Access the software by clicking this link: login.illumina.com/login

    Adding the URL link to your browser bookmark is recommended for easy access in the future.

    2

    Log into your domain by entering your credentials into the login screen, then click Sign In.

    3

Once signed in, you will be directed to the Product Dashboard page. Select the Illumina Connected Multiomics application tile from your Product Dashboard.

    4

    Select your workgroup from the dropdown menu if you have more than one workgroup.

    Once you select a workgroup, it will automatically be chosen during future logins. This selected workgroup will not be applied to other Illumina Connected Software applications. If you have only one workgroup, it will be selected for you by default.

    You will now be directed to the Studies overview screen. In the top right, you can click on your profile button to see your current workgroup, domain, and subscription information. To change workgroups, you can click on "Switch workgroup" and select your desired workgroup from the dropdown menu.

    Explore Onboarding Resources

    When you log into Connected Multiomics for the first time, you will be greeted with a Welcome Screen.

    Click Get started to take you directly to the home page of the application and begin navigating on your own.

    To explore a read-only interactive environment with pre-assembled studies, analyses, results, and visualizations, choose Try with demo data to bring you to the Connected Multiomics Tutorial Study. The source demo files used in the Tutorial Study are available to add to and use in your own Studies later, too.

    Select Help resources for easy links to user guide content, including a helpful query box, detailed walkthrough tutorials, instructional videos, and software release notes.

    You may return to the Welcome Screen and these selections anytime you'd like by selecting the icon on the top right of the application window next to your profile.

    Navigate Connected Multiomics

    In the Connected Multiomics navigation on the left panel, you'll see the following structure:

    • Studies are your primary work locations which contain your data and tools to execute your analyses. Studies can be considered as a binder for your work and information.

    • Analyses are a list of the analyses you have launched. Here, you can access graphs, view visualizations, and gather insights related to your data.

    • Sample Groups are collections of samples that are organized based on one or more metadata fields, allowing you to group and manage data effectively for analysis.

    Click on each section below to learn more.

    Domains
    Workgroups

    The following normalization methods will generate different results depending on whether the transformation was performed on samples or on features:

    • Divided by mean, median, Q1, Q3, std dev, sum

    • Subtract mean, median, Q1, Q3, std dev, sum

    • Quantile normalization

    Note that each task can only perform normalization on samples or features. If you wish to perform both transformations, run two normalization tasks successively. To normalize the data, click on a method from the left panel, then drag and drop the method to the right panel. Add all normalization methods you wish to perform. Multiple methods can be added to the right panel and they will be processed in the order they are listed. You can change the order of methods by dragging each method up or down. To remove a method from the Normalization order panel, click the minus button to the right of the method. Click Finish when you are done choosing the normalization methods.

    Recommended Methods

For some data nodes, recommended methods are available and the Use recommended button will appear; click on it and the methods will be populated in the right panel. For instance, the recommended method for scRNA-seq data is log2(CPM+1):
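
Written out for a single cell with toy numbers, the recommended log2(CPM+1) chain applies CPM first, then adds 1, then takes log2:

# log2(CPM + 1) for one cell's raw counts.
counts <- c(0, 3, 12, 85)
cpm <- counts / sum(counts) * 1e6
log2(cpm + 1)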

    Normalization Methods

    Below is the notation that will be used to explain each method:

    Symbol
    Meaning

    S

    Sample (or cell for single cell data node)

    F

    Feature

    Xsf

    Value of sample S from feature F (if normalization is performed on a quantification data node, this would be the raw read counts)

    TXsf

    transformed value of Xsf

    C

    Constant value

    b

    Base of log

    • Absolute value TXsf = | Xsf |

• Add TXsf = Xsf + C. A constant value C needs to be specified

• Antilog TXsf = b^(Xsf). A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen

• Arcsinh TXsf = arcsinh(Xsf). The hyperbolic arcsine (arcsinh) transformation is often used on flow cytometry data

• CLR (centered log ratio) TXsf = ln((Xsf + 1) / geom(Xsf + 1) + 1), where geom is the geometric mean of either the observation or the feature. This method can be applied to protein expression data.

• CPM (counts per million) TXsf = (10^6 × Xsf) / TMRs, where Xsf here is the raw read count of sample S on feature F, and TMRs is the total mapped reads of sample S. If quantification is performed on an aligned reads data node, total mapped reads is the aligned reads. If quantification is generated from an imported read count text file, the total mapped reads is the sum of all feature reads in the sample.

• Divided by When mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic will be calculated based on the transform on samples or features option. Example: if transform on Samples is selected, Divide by mean is calculated as TXsf = Xsf / Ms, where Ms is the mean of the sample. If transform on Features is selected, Divide by mean is calculated as TXsf = Xsf / Mf, where Mf is the mean of the feature.

• Log TXsf = log_b(Xsf). A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen

• Logit TXsf = log_b(Xsf / (1 − Xsf)). A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen

• Lower bound A constant value C needs to be specified; if Xsf is smaller than C, then TXsf = C; otherwise, TXsf = Xsf

• Median ratio (DESeq2 only), Median ratio (edgeR) These approaches are slightly different implementations of the method proposed by Anders and Huber (2010). The idea is as follows: for each feature, its expression is divided by the feature's geometric mean expression across the samples. Then, for a given sample, one takes the median of these ratios across the features and obtains a sample-specific size factor. The normalized expression is equal to the raw expression divided by the size factor. Median ratio (DESeq2 only) is present in R, DESeq2 package, under the name "ratio". This method should be selected if DESeq2 differential analysis will be used downstream; since it is not on a per-million scale, it is not recommended for any other differential analysis method except DESeq2. Median ratio (edgeR) is present in R, edgeR package, under the name "RLE". It is very similar to the Median ratio (DESeq2 only) method, but it uses a per-million scale.

• Multiply by TXsf = Xsf × C. A constant value C needs to be specified

• Poscounts (DESeq2 only) A DESeq2 size factor estimation option. Compared with Median ratio, the poscounts method can be used when all genes contain a sample with a zero. It calculates a modified geometric mean by taking the nth root of the product of the non-zero counts. It is not on a per-million scale.

• Quantile normalization A rank-based normalization method. For instance, if the transformation is performed on samples, it first ranks all the features in each sample. Say vector Vs is the sorted feature values of sample S in ascending order; a vector Vm is calculated as the average of the sorted vectors across all samples, and then each value in Vs is replaced by the value in Vm at the same rank. Detailed information can be found in [1].

    • Rank This transformation replaces each value with its rank in the list of sorted values. The smallest value is replaced by 1 and the largest value is replaced by the total number of non-missing values, N. If there are no tied values, the result is a perfectly uniform distribution. In the case of ties, all tied values receive the mean rank.

    • Rlog The regularized log transformation is the method implemented in the DESeq2 package under the name rlog. It applies a transformation to remove the dependence of the variance on the mean. It should not be applied to zero-inflated data such as single cell RNA-seq raw count data. The output of this task should not be used for differential expression analysis, but rather for data exploration, like clustering.

    • Round Round the value to the nearest integer.

    • RPKM (Reads per kilobase of transcript per million mapped reads [2]) TXsf = (10^9 × Xsf)/(TMRs × Lf), where Xsf is the raw read count of sample S on feature F, TMRs is the total mapped reads of sample S, and Lf is the length of feature F.

      If quantification is performed on an aligned reads data node, the total mapped reads is the number of aligned reads. If quantification is generated from an imported read count text file, the total mapped reads is the sum of all feature reads in the sample. If the feature is a transcript, the transcript length Lf is the sum of the lengths of all its exons. If the feature is a gene, the gene length is the distance between the start position of the most upstream exon and the stop position of the most downstream exon. See Bullard et al. for additional comparisons with other normalization packages [3].

      For paired-end reads, the normalization option will show up as FPKM (Fragments per kilobase per million mapped reads) rather than RPKM. However, the calculations are the same.

    • Subtract When mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic is calculated based on the Transform on samples or features option. Example: if Transform on Samples is selected, Subtract mean is calculated as TXsf = Xsf − Ms, where Ms is the mean of the sample. If Transform on Features is selected, Subtract mean is calculated as TXsf = Xsf − Mf, where Mf is the mean of the feature.

    • TMM (Trimmed mean of M-values) The scaling factors are produced according to the algorithm described in Robinson et al. [4]. The paper by Dillies et al. [5] contains evidence that TMM has an edge over other normalization methods. The reference sample is randomly selected. When performing the trimming, for M values (fold change), the upper 30% and lower 30% are removed; for A values (absolute expression), the upper 5% and lower 5% are removed.

    • TPM (Transcripts per million as described in Wagner et al [6]) The following steps are performed:

      1. Normalize the reads by the feature length. Here length is measured in kilobases but the final TPM values do not depend on the length unit. RPKsf = Xsf / Lf;

      2. Obtain a scaling factor for sample s as Ks = 10^-6 × Σf RPKsf, summing over all features F;

      3. Divide the raw reads by the length and the scaling factor to get TPM: TXsf = Xsf / Lf / Ks

    • Upper quartile The method is exactly the same as in the limma package [7]. The following is a simple summary of the calculation:

      1. Remove all the features that have 0 reads in all samples.

      2. Calculate the effective library size per sample: effective library size = (raw library size (in millions))*((upper quartile for a particular sample)/ (geometric mean of upper quartiles in all the samples))

      3. Get the normalized counts by dividing the raw counts per feature by the effective library size (for the respective sample)
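For readers who want to sanity-check these formulas outside the platform, below is a minimal numpy sketch of three of the transformations above (CPM, TPM, and the median-ratio size factors). The toy matrix and feature lengths are hypothetical, and the code is a conceptual illustration, not the platform's implementation.

```python
import numpy as np

# Hypothetical toy matrix: rows = features, columns = samples (raw counts).
counts = np.array([[10., 20., 15.],
                   [100., 80., 120.],
                   [0., 5., 3.],
                   [50., 60., 40.]])
lengths_kb = np.array([2.0, 1.5, 0.8, 3.2])  # feature lengths in kilobases

def cpm(x):
    """Counts per million: TXsf = 10^6 * Xsf / TMRs."""
    return 1e6 * x / x.sum(axis=0, keepdims=True)

def tpm(x, lkb):
    """Transcripts per million, following the three steps above."""
    rpk = x / lkb[:, None]                    # 1. reads per kilobase
    k = rpk.sum(axis=0, keepdims=True) / 1e6  # 2. per-sample scaling factor Ks
    return rpk / k                            # 3. TXsf = Xsf / Lf / Ks

def median_ratio_size_factors(x):
    """Median-of-ratios size factors (Anders and Huber, 2010)."""
    with np.errstate(divide="ignore"):
        log_geo_mean = np.log(x).mean(axis=1)  # per-feature geometric mean, log scale
    use = np.isfinite(log_geo_mean)            # features with any zero count are skipped
    log_ratios = np.log(x[use]) - log_geo_mean[use, None]
    return np.exp(np.median(log_ratios, axis=0))  # one size factor per sample

print(cpm(counts))
print(tpm(counts, lengths_kb))
print(counts / median_ratio_size_factors(counts))  # normalized expression
```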

    References

    1. Bolstad BM, Irizarry RA, Astrand M, Speed, TP. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics. 2003; 19(2): 185-193.

    2. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7): 621–628.

    3. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11: 94.

    4. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11: R25.

    5. Dillies MA, Rau A, Aubert J et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013; 14(6): 671-83.

    6. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data. Theory Biosci. 2012; 131(4): 281-5.

    7. Ritchie ME, Phipson B, Wu D et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(15): e97.

    8. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11(10): R106.

  • We recommend normalizing your data prior to running Correlation analysis, but it can be invoked on any counts data node.

    • Click the counts data node

    • Click the Statistics section in the toolbox

    • Click Correlation

    • Choose the method to use for correlation analysis

    Feature many-to-one correlation

    When the Feature many-to-one correlation option is selected and you click Next, correlation is performed between each selected numeric attribute and/or feature and every feature in the input data node, one pair at a time.

    When multiple numeric factors are added, the correlation analysis is performed for each factor against every feature in the data node independently. If you are interested in particular features, use the Search features box to add one or more.

    • Select factor(s) or feature(s) and click Add factors to include in the statistical test.

    • Click Next

    • It is optional to apply a lowest coverage filter or configure the advanced settings

    • Click Finish to run

    Correlation analysis produces a Correlation data node; double-click to open the task report which is similar to the ANOVA/LIMMA-trend/LIMMA-voom and GSA task reports and includes a table with features on rows and statistical results on columns.

    Each numeric attribute includes p-value, adjusted p-value columns (FDR step up and/or Storey q-value if included), and a partial correlation value.
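Conceptually, the many-to-one test computes one correlation coefficient and p-value per feature. Below is a minimal scipy sketch under that interpretation; the expression matrix and the age attribute are made-up, and the partial correlation reported by the platform is not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
expr = rng.poisson(20, size=(100, 12)).astype(float)  # 100 features x 12 samples (toy)
age = rng.uniform(20, 70, size=12)                    # hypothetical numeric attribute

# One test per feature: the attribute vs. that feature's expression, one pair at a time.
r_values, p_values = [], []
for i in range(expr.shape[0]):
    r, p = stats.pearsonr(age, expr[i])   # swap in spearmanr or kendalltau as needed
    r_values.append(r)
    p_values.append(p)
```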

    Correlation analysis advanced options

    Multiple test correction

    Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default. If you check the Storey q-value, an extra column with q-values will be added to the report.

    Use only reliable estimation results

    There are situations when a model estimation procedure does not fail outright but still encounters some difficulties. In this case, it can still generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default for Use only reliable estimation results is set to Yes.

    Correlation type

    Sets the type of correlation used to calculate the correlation coefficient and p-value. Options are Pearson (linear), Spearman (rank), Kendall (tau). Default is Pearson (linear).

    Similarity matrix

    The Similarity matrix task is only available on bulk count matrix data nodes. It is used to compute the correlation of every sample (or feature) vs. every other sample (or feature). The result is a matrix with the same set of samples (or features) on rows and columns; each value in the matrix is the correlation coefficient r.

    Select whether the computation is on samples or features, and choose the correlation method:

    • Pearson: linear correlation

    • Spearman: rank correlation

    • Kendall: rank correlation

    Click Finish to run the task. The output report of this task can be displayed in heatmap and/or table in the data viewer.
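For intuition, an equivalent sample-by-sample matrix can be computed with pandas; the toy counts below are random, and this is not the task's actual code path.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
counts = pd.DataFrame(rng.poisson(30, size=(8, 200)).astype(float),
                      index=[f"Sample{i + 1}" for i in range(8)])

# Correlate every sample vs. every other sample; transpose so samples are columns.
sim = counts.T.corr(method="pearson")   # 8 x 8 matrix of r values
# method="spearman" or method="kendall" selects the rank-based alternatives
print(sim.round(2))
```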

    Correlation across assays

    Correlation across assays should be used to perform correlation analysis across different modalities (e.g. ATAC-Seq enriched regions vs. RNA-Seq expression) for multiomics data analysis. It performs correlation analysis of every feature in one assay vs. every feature in the other assay. We recommend filtering the two count matrix data nodes to include only the features of interest, to reduce computation.

    • Select the data node to be compared to the node that the task has been invoked from using the Select data node button

    • Modify any parameters

    • Click Finish

    Correlation and similarity measures

    Features within same chromosome: this option restricts the comparison to feature pairs located on the same chromosome

    All features in one data node vs all features in the other data node: this option performs the comparison using all combinations, without a location constraint

    Report correlation pairs

    P-value: select a cut-off value for significance; only those pairs that meet the criterion will be reported

    abs(Correlation coefficient): select a cutoff for the absolute value of the correlation coefficient (represented by the symbol r), where a perfect relationship is 1 and no relationship is 0

    Correlation across assays produces a Correlation pair list data node; double-click to open the table. The table can be sorted and filtered using the column titles.

    Click View correlation plot to open the correlation plot for each feature pair.
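A brute-force sketch of the all-pairs computation and the reporting cutoffs is shown below; the two random matrices stand in for two filtered assays, and the cutoff values are examples only.

```python
import numpy as np
from itertools import product
from scipy import stats

rng = np.random.default_rng(2)
atac = rng.normal(size=(20, 10))   # 20 filtered regions x 10 matched samples (toy)
rna = rng.normal(size=(30, 10))    # 30 filtered genes x the same 10 samples

pairs = []
for i, j in product(range(atac.shape[0]), range(rna.shape[0])):
    r, p = stats.pearsonr(atac[i], rna[j])
    if p < 0.05 and abs(r) > 0.5:   # P-value and abs(Correlation coefficient) cutoffs
        pairs.append((i, j, r, p))  # one row per reported correlation pair
```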

    Sample correlation plot

    Sample correlation plot is a data visualization used to compare the values of all features between two samples. Sample correlation can be performed on any count matrix data node, whether it contains raw counts or normalized counts. When the Sample correlation page opens, you will be asked to select two samples for comparison. The sample in the top box will be shown on the X-axis, while the sample in the bottom box will be shown on the Y-axis. Click on the sample names to select a different sample, then hit Apply.

    A scatterplot is displayed on the right. Each dot is a feature (gene/transcript/protein), and the expression values in the two samples can be read off the coordinate axes, in the same units as present in the data node. The Pearson correlation coefficient and regression slope are displayed in the upper-right corner of the plot.
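The plot is straightforward to approximate with matplotlib; a sketch with synthetic counts for two hypothetical samples:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
x = rng.poisson(50, size=500).astype(float)   # expression in sample A (X-axis)
y = 0.9 * x + rng.normal(0, 5, size=500)      # expression in sample B (Y-axis)

r, _ = stats.pearsonr(x, y)
fit = stats.linregress(x, y)

plt.scatter(x, y, s=5)                                  # each dot is a feature
plt.plot(x, fit.slope * x + fit.intercept, color="red")
plt.title(f"Pearson r = {r:.3f}, slope = {fit.slope:.3f}")
plt.xlabel("Sample A")
plt.ylabel("Sample B")
plt.show()
```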

    Survival analysis

    An important aspect of survival analysis is “censored” data. Censored data refers to subjects that have not experienced the event being studied. For example, medical studies often focus on survival of patients after treatment so the survival times are recorded during the study period. At the end of the study period, some patients are dead, some patients are alive, and the status of some patients is unknown because they dropped out of the study. Censored data refers to the latter two groups. The patients who survived until the end of the study or those who dropped out of the study have not experienced the study event "death" and are listed as "censored".

    Cox Regression

    Cox regression (the Cox proportional-hazards model) tests the effects of factors (predictors) on survival time. Predictors that lower the probability of survival at a given time are called risk factors; predictors that increase the probability of survival at a given time are called protective factors. The Cox proportional-hazards model is similar to a multiple logistic regression, but it considers time-to-event rather than simply whether an event occurred or not. Cox regression should not be used for a small sample size because the events could accidentally concentrate in one of the cohorts, which will not produce meaningful results. Click here to read more about hazard ratio estimation in small samples.

    Configuring the Cox Regression Dialogue

    • Open the Cox Regression task in the task menu under Statistics for any counts node.

    • Next, select the Time, Event, and Event status. Connected Multiomics will automatically guess factors that might be appropriate for these options. Click Next to proceed with the task.

    • The predictors (factors or variables) and co-predictors in the model must be defined. Co-predictors are numeric or categorical factors that will be included in the Cox regression model. Time-to-event analysis is performed on features (e.g. genes) by default unless Use feature expression as predictor is unchecked. If unchecked, select a factor other than features and click Add factors to model a different variable. With the default setting, Use feature expression as predictor, the user can Add factors to the model that help explain the time-to-event relationship (co-predictors) in addition to features. Choose Add interaction to add co-predictors with known dependencies. If factors are added here, they cannot also be added as stratification factors. Click Next to proceed with the task.

    • Next, the user can define comparisons for the co-predictors if they have been added. Configure contrasts by moving factors into the numerator (e.g. experimental factor) or denominator (e.g. control factor / reference), choose Combine or Pairwise, and add the comparison which will be displayed below. Combine all numerator levels and combine all denominator levels in a single comparison or choose Pairwise to split all numerator levels and split all denominator levels into a factorial set of comparisons meaning every numerator will be paired with every denominator. Multiple comparisons from different factors can be added with Add comparison. Low value filter can be used to filter by excluding features; choose a filter or select none. Click Next to proceed with the task.

    • The user can select categorical factors to perform stratification if needed. Stratification is needed because the proportional hazards assumption holds only within each stratum, but not across the strata. When stratification factors are included, the proportional hazards assumption will hold for each combination of levels of the stratification factors; a separate submodel is estimated for each level combination and the results are aggregated. Click Finish to complete the task.

    • The results of Cox regression analysis provide key information to interpret, including:

      • Hazard ratio (HR): if the HR = 0.5 then half as many patients are experiencing the event compared to the control group, if the HR = 1 the event rates are the same in both groups, and if the HR = 2 then twice as many are experiencing an event compared to the control group.

      • HR limit: this is the confidence interval of the hazard ratio.

      • P-value: the lower the p-value, the greater the significance of the observation.

    (e.g. if you have selected both a co-predictor and a strata factor, then a comparison using the co-predictors and a Type III p-value for the co-predictor will be generated in the Cox regression report.)
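To build intuition for these report columns, here is a sketch of a comparable model fitted with the open-source lifelines package. The data frame is entirely hypothetical, and the platform's estimation details may differ.

```python
import pandas as pd
from lifelines import CoxPHFitter

# One row per subject: time-to-event, event status (1 = event, 0 = censored),
# a feature-expression predictor, and a numeric co-predictor.
df = pd.DataFrame({
    "time":       [5, 8, 12, 3, 9, 14, 7, 11, 6, 10],
    "status":     [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "expression": [2.1, 3.4, 1.2, 4.0, 2.8, 0.9, 3.1, 1.8, 2.5, 1.1],
    "age":        [54, 61, 47, 66, 58, 50, 63, 45, 59, 52],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="status")
cph.print_summary()   # exp(coef) is the hazard ratio; confidence limits and p-values follow
```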

    Kaplan-Meier Survival Curve

    The Kaplan-Meier task is used for comparing the survival curves among two or more groups of samples. The groups are defined by one or more categorical attributes (factors) specified by the user. As in the case of Cox Regression, it is possible to use feature expression data, if available. In that case, quantitative feature expression is converted into a feature-specific categorical attribute. Each combination of the attribute levels corresponds to a distinct group. If one selects three factors with 2, 3, and 5 levels, respectively, then the total count of compared groups is 2*3*5 = 30. Therefore, selecting too many factors and/or factors with many levels may not work, since the total number of samples may not be enough to fill all of the groups.

    To perform Kaplan-Meier survival analysis, at least two pieces of information must be provided for each sample: time-to-event (a numeric factor) and event status (a categorical factor with two levels). Time-to-event indicates the time elapsed between the enrollment of a subject in the study and the occurrence of the event. Event status indicates whether the event occurred or the subject was censored (did not experience the event). The survival curve is not drawn as straight lines connecting each point; instead, a staircase pattern is used, where each drop in the staircase represents an event occurrence.

    Getting started with the Kaplan-Meier task

    The Kaplan-Meier task begins similar to the Cox regression task, then differs when selecting categorical attributes to define the compared groups.

    For each feature (e.g. gene), the expression values are sorted in ascending order and placed into B bins of (roughly) equal size. As a result, a feature-specific categorical attribute with B levels is constructed, which can be used by itself or in combination with other categorical attributes. For instance, for B = 2, we take a given feature and compute its median expression. The samples are separated into two bins, depending on whether the expression in the sample is below or above the median. If two percentiles are chosen, the bins are automatically labeled "Low" and "High", but the text box can be used to re-label the bins. The bins are feature-specific since this procedure is repeated for each feature separately. A minimal sketch of this median split appears below.
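For B = 2 the binning amounts to a per-feature median split; a minimal sketch (whether a value exactly equal to the median falls in the Low or High bin is an implementation detail, and the choice below is arbitrary):

```python
import numpy as np

expr = np.array([1.2, 5.6, 3.3, 0.4, 7.8, 2.9])   # one feature, six samples (toy values)
median = np.median(expr)

# Samples at or below the feature's median go to "Low", the rest to "High".
bins = np.where(expr <= median, "Low", "High")
print(bins)   # repeated independently for every feature
```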

    For each group, the survival curve (aka survival function) is estimated using the Kaplan-Meier estimator [1]. For instance, if one selects ER status, which has two levels, and chooses two feature expression bins, four survival curves are displayed in the Data Viewer. The Grouping configuration option can be used to split and modify the connections.

    To see whether the survival curves are statistically different, the Kaplan-Meier task runs Log-rank and Wilcoxon (aka Wilcoxon-Gehan) tests. The null hypothesis is that the survival curves do not differ among the groups (the computational details are available in [2]). When feature expression is used, the p-values are also feature-specific. Select the step-plot icon under View to visualize the Kaplan-Meier survival curves for each gene.
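A sketch of the estimator and the Log-rank comparison using the open-source lifelines package follows; the data are hypothetical, and the Wilcoxon-Gehan variant is not shown.

```python
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

df = pd.DataFrame({
    "time":   [5, 8, 12, 3, 9, 14, 7, 11],
    "status": [1, 0, 1, 1, 0, 1, 1, 0],     # 1 = event, 0 = censored
    "group":  ["Low", "Low", "High", "Low", "High", "High", "Low", "High"],
})

kmf = KaplanMeierFitter()
for name, sub in df.groupby("group"):
    kmf.fit(sub["time"], event_observed=sub["status"], label=name)
    kmf.plot_survival_function()          # one staircase curve per group

low = df[df["group"] == "Low"]
high = df[df["group"] == "High"]
res = logrank_test(low["time"], high["time"],
                   event_observed_A=low["status"], event_observed_B=high["status"])
print(res.p_value)   # null hypothesis: the survival curves do not differ
```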

    Choosing stratification factors

    Like in Cox Regression task, it is possible to choose stratification factor(s), but the purpose and meaning of stratification are not the same as in Cox Regression. Suppose we want to compare the survival among the four groups defined by the two levels of ER status and the two bins of feature expression. We can select the two factors on “Select group factor(s)” page. In that case, the reported p-values will reflect the statistical difference among the four survival curves that are due to both ER status and the feature expression. Imagine that our primary interest is the effect of feature expression on survival. Although ER status can be important and therefore should be included in the model, we want to know whether the effect of feature expression is significant after the contribution of ER status is taken into account. In other words, the goal is to treat ER status as a nuisance factor and the binned feature expression as a factor of interest.

    In qualitative terms, it is possible to obtain an answer if we group the survival curves by the level of ER status. This can be achieved in the Data Viewer by choosing Grouping > Split by under Configure. That makes it easy to compare the survival curves that have the same level of ER status and avoid the comparison of curves across different levels of ER status.

    If in the plot, we see one or more subplots where the survival curves differ a lot, that is evidence that the feature expression affects the survival even after adjusting for the contribution of ER status. To obtain an answer in terms of adjusted Log-rank and Wilcoxon p-values, one should deselect ER status as a “group factor” and mark it as a stratification factor instead.

    The computation of stratification adjusted p-values is elaborated in [2].

    Suppose when the feature expression and ER status are selected as “group factors”, Log-rank p-value is 0.001, and when ER status is marked as stratification factor, the p-value becomes 0.70. This means that ER status is very useful for explaining the difference in survival while the feature factor is of no use if ER status is already in the model. In other words, the marginal contribution of the binned expression factor is low.

    If more than two attributes are present, it is possible to measure the marginal contribution of any single factor in a similar manner: the attribute of interest should be selected as “group factor” and the other attributes should be marked as stratification factors. There is no limit on the count of factors that can be selected as “group” or stratification, except that all of the selected factors are involved in defining the groups and the groups should contain enough samples (at least, be non-empty) for the results to be reliable.

    References

    [1] Kaplan-Meier (product limit) estimator: https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator

    [2] Klein, Moeschberger (1997), Survival Analysis: Techniques for Censored and Truncated Data. ISBN-13: 978-0387948294

    Task Menu

    The Task Menu lists all the tasks that can be performed on a specific node. It can be invoked from either a Data or Task node and appears on the right-hand side of the Analyses tab. It is context-sensitive, meaning that it will only present tasks that the user can perform on the selected node. For example, selecting a Differential analysis report data node will not present normalization as an option.

    Clicking a Data node presents a variety of tasks:

    <figure><img src="../../../../.gitbook/assets/image (152).png" alt=""><figcaption></figcaption></figure>
    <figure><img src="../../../../.gitbook/assets/image (153).png" alt=""><figcaption></figcaption></figure>
    <figure><img src="../../../../.gitbook/assets/image (154).png" alt=""><figcaption></figcaption></figure>
    <figure><img src="../../../../.gitbook/assets/image (155).png" alt=""><figcaption></figcaption></figure>

  • QA/QC

    • Feature distribution

    • Imported count matrix report

    • Single-cell QA/QC

    • Cell barcode QA/QC

  • Annotation/Metadata

    • Annotate cells

    • Annotate features

    • Publish cell attributes to project

  • Pre-analysis tools

    • Pseudobulk

    • Split by feature type

    • Generate group cell counts

  • Methylation analysis

    • Detect Differential Methylation

  • Filtering

    • Filter features

    • Filter samples or cells

    • Split by attribute

  • Normalization and scaling

    • Normalization

    • Normalize to housekeeping genes

    • Normalize to baseline

  • Batch removal

    • General linear model

    • Harmony

    • Seurat3 integration

  • Statistics

    • Differential Analysis

      • DESeq2

      • Hurdle model

      • Welch's ANOVA

  • Exploratory analysis

    • Graph-based Clustering

    • K-means Clustering

    • Compare Clusters

  • Region analysis

    • Get Regional Methylation

    • Annotate Regions

  • Trajectory analysis

    • Trajectory Analysis (Monocle 2)

    • Trajectory Analysis (Monocle 3)

  • Variant Analysis

    • Annotate Variants

    • Annotate Variants (SnpEff)

    • Annotate Variants (VEP)

  • Combine multiomics data

    • Combine 5-base methylation and variant data

  • Motif Detection

  • miRNA integration

    • Get targeted mRNA

  • Biological interpretation

    • Gene Set Enrichment

    • GSEA

    • Gene set ANOVA

  • Classification

    • Classify cell type

    • Train classifier

    • ScType

  • Conversion

  • Task actions

    Hierarchical clustering / heatmap

    Hierarchical clustering / heatmap is a statistical method used to assign similar objects into groups called clusters. It is typically performed on results of statistical analyses, such as a list of significant genes/transcripts, but can also be invoked on the full data set, as a part of exploratory analysis.

    Hierarchical clustering / heatmap is an unsupervised technique, meaning that the number of clusters is not specified upfront. In the beginning, each row and/or column is considered a separate cluster. The two most similar clusters are combined, and combining continues until all objects are in the same cluster. Hierarchical clustering produces a tree (called a dendrogram) that shows the hierarchy of the clusters.

    This tutorial will illustrate:

    • Invoking Hierarchical Clustering

    • Heatmap

    Invoking Hierarchical Clustering

    To invoke hierarchical clustering, select a data node containing count data (e.g. Gene counts, Normalized counts, Single cell counts), or a Feature list data node (to cluster significant genes/transcripts) and then click on the Hierarchical clustering / heatmap option in the context sensitive menu.

    The hierarchical clustering setup dialog enables you to control the clustering algorithm. Starting from the top, you can choose to plot a Heatmap or a Bubble map (clustering can be performed on both plot types). Next, perform Ordering by selecting Cluster for either feature order (genes/transcripts/proteins) or cell/sample/group order or both. Note the context-sensitive image that helps you decide to either perform hierarchical clustering (dendrogram) or assign order (arrow) for the columns and rows to help you orient yourself and make decisions.

    When Assign order is chosen, the default order of cells/samples/groups (rows) is based upon the labels as displayed in the Data tab, and the order of features (columns) depends on the input data of the data node.

    Feature order can be assigned by selecting a managed list (e.g. feature lists saved from report nodes, or lists added under list management in the settings) from the drop-down. This limits the plot to the features in the list, in the order they are listed. If a feature from the list is not available in the input data node, it will not be shown in the plot. Note that if none of the features from the list are available in the data node, the task cannot be performed and an error message will be shown.

    Cell/Sample/Group order can also be assigned by choosing an attribute from the drop-down list. Click and drag to rearrange categorical attributes; numeric attributes can be sorted in ascending or descending order (note the arrows in the image, which are different from the dendrogram shown for Cluster).

    Another way to invoke a heatmap without performing clustering is via the data viewer. When you select the Heatmap icon in the available plots list, data nodes that contain two-dimensional matrices can be used to draw this type of plot. A bubble map can be plotted similarly (use the arrow on the heatmap icon to select a Bubble map) for descriptive statistics that have been generated in the data analysis pipeline.

    If you do not want to cluster all the samples, but rather a subset based on a specific sample or cell attribute (i.e. group membership), check Filter cells under Filtering and set a filtering rule using the drop-down lists. Notice that the drop-down lists allow more than one factor (when available) to be selected at a time. When configuring the filtering rule, use AND to require that all conditions pass for inclusion, and use OR to allow any condition to pass.

    Hierarchical clustering uses distance metrics to sort based on similarity, and is set to Average Linkage by default. This can be adjusted by clicking Configure under Advanced options. You can also choose how the data are scaled (sometimes referred to as normalized). There are three Feature scaling options. Standardize (the default for a heatmap) scales each feature to a mean of zero and a standard deviation of one, so that all features (e.g. genes or proteins) have equal weight; standardized values are also known as Z-scores. The Shift option sets each feature's mean to zero without rescaling. Choose None to perform clustering on the unscaled values in the input data node (this is the default for a bubble map). If a bubble map is scaled, scaling is performed on the group summary statistic (color).

    Cluster distance metric for cells/samples and features is used to determine how the distance between two clusters will be calculated:

    • Single Linkage: the distance between two clusters is determined by the distance of the closest objects in the two clusters

    • Complete Linkage: the distance between two clusters is equal to the distance between the two furthest members of those clusters

    • Average Linkage: the average distance between all the pairs of objects in the two different clusters is used as the measure of distance between the two clusters

    • Centroid method: the distance between two clusters is equal to the distance between the centroids of those clusters

    • Ward's method: the distance between two clusters is designed to minimize the size of an error measure based on the sum of squares

    Point distance metric is used to determine the distance between two rows or columns.
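The procedure corresponds to standard agglomerative clustering. A scipy sketch with default-like settings (average linkage over Euclidean point distances on standardized toy data; not the platform's implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 50))                 # 10 samples x 50 features (toy data)

Z = (X - X.mean(axis=0)) / X.std(axis=0)      # Standardize: z-scores per feature

D = pdist(Z, metric="euclidean")              # point distance metric
link = linkage(D, method="average")           # cluster distance metric: average linkage
dendrogram(link)                              # the sample dendrogram
top3 = fcluster(link, t=3, criterion="maxclust")  # e.g. color the top 3 clusters
```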

    Heatmap

    The output of a Hierarchical clustering task can be a heatmap or a bubble map with or without dendrograms depending on whether you performed clustering on cells/samples/groups or features. By default, samples are on rows (sample labels are displayed as seen in the Data tab) and features (depending on the input data) are on columns. Colors are based on standardized expression values (default selection; performed on the fly). Dendrograms show clustering of rows (samples) and columns (variables).

    Depending on the resolution of your screen and the number of samples and variables (features) that need to be displayed, some binning may be involved. If there are more samples/genes than pixels, values of neighboring rows/columns will be averaged together. Use the mouse wheel to zoom in and out. When you zoom in to a certain level on the heatmap, each cell will represent one sample/gene. When you mouse over the row dendrogram or label area and zoom, it will only zoom in/out on the rows; the binning on the columns will remain the same. Similarly, when you mouse over the column dendrogram or label area and zoom, it will only zoom in/out on the columns, and the binning on the rows will remain the same. To move the map around when zoomed in, hold down the left mouse button and drag the map. The plot can be saved as a full-size image or as the current view; when Save image is clicked, a prompt will ask how you would like to save the image.

    Bubble map

    The Hierarchical clustering task can also be used to plot a bubble map. Let's go through the steps to make a bubble map:

    • Choose to plot a Bubble map (note the selection of a bubble map in the image which is different from the heatmap). This will open the Bubble map settings.

    • Configure the Bubble map settings. First, Group cells by an available categorical attribute (e.g. cell type). Next, summarize the group's first dimension by color (Group summary method), then choose an additional dimension to plot size (Additional statistic) using the drop-down lists. If these settings are not adjusted, the defaults generate two descriptive statistic measurements that plot the group mean by color and the percent of cells by size. Hierarchical clustering can be performed on the first assigned dimension (color), which is the Group summary method. The second dimension (size), an Additional statistic, is not required but is selected by default (it can be unchecked with the checkbox).

    • Ordering the plot columns (Feature order) and rows (Group order) behaves the same as in a heatmap. In this example, Ordering both features and groups by Cluster uses hierarchical clustering with distance metrics (default settings will be used, but these metrics can be changed under Configure in the Advanced options section). Alternatively, Assign order to features using a managed (saved) feature list or the default order, which depends on the input data. Assign order for groups can be used to rearrange the attribute by drag and drop, in ascending or descending order, or in the default order, which is how the labels are displayed in the Data tab.

    • Filtering can be applied to the groups by checking Filter cells, then specifying the logical operations to filter by (this is the same as for a heatmap).

    • Advanced options let the user perform Feature scaling (e.g. Standardize by z-score), but in a bubble map the default is set to None. It also allows the user to change the Group clustering and Feature clustering options by altering the Cluster distance metrics and Point distance metrics (similar to a heatmap).

    • The label size can be changed by specifying the number of pixels using Max size and Font. If an Ensembl annotation model has been associated with the data, you can choose to display the gene name or the Ensembl ID using the Content option.

    Configuration

    There are plot Configuration/Action options for the Hierarchical clustering / heatmap task which apply to both the heatmap and bubble map in the Data viewer (below): Axes, Heatmap, Dendrograms, Annotations, and Description. Click on the icon to open these configuration options.

    Axes

    • This section controls the Content or data source used to draw the values in the heatmap or bubble map and also the ability to transpose the axes. The plot is a color representation of the values in the selected matrix. Most of the data nodes contain only one matrix, so it will just say Matrix for the chosen data node. However, if a data node contains multiple matrices (e.g. descriptive statistics were performed on cluster groups for every gene like mean, standard deviation, percent of cells, etc) each statistic will be in a separate matrix in the output data node. In this case, you can choose which statistic/matrix to display using the drop-down list (this would be the case in a bubble map).

    • To change the orientation (switch the columns and rows) of the plot, click on the Transpose toggle switch.

    • Row labels and Column labels can be turned on or off by clicking the relevant toggle switches.

    Heatmap

    This section is used to configure the color, range, size, and shape of the components in the heatmap.

    • In the color palette horizontal bar, the left side color represents the lowest value and the right side color represents the highest value in the matrix data. Note that when you zoom in/out the lowest and highest values captured by the color palette may change. By default, there are 3 color stops: minimum, middle, and maximum color value of the default range calculated on the matrix. Left-click on the middle color stop and drag left or right to change the middle value this color stop represents. If you left-click on the middle color stop once, you can change the color and value this color stop represents. Click on the (X) to remove this color stop.

    • Click on the color square or the adjacent triangle to choose a color to represent the value. This will display a color picker dialog which allows selection of a color, either by clicking or by typing a HEX color code, then clicking OK.

    • The min and max color stops cannot be dragged or removed. If you left-click on them, you can choose a different color. When you click on the Palette bar, you can add a new color stop between min and max. Adding a color stop can be useful when there is an outlier value in the data. You can use a different color to represent different value ranges.

    • Right-clicking a color stop will reveal a list of options. Space colors evenly will rearrange the position of the stops so there is an equal distance between all stops. Center will move a stop to the middle of the two adjacent stops. Delete will remove the stop.

    • In addition to color, you can also use the Size drop-down list to size components by a set of values from another matrix stored in the same data node. Most of the data nodes contain only one matrix, so the only options available in the Size drop-down will be None or Matrix. In cases where you have multiple matrices, you might want to use the color of the component in the heatmap to represent one type of statistic (like the mean of the groups) and the size of the component to represent a different statistic (like the standard deviation).

    • The shape of the heatmap cell (component) can be configured as either a rectangle or a circle by selecting the radio button under Shape.

    Dendrograms

    If cluster analysis is performed on samples and/or features, the result will be displayed as dendrograms. By default, the dendrograms are all colored in black.

    The color of the dendrograms can be configured.

    • Click on the color square or its triangle to choose a different color for the dendrogram.

    • When By cluster is selected in the Row/Column color drop-down list, the number of clusters needs to be specified. The top N clusters will be shown in N different colors.

    Annotations

    This section allows you to add sample or cell level annotations to the viewer. First, make sure to choose the data node which contains the annotation information you would like to use by clicking the circle (). All project-level annotations will be available on all data nodes in the pipeline.

    • Choose an attribute from the Row annotation drop-down list. Multiple attributes can be chosen from the drop-down list and can be reordered by clicking and dragging the groups below the drop-down list. Each attribute is represented as an annotation bar next to the heatmap. Different colors represent the different groups in the attribute.

    • The width of the annotation bar can be changed using the Block size slider when the Show labels toggle switch is on.

    • The annotation label font size can be changed by specifying the size in pixels.

    • The Fill blocks toggle switch adds or removes color from the annotation labels.

    Description

    Description is used to modify the Title and toggle on or off the Legend.

    In-plot controls

    The heatmap has several different mouse modes which modify the way the plot responds to the mouse buttons. The mode buttons are in the upper right corner of the heatmap. Clicking one of these buttons puts the heatmap into that mode.

    • In point mode (), you can left-click and drag to move around the heatmap (if you are not fully zoomed out). Left-clicking once on the heatmap or on a dendrogram branch will select the associated rows/columns.

    • In selection mode (), you can click and drag to select a range of rows, columns, or components.

    • In flip mode (), you can click on a line in the dendrogram (which represents a cluster branch) and the locations of the two legs of the branch will be swapped. If no clustering was performed (no dendrogram is generated), in this mode you can click on the label of an item (observation or feature), then drag and drop to manually switch the order of the rows or columns on the heatmap.

    • Click on reset view () to reset to the default.

    • The Save image icon () enables you to download the heatmap to your local computer. If the heatmap contains up to 2.5M cells (features * observations), you can choose between saving the current appearance of the heatmap window (Current view) and saving the entire heatmap (All data). Depending on the number of features/observations, Connected Multiomics may not be able to fit all the labels on the screen, due to the limit imposed by the screen resolution. The All data option provides an image file of sufficient size that all the labels are readable (in turn, that image may not fit the computer screen, and the image file may be quite large). If the heatmap exceeds 2.5M cells, the Current view option will not be shown, and only a single dialog will appear.

    • After selecting either the Current view (if applicable) or All data button, the next dialog will allow you to specify the image format, size, and resolution.

    Poisson/Negative binomial/GSA

    Poisson regression is a type of generalized linear model (GLM). It assumes a Poisson distribution, where the mean equals the variance, and it allows differential expression to be detected in groups with no replicates, e.g. one treated sample vs. one control sample.

    Negative binomial regression is also a GLM. It is used for over-dispersed data, where the variance is greater than the mean.

    GSA stands for gene specific analysis; its goal is to identify the statistical model that is best for a specific gene among all the selected models, and then use that best model to calculate the p-value and fold change.

    GSA dialog

    The first step of GSA is to choose which attributes to include in the test. All sample attributes, including numeric and categorical attributes, are displayed in the dialog; use the check buttons to select among them. An experiment with two attributes, Cell type (with groups A and B) and Time (time points 0, 5, 10), is used as an example in this section.

    Click Next to display the levels of each attribute to be selected for sub-group comparisons (contrasts).

    To compare A vs. B, select A for Cell type on the top, B for Cell type on the bottom and click Add comparison. The specified comparison is added to the table below

    To compare Time point 5 vs. 0, select 5 for Time on the top, 0 for Time on the bottom, and click Add comparison.

    To compare cell types at a certain time point, e.g. time point 5, select A and 5 on the top, and B and 5 on the bottom. Thereafter click Add comparison.

    Multiple comparisons can be computed in one GSA run; the example below shows the above three comparisons added to the computation.

    In terms of the design pool, i.e. the choices of model designs to select from, the two factors in this example data lead to seven possibilities in the design pool:

    • Cell type

    • Time

    • Cell type, Time

    • Cell type, Time, Cell type * Time

    • Cell type, Cell type * Time

    • Time, Cell type * Time

    • Cell type * Time

    In GSA, if a 2nd order interaction term is present in the design, then all first order terms must be present; this means that if the Cell type * Time interaction is present, both factors must be included in the model. In other words, the following designs are not considered:

    • Cell type, Cell type * Time

    • Time, Cell type * Time

    • Cell type * Time

    If a comparison is added, models that do not contain the comparison factors are also eliminated. E.g. if a comparison on Cell type A vs. B is added, only designs that include the Cell type factor will be in the computation. These are:

    • Cell type

    • Cell type, Time

    • Cell type, Time, Cell type * Time

    The more comparisons on different terms are added, the fewer models will be included in the computation. If the following comparisons are added in one GSA run:

    • A vs B (Cell type)

    • 5 vs 0 (Time)

    only the following two models will be computed:

    • Cell type, Time

    • Cell type, Time, Cell type * Time

    If comparisons on all the three terms are added in one GSA run:

    • A vs B (Cell type)

    • 5 vs 0 (Time)

    • A*5 vs B*5 (Cell type * Time)

    then only one model will be computed:

    • Cell type, Time, Cell type * Time

    GSA advanced options

    Click on Configure to customize Advanced options.

    Multiple test correction

    Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default (1). Other options, like Storey q-value (2) and Bonferroni, are provided; select one method at a time. None means no multiple test correction will be performed.

    FDR step-up:

    Suppose there are n p-values (n is the number of features). The p-values are sorted in ascending order, p(1) ≤ p(2) ≤ ... ≤ p(n), and m represents the rank of a p-value. The calculation compares p(m) × (n/m) with the specified alpha level, and the cut-off p-value is the one that generates the last product that is less than the alpha level. The goal of the step-up method is to find:

    K* = max{ k : p(k) × (n/k) ≤ alpha }

    Define the step-up value as:

    S(k) = min over k ≤ j ≤ n of (n/j) × p(j)

    Then an equivalent definition for K* is:

    K* = max{ k : S(k) ≤ alpha }

    In order to find K*, start with S(n) and then go up the list until you find the first step-up value that is less than or equal to alpha.

    Storey q-value:

    The q-value is the minimum "positive false discovery rate" (pFDR) that can occur when rejecting a statistic.

    For an observed statistic T = t and a nested set of rejection regions {C}, the q-value of t is the smallest pFDR over all rejection regions in the set that contain t:

    q(t) = min over { C : t ∈ C } of pFDR(C)

    Bonferroni:

    Suppose there are n p-values (n is the number of features). If each test is performed at level alpha, the expected number of Type I errors would be given by n × alpha; thus the significance level of each individual test should be adjusted to alpha/n. Alternatively, the p-values can be adjusted as pB = p × n, where pB is the Bonferroni-corrected p-value. If pB is greater than 1, it is set to 1.
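Both corrections are easy to verify numerically. The sketch below returns adjusted p-values for a toy vector; it is an independent re-implementation, not the platform's code.

```python
import numpy as np

def fdr_step_up(p):
    """Benjamini-Hochberg adjusted p-values: running minimum of p(m) * (n/m)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)                        # sort p-values in ascending order
    scaled = p[order] * n / np.arange(1, n + 1)  # p(m) * (n/m) for rank m
    stepup = np.minimum.accumulate(scaled[::-1])[::-1]  # step-up values from the tail
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(stepup, 1.0)
    return adjusted

def bonferroni(p):
    """pB = p * n, capped at 1."""
    return np.minimum(np.asarray(p, dtype=float) * len(p), 1.0)

pvals = [0.001, 0.008, 0.04, 0.2, 0.7]
print(fdr_step_up(pvals))   # features with adjusted p <= alpha are significant
print(bonferroni(pvals))
```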

    Report option

    This section configures how to select the best model for a feature. There are two options for Model selection criterion: AICc (Akaike Information Criterion corrected) and AIC (Akaike Information Criterion). AICc is recommended for small sample sizes, while AIC is recommended for medium and large sample sizes (3). Note that as the sample size grows from small to medium, AICc converges to AIC. Taking the AICc/AIC value into account, GSA considers the model with the lowest information criterion as the best choice.

    In the results, the best model's Akaike weight is also generated. The model's weight is interpreted as the probability that the model would be picked as the best if the study were reproduced. The range of the Akaike weight is between 0 and 1, where 1 means the best model is clearly superior to the other candidates in the model pool; if the best model's Akaike weight is close to 0.5, on the other hand, the best model is likely to be replaced by another candidate if the study were reproduced. One still uses the best model, but the accuracy of that best shot is fairly low.

    The default value for Enable multimodel approach is Yes. It means that the estimation will utilize all models in the pool by assigning weights to them based on AIC or AICc. If No is selected instead, the estimation is based on only one best model which has the smallest AIC or AICc.

    The output p-value will be different depending on the selected option for multimodel, but the fold change is the same. Multimodel approach is recommended when the best model's Akaike weight is not close to 1, meaning that the best model is not compelling.
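The criterion and the weights are simple functions of each model's fit. A sketch with hypothetical values (k is the number of estimated parameters, n the sample size):

```python
import numpy as np

def aicc(log_lik, k, n):
    """AIC with the small-sample correction; converges to AIC as n grows."""
    aic = -2.0 * log_lik + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

def akaike_weights(ic):
    """Weight of each model: the probability it would be picked as best on replication."""
    ic = np.asarray(ic, dtype=float)
    delta = ic - ic.min()           # difference from the best (lowest) criterion
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# Hypothetical AICc values for three candidate designs of one feature:
print(akaike_weights([102.3, 104.1, 110.8]))  # a weight near 1 = compelling best model
```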

    By default, Display shrinkage plots is set to No in the report. Choose Yes to display shrinkage plots in the report.

    There are situations when a model estimation procedure does not fail outright, but still encounters some difficulties. In this case, it can even generate p-values and fold changes for the comparisons, but those values are not reliable and can be misleading. It is recommended to use only reliable estimation results, so the default option for Use only reliable estimation results is set to Yes.

    Model types configuration

    Connected Multiomics provides five response distribution types for each design model in the pool, namely:

    • Normal

    • Lognormal (the same as ANOVA task)

    • Lognormal with shrinkage (the same as limma-trend)

    • Negative binomial

    • Poisson

    We recommend using the lognormal with shrinkage distribution (the default); an experienced user may want to click Custom to configure the model type and p-value type.

    If multiple distribution types are selected, then the number of total models that is evaluated for each feature is the product of the number of design models and the number of distribution types. In the above example, suppose we have only compared A vs B in Cell type, then the design model pool will have the following three models:

    • Cell type

    • Cell type, Time

    • Cell type, Time, Cell type * Time

    If we select Lognormal with shrinkage and Negative binomial, i.e. two distribution types, the best model fit for each feature will be selected from 3 * 2 = 6 models using AIC or AICc.

    The design pool can also be restricted by Min error degrees of freedom. When Model types configuration is set to Default, this is automated as follows: it is desirable to keep the error degrees of freedom at or above six, therefore the threshold is automatically set to the largest k, 0 <= k <= 6, for which admissible models exist. An admissible model is one that can be estimated given the specified contrasts. In the above example, when we compare A vs. B in Cell type, there are three possible design models. The error degrees of freedom of the model Cell type are the largest, and those of the model Cell type, Time, Cell type * Time are the smallest:

    k(Cell type) > k(Cell type, Time) > k (Cell type, Time, Cell type*Time)

    If the sample size is big, k >=6 in all three models, all the models will be evaluated and the best model will be selected for each feature. However, if the sample size is too small, none of the models will have k >=6, then only the model with maximal k will be used in the calculation. If the maximal k happens to be zero, we are forced to use Poisson response distribution only.

    There are two types of p-value: F and Wald. Poisson, negative binomial, and normal models can generate p-values using either Wald or F statistics. Lognormal models always employ the F statistic; the more replicates in the study, the smaller the difference between the two options. When there are no replicates, only Poisson can be used, and it generates p-values using Wald.

    Note: Connected Multiomics keeps track of the log status of the data; regardless of whether GSA is performed on logged data or not, the LSMeans, ratio, and fold change calculations are always in linear scale. Ratio is the ratio of the two LSMeans from the two groups in the comparison (left is the numerator, right is the denominator). Fold change is converted from ratio: when the ratio is greater than 1, fold change is the same as the ratio; when the ratio is less than 1, fold change is −1/ratio. In other words, the fold change value is always >= 1 or <= −1; there is no fold change value between −1 and 1. When the LSMean of the numerator group is greater than that of the denominator group, fold change is greater than 1; when the LSMean of the numerator group is less than that of the denominator group, fold change is less than −1; when the two group means are the same, fold change is 1. Log ratio is the ratio log2 transformed, which is equivalent to log fold change in some other software.
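The ratio-to-fold-change convention reduces to a one-line rule, illustrated below:

```python
def fold_change(ratio):
    """Signed fold change from a linear-scale ratio of LSMeans."""
    return ratio if ratio >= 1 else -1.0 / ratio

assert fold_change(2.0) == 2.0    # numerator mean twice the denominator mean
assert fold_change(0.5) == -2.0   # numerator mean half the denominator mean
assert fold_change(1.0) == 1.0    # equal group means
```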

    GSA report

    If there are multiple design models and multiple distribution types included in the calculation, the fraction of genes using each model and type will be displayed as pie charts in the task result.

    A feature list with the p-value and fold change generated from the selected best model is displayed in a table with other statistical information. By default, the gene list table is sorted by the first p-value column.

    The following information is included in the table by default:

    • Feature ID information: if transcript level analysis was performed, and the annotation file has both transcript and gene level information, both gene ID and transcript ID are displayed. Otherwise, the table shows only the available information.

    • Each contrast outputs the p-value, FDR step-up p-value, ratio and fold change in linear scale, and the LSMean of each group in the comparison in linear scale

    When you click on the Optional columns link in the top-right corner of the table, extra information will be displayed in the table when selected:

    • Total counts: sum of the reads across all the observations from the input data

    • Maximum counts: maximum number of reads counts across all the observations from the input data.

    • Geometric mean: geometric mean value of the input counts across all observations.

    • Arithmetic mean: arithmetic mean value of input counts across all observations.

    Click on the View extra details report () icon under the View section to get more statistical information about the feature. In cases where the task doesn't fail but certain statistical information is not generated (e.g. the p-value and/or fold change of a certain comparison are not generated for some or all features), click on this icon and mouse over the red exclamation icon to get more information.

    By clicking on Optional columns, you can retrieve more statistical results, e.g. Average coverage, which is the geometric mean of normalized reads in linear scale across all the samples; fold change lower/upper limits generated from the 95% confidence interval; and feature annotation information, if there are additional annotation fields in the annotation model specified for quantification, like genomic location, strand information, etc.

    To the left of each contrast header, there is a volcano plot icon (). Select it to display the volcano plot for the chosen contrast.

    Feature list filter panel is on the left of the table. Click on the black triangle ( ) to collapse and expand the panel.

    Select the check box of the field and specify the cutoff by typing directly or using the slider. Press Enter to apply. After the filter has been applied, the total number of included features will be updated on the top of the panel (Result).

    The filtered result can be saved into a filtered data node by selecting the Generate filtered node button at the bottom of the filter panel. Save as managed list can add the list of features to list management.

    Selecting the Download button at the upper-left corner of the table downloads the table as a text file to the local computer.

    If any shrinkage methods were selected for GSA, and display shrinkage plots is set to Yes in the Advanced option configure dialog, a shrinkage plot is generated in the report.

    X-axis shows the log2 value of average coverage. The plot helps to determine the threshold of low expression features. If there is an increase before a monotone decrease trend on the left side of the plot, you need to set a higher threshold on the low expression filter.

    References

    1. Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS, B, 57, 289-300.

    2. Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31: 2013-2035.

    3. Auer PL, Doerge RW. A two-stage Poisson model for testing RNA-seq data. Stat Appl Genet Mol Biol. 2011; 10(1).

    4. Burnham KP, Anderson DR. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer; 2010.


