QA/QC

Connected Multiomics contains a number of quality control tools and reports that can be used to evaluate the current status of your analysis and decide on downstream steps. Quality control tools are organized under the Quality Assurance / Quality Control (QA/QC) section of the context-sensitive menu and are available for different types of data nodes.

This section will illustrate:

Feature distribution
Imported count matrix report
Single-cell QA/QC
Cell barcode QA/QC
5-base Methylation QC

In addition to the tools listed above, many other functions can also serve as quality control checks, for instance principal components analysis, hierarchical clustering (at the sample level), the variant detection report, and the quantification report.

Feature distribution

The Feature distribution plot visualizes the distribution of features in a counts matrix data node.

Running Feature distribution

To run Feature distribution:

Click a counts data node
Click the QA/QC section of the toolbox
Click Feature distribution

A new task node is generated with the Feature distribution report.

Feature distribution plot configuration

The Feature distribution task report plots the distribution of all features (genes or proteins) in the input data node with one feature per row. Features are ordered by average value in descending order.

The plot can be configured using the panel on the left-hand side of the page.

Filter

Using the filter, you can choose which features are shown in the task report.

The Manual filter lets you type a feature ID (such as a protein ID) and filter to matching features by clicking +. You can add multiple feature IDs to filter to multiple features.

The List filter lets you filter to the features included in a feature list. To learn more about feature lists, please see List management.

Plot type

Distributions can be plotted as histograms (the default), with the expression value on the x-axis and the frequency on the y-axis, or as strip plots, where the x-axis is the expression value and each cell/sample is shown as a thin vertical line, or strip, at its value.

To switch between plot types, use the Plot type radio buttons.

Mousing over a dot in the histogram plot gives the range of feature values that are being binned to generate the dot and the number of cells/samples for that bin in a pop-up.

Mousing over a strip shows the sample ID and feature value in a pop-up. If there are multiple cells/samples with the same value, only one strip will be visible for those cells/samples and the mouse-over will indicate how many cells/samples are represented by that one strip.

Clicking a strip will highlight that cell/sample in all of the plots on the page. The grey dot in each strip plot shows the median value for that feature. To view the median value, mouse over the dot.

Page

To navigate between pages, use the Previous and Next buttons, or type the page number in the text field and press Enter on your keyboard.

The number of features that appear in the plot on each page is set by the Items per page drop-down menu. You can choose to show 10, 25, or 50 features per page.
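As a rough illustration of how the descending-average ordering and the paging interact, the following sketch sorts features by their mean value and slices out one page. The function names and the toy matrix are hypothetical, and this is not the product's implementation.

```python
from statistics import mean


def order_features(matrix):
    """Return feature IDs sorted by average value, highest first.

    `matrix` maps a feature ID to its list of values across cells/samples.
    """
    return sorted(matrix, key=lambda feature: mean(matrix[feature]), reverse=True)


def page_of(features, page, items_per_page=10):
    """Return the features shown on a 1-based page number."""
    start = (page - 1) * items_per_page
    return features[start:start + items_per_page]


# Toy data (hypothetical): three features measured in three samples.
counts = {"GeneA": [1, 2, 3], "GeneB": [10, 0, 20], "GeneC": [5, 5, 5]}
ordered = order_features(counts)
print(ordered)                                     # highest average first
print(page_of(ordered, page=2, items_per_page=2))  # features on page 2
```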

Color by

You can add attribute information to the plots using the Color by drop-down menu.

For histogram plots, the histograms will be split and colored by the levels of the selected attribute. You can choose any categorical attribute.

For strip plots, the sample/cell strips will be colored by the levels or values of the selected attribute. You can choose any categorical or numeric attribute.

Imported count matrix report

The Imported count matrix report summarizes the sample distribution of imported count matrix data, e.g. ILMN miRNA count matrix data. To run the report:

Click an imported miRNA data node
Click the QA/QC section of the toolbox
Click Imported count matrix report

A new task node is generated with the Imported count matrix report. Double-click the report to open it.

The title of the Feature distribution table displays the size of the matrix: the number of samples and the number of features. Each row in the table is a sample, and the columns contain descriptive statistics of the features in that sample.

If there are fewer than 30 samples in the data node, a bar chart is presented. Each bar is a sample. The x-axis is the read count range, and the y-axis is the number of features within the range. Hovering your mouse over a bar displays the following information:

Sample name
Range of read counts: “[” or “]” marks an inclusive bound and “(” or “)” marks an exclusive bound, e.g. [0,0] means 0 read counts, and (0,10] means greater than 0 counts and less than or equal to 10 counts.
Number of features within the read count range
Percentage of the features within the read count range
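The interval notation and the per-range tallies can be illustrated with a short sketch. The bin edges below are hypothetical (the report's actual ranges are not stated here), and the function names are made up for the example.

```python
from collections import Counter


def bin_label(count, edges=(0, 10, 100, 1000)):
    """Return the interval label for a non-negative read count.

    `edges` are hypothetical upper bounds; zero gets its own [0,0] bin.
    """
    if count == 0:
        return "[0,0]"
    lower = 0
    for upper in edges[1:]:
        # Exclusive lower bound, inclusive upper bound: (lower, upper]
        if lower < count <= upper:
            return f"({lower},{upper}]"
        lower = upper
    return f"({lower},inf)"


def summarize_sample(feature_counts, edges=(0, 10, 100, 1000)):
    """Map each interval label to (number of features, percentage of features)."""
    tally = Counter(bin_label(c, edges) for c in feature_counts)
    total = len(feature_counts)
    return {label: (n, 100.0 * n / total) for label, n in tally.items()}


# Toy sample (hypothetical): read counts for six features.
for label, (n, pct) in summarize_sample([0, 0, 3, 10, 11, 250]).items():
    print(f"{label}: {n} features ({pct:.1f}%)")
```

Note that a count of exactly 10 falls in (0,10], while 11 falls in (10,100], which matches the inclusive/exclusive convention described in the list above.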

A box-whisker plot is displayed below the bar chart. Each box on the x-axis is a sample, and the y-axis represents the feature counts. The box spans the 25th to 75th percentiles and the whiskers extend to the 10th and 90th percentiles. Hovering over a box displays detailed sample information:

Sample name
Range of read counts: “[” or “]” marks an inclusive bound and “(” or “)” marks an exclusive bound
Number of features within the read count range in the sample
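The box and whisker positions described above correspond to four percentiles per sample. A minimal sketch follows, assuming linear interpolation between ranks; the interpolation method used by the report is not stated, so values may differ slightly, and the function names are hypothetical.

```python
def percentile(values, p):
    """Linear-interpolation percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def box_stats(feature_counts):
    """Return the percentiles that define one sample's box and whiskers."""
    return {
        "p10": percentile(feature_counts, 10),  # lower whisker
        "q1": percentile(feature_counts, 25),   # bottom of the box
        "q3": percentile(feature_counts, 75),   # top of the box
        "p90": percentile(feature_counts, 90),  # upper whisker
    }


# Toy sample (hypothetical): feature counts 0, 1, ..., 100.
print(box_stats(list(range(101))))
```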

Single-cell QA/QC

The Single-cell QA/QC task in Connected Multiomics enables you to visualize several useful metrics that will help you include only high-quality cells. To invoke the Single-cell QA/QC task:

Click a Single cell counts data node
Click the QA/QC section of the task menu
Click Single cell QA/QC

By default, all samples are used to perform QA/QC. You can choose Split by sample in the Grouping option to perform QA/QC separately for each sample.

The Single cell QA/QC configuration dialog will prompt you to choose the genome assembly and annotation file.

Note that it is still possible to run the task without specifying an annotation file. If you choose not to specify an annotation file, the detection of mitochondrial counts will not be possible. The annotation file should be the same one used in the upstream analysis.

The Single cell QA/QC task report opens in a new data viewer session. Four dot and violin plots show the value of every cell on the canvas: counts per cell, detected features per cell, the percentage of mitochondrial counts per cell (when the annotation file contains genes on the MT chromosome), and the percentage of ribosomal counts per cell (human and mouse only).

If your cells do not express any mitochondrial genes or an appropriate annotation file was not specified, the plot for the percentage of mitochondrial counts per cell will be non-informative.

Mitochondrial genes are defined as genes located on a mitochondrial chromosome in the gene annotation file. The mitochondrial chromosome is identified in the gene annotation file by having "M" or "MT" in its chromosome name. If the gene annotation file does not follow this naming convention for the mitochondrial chromosome, Connected Multiomics will not be able to identify any mitochondrial genes.
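One possible reading of this naming convention is sketched below. It assumes that an optional "chr" prefix is ignored and that the remaining name must be "M" or "MT" (case-insensitive). The exact matching rule used by Connected Multiomics is not specified here, so treat the function as a hypothetical illustration only.

```python
def is_mito_chromosome(name):
    """Guess whether a chromosome name denotes the mitochondrial chromosome.

    Assumption: names such as "M", "MT", "chrM", and "chrMT" qualify.
    """
    stripped = name.strip()
    if stripped.lower().startswith("chr"):
        stripped = stripped[3:]
    return stripped.upper() in {"M", "MT"}


for chrom in ["chrM", "MT", "chr1", "X"]:
    print(chrom, is_mito_chromosome(chrom))
```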

Ribosomal genes are defined as genes that code for proteins in the large and small ribosomal subunits. Ribosomal genes are identified by searching their gene symbol against a list of 89 L & S ribosomal genes taken from HGNC. The search is case-insensitive and includes all known gene name aliases from HGNC. Ribosomal genes are identified independently of the gene annotation file specified.

Total counts are calculated as the sum of the counts for all features in each cell from the input data node. The number of detected features is calculated as the number of features in each cell with greater than zero counts. The percentage of mitochondrial counts is calculated as the sum of counts for known mitochondrial genes divided by the sum of counts for all features and multiplied by 100. The percentage of ribosomal counts is calculated as the sum of counts for known ribosomal genes divided by the sum of counts for all features and multiplied by 100.
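The four per-cell metrics can be sketched for a single cell as follows. The function name, the gene sets, and the toy counts are hypothetical, and returning 0.0 for a cell with zero total counts is an assumption rather than documented behavior.

```python
def cell_qc(counts, mito_genes, ribo_symbols):
    """Compute the four QC metrics for one cell.

    counts       -- dict mapping gene symbol to count for this cell
    mito_genes   -- genes located on the mitochondrial chromosome (hypothetical set)
    ribo_symbols -- upper-case ribosomal gene symbols (hypothetical set)
    """
    total = sum(counts.values())
    detected = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene in mito_genes)
    # Ribosomal symbols are matched case-insensitively.
    ribo = sum(c for gene, c in counts.items() if gene.upper() in ribo_symbols)

    def pct(part):
        # Assumption: report 0.0 when the cell has no counts at all.
        return 100.0 * part / total if total else 0.0

    return {
        "total_counts": total,
        "detected_features": detected,
        "pct_mito": pct(mito),
        "pct_ribo": pct(ribo),
    }


# Toy cell (hypothetical): 100 counts spread over four genes.
cell = {"MT-CO1": 5, "Rpl3": 10, "ACTB": 85, "GAPDH": 0}
print(cell_qc(cell, mito_genes={"MT-CO1"}, ribo_symbols={"RPL3"}))
```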

Each point on the plots is a cell. All cells from all samples are shown on the plots. The overlaid violins illustrate the distribution of cell values for the y-axis metric.

The appearance of a plot can be configured by selecting a plot and adjusting the Configure settings in the panel on the left. Here are some suggestions, but feel free to explore the other options available:

Open Axes and change the Y-axis scale to Logarithmic. This can be helpful to view the range of values better, although it is usually better to keep the Ribosomal counts plot in linear scale.

Within Style switch on Summary Box & Whiskers. Inspecting the median, Q1, Q3, upper 90%, and lower 10% quantiles of the distributions can be helpful in deciding appropriate thresholds.

High-quality cells can be selected using Select & Filter, which is pre-loaded with the selection criteria, one for each quality metric.

Hovering the mouse over one of the selection criteria reveals a histogram showing you the frequency distribution of the respective quality metric. The minimum and maximum thresholds can be adjusted by clicking and dragging the sliders or by typing directly into the text boxes for each selection criteria.

Alternatively, Pin histogram to view all of the distributions at one time to determine thresholds with ease.

Adjusting the selection criteria will select and deselect cells in all of the plots simultaneously. Depending on your settings, the deselected points will either be dimmed or gray. The filters are additive: combining multiple filters selects the intersection of the filters, so a cell must pass every criterion to remain selected. The number of cells selected is shown in the figure legend of each plot.
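The intersection behavior can be sketched as a check that every metric of a cell lies within its minimum and maximum threshold. The metric names, threshold values, and the inclusive bounds are assumptions made for this example.

```python
def select_cells(cells, thresholds):
    """Return IDs of cells that pass every (min, max) threshold.

    cells      -- dict mapping cell ID to a dict of metric name -> value
    thresholds -- dict mapping metric name to an inclusive (min, max) pair
    """
    selected = []
    for cell_id, metrics in cells.items():
        if all(lo <= metrics[name] <= hi for name, (lo, hi) in thresholds.items()):
            selected.append(cell_id)
    return selected


# Toy data (hypothetical): c2 fails the counts filter, c3 fails the mito filter.
cells = {
    "c1": {"counts": 5000, "features": 2000, "pct_mito": 3.0},
    "c2": {"counts": 300, "features": 150, "pct_mito": 2.0},
    "c3": {"counts": 6000, "features": 2500, "pct_mito": 25.0},
}
thresholds = {"counts": (500, 20000), "features": (200, 6000), "pct_mito": (0, 10)}
print(select_cells(cells, thresholds))
```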

To filter the high-quality cells, click the include selected cells icon in Filter in the top right of Select & Filter, and click Apply observation filter...

Select the input data node for the filtering task and click Select.

A new data node, Filtered counts, will be generated under the Analyses tab.

Double-click the Filtered counts data node to view the task report. The report includes a summary of the count distribution across all features for each sample; a detailed breakdown of the number of cells included in the filter for each sample; and the minimum and maximum values for each quality metric (expressed genes, total counts, etc.) across the included cells for each sample.

Cell barcode QA/QC

The Cell barcode QA/QC task lets you determine whether a given cell barcode is associated with a cell. This is an important QC step in all droplet-based single cell RNA-seq experiments, where all barcodes are sequenced.

To invoke Cell barcode QA/QC:

Click a Single cell counts data node
Click the QA/QC section of the task menu
Click Cell barcode QA/QC

The task can be performed with or without the EmptyDrops method enabled.

Cell Barcode QA/QC without EmptyDrops

To perform the task without the EmptyDrops method enabled, leave the checkbox unchecked and click Finish.

Note: This option is recommended for data imported from DRAGEN results, since barcodes with 0 counts are already filtered out.

The Cell barcode QA/QC task report is a plot. The x-axis shows the barcodes ranked by their UMI counts, and the y-axis shows the UMI count of each barcode. This type of plot is often referred to as a knee plot.

When the imported data are raw counts with no barcode filtering performed upstream, the knee plot is used to choose a cutoff point between barcodes that correspond to cells and barcodes that do not. Connected Multiomics automatically calculates an inflection point, shown by the vertical line on the graph. Barcodes designated as cells are shown in blue, while barcodes designated as without cells (background) are shown in grey.

The cutoff can be adjusted by dragging the vertical line across the graph or by using the text fields in the Filter panel on the left-hand side of the plot. Using the Filter panel, you can specify the number of cells or the percentage of reads in cells and the cutoff point will be adjusted to match your criteria. The number of cells and the percentage of counts in cells are updated as the cutoff point is changed. To return to the automatically calculated cutoff, click Reset sample filter.

The percentage of counts in cells and median counts per cell are useful technical quality metrics that can be consulted when optimizing sample handling, cell isolation techniques, and library preparation.
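To make the relationship between the cutoff and these two metrics concrete, the sketch below ranks barcodes by UMI count, keeps the top barcodes as cells, and reports the percentage of counts in cells and the median counts per cell. It takes the cutoff as a given number of cells and does not attempt the automatic inflection point calculation, whose method is not described here. The function name and barcodes are hypothetical.

```python
from statistics import median


def knee_summary(umi_counts, n_cells):
    """Summarize a knee-plot cutoff placed after the top `n_cells` barcodes.

    umi_counts -- dict mapping barcode to its total UMI count
    """
    ranked = sorted(umi_counts.items(), key=lambda item: item[1], reverse=True)
    cells = ranked[:n_cells]
    total = sum(umi_counts.values())
    in_cells = sum(count for _, count in cells)
    return {
        "cell_barcodes": [barcode for barcode, _ in cells],
        "pct_counts_in_cells": 100.0 * in_cells / total if total else 0.0,
        "median_counts_per_cell": median(count for _, count in cells) if cells else 0,
    }


# Toy sample (hypothetical): three cell-like barcodes and three background ones.
umis = {"AAAC": 900, "AAAG": 700, "AACC": 400, "AACG": 5, "AAGT": 3, "AATT": 2}
print(knee_summary(umis, n_cells=3))
```

Moving the cutoff (changing `n_cells`) changes both metrics, which mirrors how the values in the Filter panel update as the vertical line is dragged.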

One knee plot is generated for each sample. In projects with multiple samples, Next and Back buttons will appear at the top left of the plot, to enable navigation between sample knee plots. Manual filters must be set separately for each sample. This is typically used when the user expects a certain number of cells to be processed, like in experiments where droplets were loaded with a predefined number of cells.

To return to the knee plot view, click Back to filter. To apply the filter and run the Filter barcodes task, click Apply filter. A Filtered counts data node will be generated.

Cell Barcode QA/QC with EmptyDrops

If your data has already been filtered to remove barcodes with low total counts, this method will not be suitable. This method requires empty barcodes to be present in the single cell count matrix, in order to estimate the ambient RNA profile.

The EmptyDrops method (1) uses a statistical test to distinguish barcodes that correspond to real cells from those that correspond to empty droplets. An ambient RNA expression profile is estimated from barcodes below a specified total UMI count threshold, using the Good-Turing algorithm. The expression profile of each barcode above the low-count threshold is then tested for deviations from the ambient profile. Real cells are expected to have a low p-value, indicating a significant deviation from the expected background noise level. False discovery rate (FDR) correction is applied to all the p-values, and barcodes with a corrected value equal to or below the specified FDR level are detected as real cells. This can allow for the detection of additional cells that would otherwise be discarded due to a low total UMI count.
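The FDR step can be illustrated with a small sketch. It assumes the Benjamini-Hochberg procedure, which is the correction described for EmptyDrops in the cited publication; the text above does not name the procedure used by Connected Multiomics, and the p-values below are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def call_cells(p_values, fdr_threshold):
    """Flag barcodes whose adjusted p-value is at or below the FDR threshold."""
    return [q <= fdr_threshold for q in benjamini_hochberg(p_values)]


# Toy p-values (hypothetical), one per barcode above the ambient threshold.
p_values = [0.01, 0.04, 0.03, 0.5]
print(benjamini_hochberg(p_values))
print(call_cells(p_values, fdr_threshold=0.05))
```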

In addition, a knee point threshold will be calculated to identify cells with a very high total UMI count. It's possible that some barcodes with a high total UMI count will not pass the EmptyDrops significance test. This could be due to biases in the ambient RNA profile, leading to a non-significant difference between a barcode's expression profile vs the ambient profile. To protect against this issue, it is advisable to use the EmptyDrops results in conjunction with the knee point filter, on the assumption that barcodes with a very high total UMI count will always correspond to real cells. Note, the knee point will be more conservative than the inflection point calculated by Connected Multiomics when the EmptyDrops method is not enabled.

To perform the task with the EmptyDrops method, check the checkbox, configure the additional options, and click Finish.

Ambient count threshold

Barcodes with a total UMI count equal to or below this threshold will be used to create the ambient RNA expression profile to estimate background noise. The default is set to 100, which is reasonable for most data.

FDR threshold

Barcodes with an FDR value equal to or below this threshold show a significant deviation from the ambient profile and can therefore be considered real cells. Increasing this value will result in more cells, but will also increase the number of potential false positives.

Random generator seed

This is used for performing Monte Carlo simulations to determine p-values. To reproduce results, use the same random seed for all runs.
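Why the seed matters can be shown with a generic Monte Carlo p-value sketch. The null model below (heads in 20 fair coin flips) is a toy stand-in and is not the EmptyDrops test statistic; the function names are hypothetical. The p-value uses the common (r + 1) / (n + 1) form, where r is the number of simulated values at least as extreme as the observed one.

```python
import random


def monte_carlo_p_value(observed, simulate, n_iter, seed):
    """Estimate a p-value by simulating the null model `n_iter` times."""
    rng = random.Random(seed)  # dedicated generator, fully determined by the seed
    as_extreme = sum(1 for _ in range(n_iter) if simulate(rng) >= observed)
    return (as_extreme + 1) / (n_iter + 1)


def toy_null(rng):
    """Hypothetical null model: number of heads in 20 fair coin flips."""
    return sum(rng.random() < 0.5 for _ in range(20))


p_first = monte_carlo_p_value(15, toy_null, n_iter=1000, seed=42)
p_again = monte_carlo_p_value(15, toy_null, n_iter=1000, seed=42)
p_other = monte_carlo_p_value(15, toy_null, n_iter=1000, seed=7)
print(p_first == p_again)  # same seed, same estimate
print(p_first, p_other)    # a different seed may give a slightly different estimate
```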

There are additional metrics on the left of the plot in the report.

The number of actual cells detected by the EmptyDrops test and the knee point filter are shown above the Venn diagram on the left. In the example plot, 3,189 barcodes are above the knee point filter (represented by the vertical blue line on the plot) and 2,657 barcodes passed the significance test in EmptyDrops. The overlap between these sets of barcodes is represented by the Venn diagram. There are 1,583 barcodes that pass the significance test in EmptyDrops and have a high total UMI count above the knee point filter; 1,606 barcodes have a very high total UMI count with no significant difference from the ambient profile in EmptyDrops; and 1,074 barcodes fall below the knee point but are still significantly different from the ambient profile.

The number of cells included by the knee point filter can be adjusted either by clicking on the plot to change the position of the vertical blue line or by typing a different number of cells into the text box on the left.

The total number of cells is shown in the text box on the left. By default, this will be all of the cells detected by the knee point filter plus the extra cells detected by EmptyDrops. In the example, there are 3,189 cells with a high total UMI count plus the additional 1,074 cells from EmptyDrops (total = 4,263).

Different sections of the Venn diagram can be selected/deselected to include/exclude barcodes. For example, clicking the '1,606' section of the Venn diagram will deselect those barcodes. Now, the only cells that will pass the filter will be the significant ones from EmptyDrops.
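The counts in this example follow from simple set arithmetic on the three numbers reported in the text, which the following sketch reproduces. The variable names are chosen for illustration only.

```python
# Numbers taken from the example in the text.
above_knee = 3189          # barcodes above the knee point filter
passed_emptydrops = 2657   # barcodes significant in EmptyDrops
both = 1583                # overlap of the two sets

knee_only = above_knee - both               # high UMI count, not significant
emptydrops_only = passed_emptydrops - both  # significant, below the knee point

# Default total: everything above the knee point plus the extra EmptyDrops cells.
total_cells = above_knee + emptydrops_only

# Deselecting the knee-only section leaves only the EmptyDrops-significant barcodes.
after_deselect = total_cells - knee_only

print(knee_only, emptydrops_only, total_cells, after_deselect)
```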

References

Lun, A., Riesenfeld, S., Andrews, T. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 2019; 20: 63.

5-base Methylation QC

The 5-base methylation QC task in Connected Multiomics enables you to visualize sample-level QC metrics that describe read mapping quality and CpG methylation calling. The QC metrics are extracted from the DRAGEN analysis metric files that were ingested into the study as required files for 5-base DNA Prep data analysis in Connected Multiomics. To invoke the 5-base methylation QC task:

On the Analyses page, click the 5-base Methylation node.
Click the QA/QC section in the context-sensitive task menu on the right.
Click 5-base methylation QC.

No parameter settings are required for the 5-base methylation QC task. After you click the 5-base methylation QC task in the context-sensitive task menu, a task node called 5-base methylation QC report is initiated. When it has completed, double-click the 5-base methylation QC report task node to open the QC report in a data viewer. The QC report consists of plots and tables organized in 2 sheets. Click a sheet name at the bottom of the data viewer to navigate from one sheet to another.

Metrics

The Metrics sheet shows sample-level QC metric plots. Each sample is a data point, and the points are randomly spread out along the x-axis. The QC metric is represented by the y-axis. Each plot is overlaid with a violin plot to show the distribution of the QC metric.

Percent methylation in samples: Percentages of CpG methylation in samples.
Percent methylation in unmethylated control: Percentage of CpG methylation in the unmethylated control (lambda). Low value indicates good quality.
Percent methylation in methylated control: Percentage of CpG methylation in the methylated control (pUC19). High value indicates good quality.
Percent duplicate reads: Percentage of reads marked as duplicates, which result from PCR amplification.
Percent mapped reads: Percentage of mapped reads, indicating the alignment rate.
Average autosomal coverage: Mean autosomal coverage across the whole genome. Higher coverage means that the methylated/unmethylated counts more accurately reflect the true methylation level at any particular site.
QC metrics table: Text representations of the QC metrics plots.

In the Metrics sheet, samples can be selected using Selection > Select & Filter. The Select & Filter dialog is pre-loaded with the selection criteria, one for each QC metric.

Hovering the mouse over one of the selection criteria reveals a histogram showing you the frequency distribution of the respective QC metric. The minimum and maximum thresholds can be adjusted by clicking and dragging the sliders or by typing directly into the text boxes for each selection criteria.

Adjusting the selection criteria will select and deselect samples in all 6 plots simultaneously. Depending on your settings, the deselected points will either be dimmed or gray. The filters are additive: combining multiple filters selects the intersection of the filters. The number of samples selected is shown in the figure legend of each plot.

To filter the dataset to the selected samples, click the include selected points icon in Filter in the top right of Select & Filter, and click Apply observation filter...

Select the input data node for the filtering task and click Select.

A new data node, Filtered samples, will be generated under the Analyses tab.

M-bias

The M-bias sheet shows M-bias plots for methylation level and coverage across positions on read 1 and read 2. The M-bias should be consistent across all positions. It is common for the first/last 10 bases to have uneven methylation due to end-repair and sequencing artifacts.

All plots on one data viewer screen can be downloaded to your local computer as a single image by clicking the Export button at the top of the screen. To download an individual plot, select the plot, click the Plot button in the left panel within the plot, then click Export and follow the wizard to set the image file format, image size, and resolution.
