pegasus.aggregate_matrices

pegasus.aggregate_matrices(csv_file, restrictions=[], attributes=[], default_ref=None, append_sample_name=True, select_singlets=False, remap_string=None, subset_string=None, min_genes=None, max_genes=None, min_umis=None, max_umis=None, mito_prefix=None, percent_mito=None)

Aggregate channel-specific count matrices into one big count matrix.

This function takes as input a csv_file, which contains at least 2 columns: Sample, the sample name, and Location, the file that contains the count matrices (e.g. filtered_gene_bc_matrices_h5.h5). It merges matrices from the same genome together. If multiple modalities exist, a third Modality column might be required. An aggregated MultimodalData object is returned.

If csv_file is a dictionary, it should contain 2 keys: Sample for sample names, and Object for MultimodalData objects. All values in the dictionary must be lists of the same length. In this case, the objects are merged into one data object, and aggregate_matrices makes copies instead of modifying the input objects.

The csv_file can optionally contain two more columns, nUMI and nGene. These columns define the minimum number of UMIs and genes used for cell selection in each sample. Their values override the min_umis and min_genes arguments.
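As a minimal sketch of the sample sheet described above, the following writes a CSV with the Sample and Location columns plus the optional nUMI and nGene columns, using only the Python standard library. The sample names, file paths and threshold values are hypothetical and chosen only for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample names, file locations and per-sample thresholds.
rows = [
    {"Sample": "sample1", "Location": "sample1/filtered_gene_bc_matrices_h5.h5", "nUMI": 500, "nGene": 200},
    {"Sample": "sample2", "Location": "sample2/filtered_gene_bc_matrices_h5.h5", "nUMI": 1000, "nGene": 300},
]

# Write the sheet to a temporary directory so the sketch has no side effects.
sheet_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(sheet_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["Sample", "Location", "nUMI", "nGene"])
    writer.writeheader()
    writer.writerows(rows)

# Read the sheet back to confirm its layout.
with open(sheet_path, newline="") as handle:
    sheet = list(csv.DictReader(handle))
```

The resulting path could then be passed as the csv_file argument, provided the files listed under Location actually exist.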

Parameters
  • csv_file (str) – The CSV file containing information about each channel. Alternatively, a dictionary or pd.DataFrame can be passed.

  • restrictions (list[str] or str, optional (default: [])) – A list of restrictions used to select channels. Each restriction takes the format name:value,…,value or name:~value,…,value, where ~ means “not”. If only one restriction is provided, it can be passed as a string instead of a list.

  • attributes (list[str] or str, optional (default: [])) – A list of attributes that need to be incorporated into the output count matrix. If only one attribute is provided, it can be passed as a string instead of a list.

  • default_ref (str, optional (default: None)) – Default reference name to use. If there is no Reference column in the csv_file, a Reference column will be added with default_ref as its value. This argument can also be used for replacing genome names. For example, if default_ref is ‘hg19:GRCh38,GRCh38’, we will change any genome with name ‘hg19’ to ‘GRCh38’ and if no genome is provided, ‘GRCh38’ is the default.

  • append_sample_name (bool, optional (default: True)) – By default, append sample_name to each channel. Turn this option off if each channel has distinct barcodes.

  • select_singlets (bool, optional (default: False)) – If we have demultiplexed data, turning on this option will make pegasus only include barcodes that are predicted as singlets.

  • remap_string (str, optional, default: None) – Remap singlet names using <remap_string>, which takes the format “new_name_i:old_name_1,old_name_2;new_name_ii:old_name_3;…”. For example, if we hashed 5 libraries from 3 samples (sample1_lib1, sample1_lib2, sample2_lib1, sample2_lib2 and sample3), we can remap them to 3 samples using this string: “sample1:sample1_lib1,sample1_lib2;sample2:sample2_lib1,sample2_lib2”. The new singlet names are then stored in the metadata field with key ‘assignment’, while the old names are kept in the metadata field with key ‘assignment.orig’.

  • subset_string (str, optional, default: None) – If select_singlets is on, only select singlets listed in <subset_string>, which takes the format “name1,name2,…”. Note that if remap_string is specified, subsetting happens after remapping. For example, we can select only singlets from samples 1 and 3 using “sample1,sample3”.

  • min_genes (int, optional, default: None) – Only keep cells with at least min_genes genes.

  • max_genes (int, optional, default: None) – Only keep cells with fewer than max_genes genes.

  • min_umis (int, optional, default: None) – Only keep cells with at least min_umis UMIs.

  • max_umis (int, optional, default: None) – Only keep cells with fewer than max_umis UMIs.

  • mito_prefix (str, optional, default: None) – Prefix for mitochondrial genes.

  • percent_mito (float, optional, default: None) – Only keep cells whose mitochondrial genes account for less than percent_mito % of total counts. The mitochondrial filter is triggered only when both mito_prefix and percent_mito are set.
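
The string formats accepted by restrictions and remap_string can be illustrated with a small sketch. The helpers below (parse_restriction, channel_selected, parse_remap_string) are hypothetical and are not part of pegasus; they only show one plausible reading of the formats described above, assuming that names absent from the remap string keep their original value.

```python
def parse_restriction(restriction):
    """Split 'name:value,...' or 'name:~value,...' into (name, negate, values)."""
    name, _, rest = restriction.partition(":")
    negate = rest.startswith("~")  # a leading ~ means "not"
    if negate:
        rest = rest[1:]
    return name, negate, rest.split(",")


def channel_selected(channel, restriction):
    """Return True if a channel (a dict of column values) satisfies the restriction."""
    name, negate, values = parse_restriction(restriction)
    matched = channel[name] in values
    return not matched if negate else matched


def parse_remap_string(remap_string):
    """Map every old singlet name to its new name."""
    mapping = {}
    for group in remap_string.split(";"):
        new_name, _, old_names = group.partition(":")
        for old_name in old_names.split(","):
            mapping[old_name] = new_name
    return mapping


# Hypothetical channel taken from a sample sheet row.
channel = {"Source": "pbmc", "Donor": "1"}
keep_pbmc = channel_selected(channel, "Source:pbmc")
keep_other_donors = channel_selected(channel, "Donor:~1,2")

remap = parse_remap_string(
    "sample1:sample1_lib1,sample1_lib2;sample2:sample2_lib1,sample2_lib2"
)
# Assumption: names not listed in the remap string, such as sample3, stay unchanged.
new_names = [remap.get(name, name) for name in ["sample1_lib2", "sample2_lib1", "sample3"]]
```

In this reading, the channel above passes “Source:pbmc” but is excluded by “Donor:~1,2”, and the five hashed libraries collapse to sample1, sample2 and sample3.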

Returns

The aggregated count matrix as a MultimodalData object.

Return type

MultimodalData object.

Examples

>>> import pegasus as pg
>>> data = pg.aggregate_matrices('example.csv', restrictions=['Source:pbmc', 'Donor:1'], attributes=['Source', 'Platform', 'Donor'])
>>> data = pg.aggregate_matrices({'Sample': ['sample1', 'sample2'], 'Object': [data1, data2]})
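
The cell filtering thresholds listed under Parameters can be summarized with a short sketch. The function passes_filters below is hypothetical and is not the pegasus implementation; it assumes the bounds are read literally from the parameter descriptions (minimums inclusive, maximums exclusive) and that the per-cell gene count, UMI count and mitochondrial percentage have already been computed.

```python
def passes_filters(n_genes, n_umis, pct_mito,
                   min_genes=None, max_genes=None,
                   min_umis=None, max_umis=None,
                   mito_prefix=None, percent_mito=None):
    """Return True if a cell with the given statistics would be kept."""
    if min_genes is not None and n_genes < min_genes:
        return False  # needs at least min_genes genes
    if max_genes is not None and n_genes >= max_genes:
        return False  # needs fewer than max_genes genes
    if min_umis is not None and n_umis < min_umis:
        return False  # needs at least min_umis UMIs
    if max_umis is not None and n_umis >= max_umis:
        return False  # needs fewer than max_umis UMIs
    # The mitochondrial filter applies only when both arguments are set.
    if mito_prefix is not None and percent_mito is not None:
        if pct_mito >= percent_mito:
            return False
    return True


# Hypothetical cell statistics, for illustration only.
kept = passes_filters(500, 1200, 5.0, min_genes=500, max_genes=6000)
dropped = passes_filters(6000, 1200, 5.0, min_genes=500, max_genes=6000)
```

Note that passing percent_mito without mito_prefix leaves the mitochondrial filter inactive in this sketch, mirroring the requirement that both be set.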