pegasus.aggregate_matrices

pegasus.aggregate_matrices(csv_file, restrictions=[], attributes=[], default_ref=None, append_sample_name=True, select_singlets=False, remap_string=None, subset_string=None, min_genes=None, max_genes=None, min_umis=None, max_umis=None, mito_prefix=None, percent_mito=None)

Aggregate channel-specific count matrices into one big count matrix.

This function takes a csv_file with at least two columns: Sample, the sample name, and Location, the file that contains the count matrices (e.g. filtered_gene_bc_matrices_h5.h5). It merges matrices from the same genome together. If the data contain multiple modalities, a third column, Modality, might be required. An aggregated MultimodalData object is returned.
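As a minimal sketch of the expected input, the snippet below writes a sample sheet with the two required columns plus a few optional attribute columns. Only the Sample and Location column names come from the description above; the sample names, file paths, the Source/Platform/Donor values, and the write_sample_sheet helper are hypothetical placeholders.

```python
import csv
import io

# Hypothetical channels: sample names and file paths are placeholders.
rows = [
    {"Sample": "sample_1", "Location": "/data/sample_1/filtered_gene_bc_matrices_h5.h5",
     "Source": "pbmc", "Platform": "10x", "Donor": "1"},
    {"Sample": "sample_2", "Location": "/data/sample_2/filtered_gene_bc_matrices_h5.h5",
     "Source": "pbmc", "Platform": "10x", "Donor": "2"},
]


def write_sample_sheet(rows, handle):
    """Write rows as CSV with Sample and Location as the first two columns."""
    fieldnames = ["Sample", "Location"] + [
        key for key in rows[0] if key not in ("Sample", "Location")
    ]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_sample_sheet(rows, buffer)
sample_sheet_text = buffer.getvalue()
```

Extra columns such as Source or Donor are what the restrictions and attributes parameters refer to by name.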

Parameters
  • csv_file (str) – The CSV file containing information about each channel. Alternatively, a dictionary or pd.DataFrame can be passed.

  • restrictions (list[str] or str, optional (default: [])) – A list of restrictions used to select channels. Each restriction takes the format name:value,…,value or name:~value,…,value, where ~ means "not". A single restriction can be passed as a string instead of a list.

  • attributes (list[str] or str, optional (default: [])) – A list of attributes to incorporate into the output count matrix. A single attribute can be passed as a string instead of a list.

  • default_ref (str, optional (default: None)) – Default reference name to use. If there is no Reference column in the csv_file, a Reference column will be added with default_ref as its value. This argument can also be used for replacing genome names. For example, if default_ref is ‘hg19:GRCh38,GRCh38’, we will change any genome with name ‘hg19’ to ‘GRCh38’ and if no genome is provided, ‘GRCh38’ is the default.

  • append_sample_name (bool, optional (default: True)) – By default, append sample_name to each channel. Turn this option off if each channel has distinct barcodes.

  • select_singlets (bool, optional (default: False)) – If we have demultiplexed data, turning on this option will make pegasus only include barcodes that are predicted as singlets.

  • remap_string (str, optional, default: None) – Remap singlet names using <remap_string>, which takes the format “new_name_i:old_name_1,old_name_2;new_name_ii:old_name_3;…”. For example, if we hashed 5 libraries from 3 samples (sample1_lib1, sample1_lib2, sample2_lib1, sample2_lib2 and sample3), we can remap them to 3 samples using this string: “sample1:sample1_lib1,sample1_lib2;sample2:sample2_lib1,sample2_lib2”. The new singlet names are then stored in the metadata field with key ‘assignment’, while the old names are kept in the metadata field with key ‘assignment.orig’.

  • subset_string (str, optional, default: None) – If select_singlets is on, only select singlets listed in <subset_string>, which takes the format “name1,name2,…”. Note that if remap_string is specified, subsetting happens after remapping. For example, we can select only singlets from samples 1 and 3 using “sample1,sample3”.

  • min_genes (int, optional, default: None) – Only keep cells with at least min_genes genes.

  • max_genes (int, optional, default: None) – Only keep cells with fewer than max_genes genes.

  • min_umis (int, optional, default: None) – Only keep cells with at least min_umis UMIs.

  • max_umis (int, optional, default: None) – Only keep cells with fewer than max_umis UMIs.

  • mito_prefix (str, optional, default: None) – Prefix for mitochondrial genes.

  • percent_mito (float, optional, default: None) – Only keep cells whose mitochondrial genes account for less than percent_mito % of total counts. The mitochondrial filter is triggered only when both mito_prefix and percent_mito are set.
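
The restrictions and remap_string formats described above can be illustrated with a small parser. This is only a sketch of how such strings might be interpreted, not pegasus's own implementation; parse_restriction, channel_passes, and parse_remap_string are hypothetical helper names, and a channel is assumed to be a plain dict of column name to value.

```python
def parse_restriction(restriction):
    """Split 'name:value,...' or 'name:~value,...' into (name, negate, values)."""
    name, _, content = restriction.partition(":")
    negate = content.startswith("~")
    if negate:
        content = content[1:]
    return name, negate, set(content.split(","))


def channel_passes(channel, restrictions):
    """Return True if a channel (dict of column -> value) meets every restriction."""
    for restriction in restrictions:
        name, negate, values = parse_restriction(restriction)
        matched = channel[name] in values
        # A plain restriction needs a match; a '~' restriction needs no match.
        if matched == negate:
            return False
    return True


def parse_remap_string(remap_string):
    """Map each old singlet name to its new name."""
    mapping = {}
    for group in remap_string.split(";"):
        new_name, _, old_names = group.partition(":")
        for old_name in old_names.split(","):
            mapping[old_name] = new_name
    return mapping


channel = {"Source": "pbmc", "Donor": "1"}
keep = channel_passes(channel, ["Source:pbmc", "Donor:~2,3"])
remap = parse_remap_string(
    "sample1:sample1_lib1,sample1_lib2;sample2:sample2_lib1,sample2_lib2"
)
```

In this sketch, names that do not appear in the remap string (sample3 in the documented example) are simply absent from the mapping, on the assumption that they keep their original name.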

Returns

The aggregated count matrix as a MultimodalData object.

Return type

MultimodalData object.

Examples

>>> import pegasus
>>> data = pegasus.aggregate_matrices('example.csv', restrictions=['Source:pbmc', 'Donor:1'], attributes=['Source', 'Platform', 'Donor'])