python_bioinformagicks.tools package#
Module contents#
- calc_jasmine_score(adata: AnnData, genes: list[str], layer: str = None, recalculate_rank: bool = False, save_rank: bool = True)[source]#
Calculates the JASMINE gene set score, originally proposed in Noureen et al. 2022.
Shows less (but non-zero) dependence on gene detection rate than alternative methods for gene set scoring in scRNA-seq data.
Parameters#
- adata: ad.AnnData
The AnnData object
- genes: list of str
The keys in adata.var.index to score
- layer: str (default: None)
The key in adata.layers to pull the gene expression matrix from. If None, defaults to adata.X
- recalculate_rank: bool (default: False)
If ranks_nonzero is in adata.layers.keys(), then recalculate the gene ranks before calculating JASMINE score.
- save_rank: bool (default: True)
Write the (sparse matrix) result of gene ranking to adata.layers[“ranks_nonzero”] to remove the need for recomputation.
Returns#
- jasmine_score: np.ndarray
1D array of len(adata.obs.index) of gene set scores, min-max normalized to [0,1].
References#
Usage#
>>> genes_of_interest = ["Sftpc", "Lamp3", "Napsa"] >>> adata.obs["AT2 score"] = calc_jasmine_score(adata, genes_of_interest) >>> adata.obs.groupby("celltype")["AT2 score"].mean().idxmax() 'AT2'
- call_scSNP(cell_barcodes: list[str], bam_file: AlignmentFile | str, contig: str, position: int, alt: str, threshold: int = 9, cb_tag: str = 'CB', return_stats: bool = False)[source]#
With a list of valid cell barcodes as reference, scan a BAM file for reads aligning to a given genomic position and perform variant calling at that position for each cell.
This function is crude and useful only for single point mutation calling, i.e. “I know there is a mutation this position; tell me which cells have it.” Consider using more comprehensive tools such as cellsnp-lite.
Parameters#
- cell_barcodes: list of str
A list of valid cell barcodes
- bam_file: pysam.AlignmentFile or str
An instantiated pysam.AlignmentFile object representing an open, indexed BAM file. If str, function will open using: f = pysam.AlignmentFile(bam_file, “rb”)
- contig: str
The contig containing the genomic position of interest.
- position: int > 0
The 1-indexed position on the contig of the genomic position of interest.
- alt: str
The alternate nucleotide. Number of reads with this alt nucleotide will be compared to total number of reads to perform variant calling. One of [“A”, “G”, “T”, “C”].
- threshold: int (default: 9)
The minimum number of reads covering the region of needed to perform variant calling. If #reads < threshold, call NA. If #reads(ALT) < (threshold/3), call WT. Else, call alt allele.
- cb_tag: str (default: “CB”)
The tag in the BAM file that stores the (corrected) cell barcodes. Usually one of [“CB”, “CR”].
- return_stats: bool (default: False)
If True, return dictionary of variant calling statistics, including per-cell allele frequencies.
Returns#
- variant_calls: list of str
List corresponding 1-to-1 with list of cell barcodes.
- stats: dict (when return_stats is True)
Dictionary of variant calling statistics
References#
- do_gprofiler_analysis(genes: list[str], organism: str = 'hsapiens', max_term_size: int = 10000, ordered: bool = False, sources: list[str] = ['GO:BP', 'GO:MF', 'REAC', 'KEGG', 'TF'], highlight: bool = True)[source]#
Given an ordered list of genes, performs an over-representation analysis (ORA) using the gProfiler API.
Parameters#
- genes: list of str
List of genes to query, optionally sorted by user based on FDR or another significance metric.
- organism: str (default: “hsapiens”)
The organism ID to use for the query. See https://biit.cs.ut.ee/gprofiler/page/organism-list for a list of organism IDs supported by gProfiler. Commonly one of: [“hsapiens”, “mmusculus”]
- max_term_size: uint (default: 10000)
Filter results table to only include those results with term sizes less than this maximum. Set to None to disable.
- ordered: bool (default: False)
If the list of genes is ordered by descending significance, set to True, otherwise set to False. A slightly different ORA is performed on ordered gene lists.
- sources: list of str (default: [“GO:BP”, “GO:MF”, “REAC”, “KEGG”, “TF”])
The list of source databases to consider for gProfiler ORA. Some may only be available for certain organisms; see organism list page in the gProfiler documentation for more information.
- highlight: bool (default: False)
If True, request that gProfiler add a ‘highlighted’ column to the resulting dataframe indicating if a given term is a ‘driver’ term. See: https://biit.cs.ut.ee/gprofiler/page/docs#highlight_go
Returns#
- ora_df: pandas.DataFrame
The resulting gProfiler results dataframe.
References#
Usage#
>>> diff_exp_genes = ["SFTPC", "LAMP3", "NAPSA", "SFTPB", "EPCAM", "COL1A1"] >>> ora_df = do_gprofiler_analysis(diff_exp_genes, ordered=False) >>> print(ora_df.head(1)[["source", "name", "p_value", "intersections"]]) source name p_value intersections 0 REAC Surfactant metabolism 0.000033 [SFTPC, NAPSA, SFTPB]
- scale_by_group(adata: AnnData, groupby: str = 'celltype', layer: str = None, copy: bool = True, zero_center: bool = False)[source]#
Performs z-standard scaling, similar to
sc.pp.scale, but for each category inadata.obs[groupby]independently.For instance, if
groupby == "celltype", the resulting data matrix of z-scaled expression values would highlight changes from the mean of all cells within each celltype (but changes between celltypes would be largely meaningless).Parameters#
adata: ad.AnnData
- groupby: str (default: “celltype”)
The categorical column in adata.obs containing groups to subset and standard scale within.
- layer: str (default: None)
The layer key in adata.layers whose values should be scaled (i.e. the starting matrix). If None, use
adata.X.- zero_center: bool (default: False)
As in
sc.pp.scale; when True, subtract the mean gene expression within each group. When False, the sparse structure of the gene expression data remains.- copy: bool (default: True)
If True, return a copy of the resulting scaled data matrix. If False, modify input adata by placing resulting matrix in
adata.layers[scaled_by_ + str(groupby)].
Returns#
If
copy is True, return the scaled matrix.If
copy is False, scaled matrix is placed inadata.layers["scaled_by_" + str(groupby)], which modifiesadatain-place.If
zero_center is False, resulting scaled matrix is of typescipy.sparse.csr_matrix, else result is of typenp.ndarray(dense).
- subset_by_coordinates(adata: AnnData, x_bounds: list = [], y_bounds: list = [], use_rep: str = 'X_umap')[source]#
Given a bounding box in 2D space, return a cell mask containing only cells inside of that box.
Useful for subsetting data using a UMAP/other projection based on location rather than categorical/quantitative conditions (i.e. “I want cells from that bottom-right corner cluster only.”)
TODO: generalize to other/arbitrary bounding shapes
Parameters#
- adata: ad.AnnData
The parent anndata object to subset
- x_bounds: list/tuple (default: [])
The minimum and maximum bounds of the bounding box along the x-axis, i.e. [x_min, x_max]. Expressed as a fraction of the axis’ actual minimum and maximum span, i.e. in the range [0,1].
- y_bounds: list/tuple (default: [])
As for x_bounds, but along the y-axis
- use_rep: str (default: “X_umap”)
The projection to subset on. Only the first 2 dimensions of this projection are used during subsetting.
Returns#
- mask: np.ndarray
The boolean mask on length adata.n_obs where cells within the bounding area are given value True, else False.
- subset_by_geosketching(adata: AnnData, n_cells_to_keep: int = None, frac_cells_to_keep: float = 0.33, use_rep: str = 'X_pca')[source]#
Subsamples a single-cell dataset using geometric sketching to more fairly represent rare and common cell types
Defaults to keeping a fraction of cells. To use an exact target cell number, specify n_cells_to_keep, which will take priority over frac_cells_to_keep.
References:
Parameters#
- adata: ad.AnnData
The original anndata object, with use_rep calculated and stored in the
adata.obsm[use_rep]slot.- n_cells_to_keep: int (default: None)
The number of cells to keep in the subsampled adata object. Must be less than
len(adata.obs.index).- frac_cells_to_keep: float (0,1)
Fraction of cells to keep; overridden by n_cells_to_keep when not None.
- use_rep: str (default: “X_pca”)
The latent space representation to use. Must be a key in
adata.obsm.
Returns#
- mask: list[bool]
A boolean mask of length len(adata.obs.index) where True indicates which cells to keep after geosketching.
- tf_idf_markers(adata: AnnData, groupby: str, n_genes: int = 10, n_genes_to_plot: int = 10, redo_threshold: bool = False, dendrogram: bool = False, save_markers: bool = False, return_fig: bool = False, var_key_to_ignore: str = None, **kwargs)[source]#
Performs term-frequency, inverse-document frequency calculation in order to rank marker genes across groups in groupby. This is an alternative to mean expression based DEG testing that prioritizes uniqueness of genes over fold-change in expression across groups.
TODO: make compatible with
scanpy.tl.rank_genes_groupsParameters#
- adata: ad.AnnData
The anndata object for this experiment
- groupby: str
The column in adata.obs that we will split the data on for the TF-IDF test.
- n_genes: int > 0 (default: 10)
The number of top genes per group to return in the marker gene dictionary
- n_genes_to_plot: int in range (0, n_genes] (default: 10)
The number of genes per group to plot in the resulting dotplot. Set to 0 to disable plot generation.
- redo_threshold: bool (default: False)
If
adata.layers["binary"]exists, redo the calculation of the binary matrix. Possibly useful when performing TF-IDF ranking on a subset of adata.- save_markers: bool (default: False)
If True, store the marker gene matrix in the adata object as
adata.uns["TF_IDF_" + groupby]- return_fig: bool (default: False)
If True, output of this function changes to a tuple of dict, dpl (dotplot object from scanpy) instead of just dict. Otherwise, dotplot is generated and shown but not returned from this function.
- var_key_to_ignore: str (default: None)
The name of a boolean column in adata.var which indicates genes that should be ignored during marker gene testing (often mitochondrial genes, pseudogenes, ribosomal genes, etc.). If None, no gene masking is performed.
- kwargs:
Arguments passed to
sc.pl.dotplot(if n_genes_to_plot>0)
Returns#
- marker_dict: dict
The marker gene dictionary. Keys are
adata.obs[groupby].unique()- dpl: sc.pl.DotPlot (optional)
If
plot is Trueandreturn_fig is True, scanpy DotPlot object is also returned for further processing/saving.
References#