python_bioinformagicks.tools package#

Module contents#

calc_jasmine_score(adata: AnnData, genes: list[str], layer: str = None, recalculate_rank: bool = False, save_rank: bool = True)[source]#

Calculates the JASMINE gene set score, originally proposed in Noureen et al. 2022.

Shows less (but non-zero) dependence on gene detection rate than alternative methods for gene set scoring in scRNA-seq data.

Parameters#

adata: ad.AnnData

The AnnData object

genes: list of str

The keys in adata.var.index to score

layer: str (default: None)

The key in adata.layers to pull the gene expression matrix from. If None, defaults to adata.X

recalculate_rank: bool (default: False)

If ranks_nonzero is in adata.layers.keys(), then recalculate the gene ranks before calculating JASMINE score.

save_rank: bool (default: True)

Write the (sparse matrix) result of gene ranking to adata.layers[“ranks_nonzero”] to remove the need for recomputation.

Returns#

jasmine_score: np.ndarray

1D array of len(adata.obs.index) of gene set scores, min-max normalized to [0,1].

References#

Usage#

>>> genes_of_interest = ["Sftpc", "Lamp3", "Napsa"]
>>> adata.obs["AT2 score"] = calc_jasmine_score(adata, genes_of_interest)
>>> adata.obs.groupby("celltype")["AT2 score"].mean().idxmax()
'AT2'
call_scSNP(cell_barcodes: list[str], bam_file: AlignmentFile | str, contig: str, position: int, alt: str, threshold: int = 9, cb_tag: str = 'CB', return_stats: bool = False)[source]#

With a list of valid cell barcodes as reference, scan a BAM file for reads aligning to a given genomic position and perform variant calling at that position for each cell.

This function is crude and useful only for single point mutation calling, i.e. “I know there is a mutation this position; tell me which cells have it.” Consider using more comprehensive tools such as cellsnp-lite.

Parameters#

cell_barcodes: list of str

A list of valid cell barcodes

bam_file: pysam.AlignmentFile or str

An instantiated pysam.AlignmentFile object representing an open, indexed BAM file. If str, function will open using: f = pysam.AlignmentFile(bam_file, “rb”)

contig: str

The contig containing the genomic position of interest.

position: int > 0

The 1-indexed position on the contig of the genomic position of interest.

alt: str

The alternate nucleotide. Number of reads with this alt nucleotide will be compared to total number of reads to perform variant calling. One of [“A”, “G”, “T”, “C”].

threshold: int (default: 9)

The minimum number of reads covering the region of needed to perform variant calling. If #reads < threshold, call NA. If #reads(ALT) < (threshold/3), call WT. Else, call alt allele.

cb_tag: str (default: “CB”)

The tag in the BAM file that stores the (corrected) cell barcodes. Usually one of [“CB”, “CR”].

return_stats: bool (default: False)

If True, return dictionary of variant calling statistics, including per-cell allele frequencies.

Returns#

variant_calls: list of str

List corresponding 1-to-1 with list of cell barcodes.

stats: dict (when return_stats is True)

Dictionary of variant calling statistics

References#

do_gprofiler_analysis(genes: list[str], organism: str = 'hsapiens', max_term_size: int = 10000, ordered: bool = False, sources: list[str] = ['GO:BP', 'GO:MF', 'REAC', 'KEGG', 'TF'], highlight: bool = True)[source]#

Given an ordered list of genes, performs an over-representation analysis (ORA) using the gProfiler API.

Parameters#

genes: list of str

List of genes to query, optionally sorted by user based on FDR or another significance metric.

organism: str (default: “hsapiens”)

The organism ID to use for the query. See https://biit.cs.ut.ee/gprofiler/page/organism-list for a list of organism IDs supported by gProfiler. Commonly one of: [“hsapiens”, “mmusculus”]

max_term_size: uint (default: 10000)

Filter results table to only include those results with term sizes less than this maximum. Set to None to disable.

ordered: bool (default: False)

If the list of genes is ordered by descending significance, set to True, otherwise set to False. A slightly different ORA is performed on ordered gene lists.

sources: list of str (default: [“GO:BP”, “GO:MF”, “REAC”, “KEGG”, “TF”])

The list of source databases to consider for gProfiler ORA. Some may only be available for certain organisms; see organism list page in the gProfiler documentation for more information.

highlight: bool (default: False)

If True, request that gProfiler add a ‘highlighted’ column to the resulting dataframe indicating if a given term is a ‘driver’ term. See: https://biit.cs.ut.ee/gprofiler/page/docs#highlight_go

Returns#

ora_df: pandas.DataFrame

The resulting gProfiler results dataframe.

References#

Usage#

>>> diff_exp_genes = ["SFTPC", "LAMP3", "NAPSA", "SFTPB", "EPCAM", "COL1A1"]
>>> ora_df = do_gprofiler_analysis(diff_exp_genes, ordered=False)
>>> print(ora_df.head(1)[["source", "name", "p_value", "intersections"]])
    source                 name   p_value          intersections
0   REAC  Surfactant metabolism  0.000033  [SFTPC, NAPSA, SFTPB]
scale_by_group(adata: AnnData, groupby: str = 'celltype', layer: str = None, copy: bool = True, zero_center: bool = False)[source]#

Performs z-standard scaling, similar to sc.pp.scale, but for each category in adata.obs[groupby] independently.

For instance, if groupby == "celltype", the resulting data matrix of z-scaled expression values would highlight changes from the mean of all cells within each celltype (but changes between celltypes would be largely meaningless).

Parameters#

adata: ad.AnnData

groupby: str (default: “celltype”)

The categorical column in adata.obs containing groups to subset and standard scale within.

layer: str (default: None)

The layer key in adata.layers whose values should be scaled (i.e. the starting matrix). If None, use adata.X.

zero_center: bool (default: False)

As in sc.pp.scale; when True, subtract the mean gene expression within each group. When False, the sparse structure of the gene expression data remains.

copy: bool (default: True)

If True, return a copy of the resulting scaled data matrix. If False, modify input adata by placing resulting matrix in adata.layers[scaled_by_ + str(groupby)].

Returns#

If copy is True, return the scaled matrix.

If copy is False, scaled matrix is placed in adata.layers["scaled_by_" + str(groupby)], which modifies adata in-place.

If zero_center is False, resulting scaled matrix is of type scipy.sparse.csr_matrix, else result is of type np.ndarray (dense).

subset_by_coordinates(adata: AnnData, x_bounds: list = [], y_bounds: list = [], use_rep: str = 'X_umap')[source]#

Given a bounding box in 2D space, return a cell mask containing only cells inside of that box.

Useful for subsetting data using a UMAP/other projection based on location rather than categorical/quantitative conditions (i.e. “I want cells from that bottom-right corner cluster only.”)

TODO: generalize to other/arbitrary bounding shapes

Parameters#

adata: ad.AnnData

The parent anndata object to subset

x_bounds: list/tuple (default: [])

The minimum and maximum bounds of the bounding box along the x-axis, i.e. [x_min, x_max]. Expressed as a fraction of the axis’ actual minimum and maximum span, i.e. in the range [0,1].

y_bounds: list/tuple (default: [])

As for x_bounds, but along the y-axis

use_rep: str (default: “X_umap”)

The projection to subset on. Only the first 2 dimensions of this projection are used during subsetting.

Returns#

mask: np.ndarray

The boolean mask on length adata.n_obs where cells within the bounding area are given value True, else False.

subset_by_geosketching(adata: AnnData, n_cells_to_keep: int = None, frac_cells_to_keep: float = 0.33, use_rep: str = 'X_pca')[source]#

Subsamples a single-cell dataset using geometric sketching to more fairly represent rare and common cell types

Defaults to keeping a fraction of cells. To use an exact target cell number, specify n_cells_to_keep, which will take priority over frac_cells_to_keep.

References:

Parameters#

adata: ad.AnnData

The original anndata object, with use_rep calculated and stored in the adata.obsm[use_rep] slot.

n_cells_to_keep: int (default: None)

The number of cells to keep in the subsampled adata object. Must be less than len(adata.obs.index).

frac_cells_to_keep: float (0,1)

Fraction of cells to keep; overridden by n_cells_to_keep when not None.

use_rep: str (default: “X_pca”)

The latent space representation to use. Must be a key in adata.obsm.

Returns#

mask: list[bool]

A boolean mask of length len(adata.obs.index) where True indicates which cells to keep after geosketching.

tf_idf_markers(adata: AnnData, groupby: str, n_genes: int = 10, n_genes_to_plot: int = 10, redo_threshold: bool = False, dendrogram: bool = False, save_markers: bool = False, return_fig: bool = False, var_key_to_ignore: str = None, **kwargs)[source]#

Performs term-frequency, inverse-document frequency calculation in order to rank marker genes across groups in groupby. This is an alternative to mean expression based DEG testing that prioritizes uniqueness of genes over fold-change in expression across groups.

TODO: make compatible with scanpy.tl.rank_genes_groups

Parameters#

adata: ad.AnnData

The anndata object for this experiment

groupby: str

The column in adata.obs that we will split the data on for the TF-IDF test.

n_genes: int > 0 (default: 10)

The number of top genes per group to return in the marker gene dictionary

n_genes_to_plot: int in range (0, n_genes] (default: 10)

The number of genes per group to plot in the resulting dotplot. Set to 0 to disable plot generation.

redo_threshold: bool (default: False)

If adata.layers["binary"] exists, redo the calculation of the binary matrix. Possibly useful when performing TF-IDF ranking on a subset of adata.

save_markers: bool (default: False)

If True, store the marker gene matrix in the adata object as adata.uns["TF_IDF_" + groupby]

return_fig: bool (default: False)

If True, output of this function changes to a tuple of dict, dpl (dotplot object from scanpy) instead of just dict. Otherwise, dotplot is generated and shown but not returned from this function.

var_key_to_ignore: str (default: None)

The name of a boolean column in adata.var which indicates genes that should be ignored during marker gene testing (often mitochondrial genes, pseudogenes, ribosomal genes, etc.). If None, no gene masking is performed.

kwargs:

Arguments passed to sc.pl.dotplot (if n_genes_to_plot>0)

Returns#

marker_dict: dict

The marker gene dictionary. Keys are adata.obs[groupby].unique()

dpl: sc.pl.DotPlot (optional)

If plot is True and return_fig is True, scanpy DotPlot object is also returned for further processing/saving.

References#