python_bioinformagicks.utilities package#

Module contents#

get_grouped_categories(df: DataFrame, groupby: str, column_to_reorder: str, sort: bool = False)[source]#

Given a dataframe, subset it by each category of the groupby column, identify the unique values of column_to_reorder within each subset, and return a flat list of unique categories. The list is ordered first by parent group, then by the initial category ordering (or by natsorted ordering when sort is True).

Useful for resetting categories of the column_to_reorder.

Parameters#

df: pd.DataFrame

The dataframe. Must contain the categorical columns groupby and column_to_reorder.

groupby: str

The column name in df to group/subset by. Must be a categorical column.

column_to_reorder: str

The column name in df to reorder. Must be a categorical column.

sort: bool (default: False)

If True, natsort the categories in each group before insertion into final list. If False, leave the original category ordering.

Returns#

new_category_order: list of str

The reordered categories.

Usage#

Here we will reorder categories in df["celltype"] by first grouping by df["compartment"]. In this case, since the Endothelial compartment is the first category in df["compartment"], the celltypes belonging to that compartment will be ordered before those of the next compartment (Epithelial), and so on.

>>> df["celltype"].cat.categories.tolist()
['ASM', 'AT1', 'AT1 | AT2', 'AT2', 'Alveolar Macrophage', ...]
>>> df["compartment"].cat.categories.tolist()
['Endothelial', 'Epithelial', 'Immune', 'Mesenchymal']
>>> new_order = get_grouped_categories(df, "compartment", "celltype", sort=True)
>>> df["celltype"] = df["celltype"].cat.reorder_categories(new_order)
>>> df["celltype"].cat.categories.tolist()
['Artery', 'Lymph', 'Proliferating gCap', 'Vein', 'aCap', 'gCap', 'AT1', ...]
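
The grouping logic described above can be sketched in plain pandas. This is a minimal illustration, not the package's implementation: the name grouped_categories_sketch and the toy dataframe are hypothetical, and the built-in sorted() stands in for natsort so the example has no extra dependency.

```python
import pandas as pd


def grouped_categories_sketch(df, groupby, column_to_reorder, sort=False):
    """Hypothetical stand-in for get_grouped_categories (illustration only)."""
    original_order = list(df[column_to_reorder].cat.categories)
    new_order = []
    for group in df[groupby].cat.categories:
        # Values of column_to_reorder that actually occur in this group
        present = set(df.loc[df[groupby] == group, column_to_reorder].dropna())
        # Keep the original category ordering within the group
        members = [c for c in original_order if c in present]
        if sort:
            members = sorted(members)  # assumption: plain sort instead of natsort
        # Skip categories already placed by an earlier group
        new_order.extend(c for c in members if c not in new_order)
    return new_order


df = pd.DataFrame({
    "compartment": pd.Categorical(
        ["Epithelial", "Endothelial", "Epithelial", "Endothelial"],
        categories=["Endothelial", "Epithelial"],
    ),
    "celltype": pd.Categorical(
        ["AT1", "gCap", "AT2", "aCap"],
        categories=["AT1", "AT2", "aCap", "gCap"],
    ),
})

new_order = grouped_categories_sketch(df, "compartment", "celltype")
df["celltype"] = df["celltype"].cat.reorder_categories(new_order)
print(new_order)
```

Note that this sketch only returns categories that occur in the data, so reorder_categories would raise if the column carried unused categories; whether the real function behaves the same way is not stated here.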
get_proportions(df: DataFrame, outer_col: str, inner_col: str, return_counts: bool = False)[source]#

For each value of outer_col, calculates how many of its items fall into each value of inner_col, expressed as a fraction of the total number of items with that outer_col value.

Parameters#

df: pd.DataFrame

The dataframe containing at least the columns outer_col and inner_col.

outer_col: str

The column name of the outermost column; often batch, sample, or genotype.

inner_col: str

The column name of the innermost column; often celltype or leiden.

return_counts: bool (default: False)

If True, return the number of cells; otherwise, return the fraction.
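
As an illustration of the calculation, the following sketch computes per-group fractions with pandas.crosstab. The function name proportions_sketch, the toy data, and the wide (outer_col by inner_col) return shape are all assumptions; the actual return format of get_proportions is not documented above and may differ.

```python
import pandas as pd


def proportions_sketch(df, outer_col, inner_col, return_counts=False):
    """Hypothetical stand-in for get_proportions (illustration only)."""
    # Rows: outer_col values; columns: inner_col values; cells: item counts
    counts = pd.crosstab(df[outer_col], df[inner_col])
    if return_counts:
        return counts
    # Divide each row by its total so every outer_col group sums to 1
    return counts.div(counts.sum(axis=1), axis=0)


df = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "celltype": ["AT1", "AT1", "AT2", "gCap", "AT1", "gCap"],
})

fractions = proportions_sketch(df, "sample", "celltype")
counts = proportions_sketch(df, "sample", "celltype", return_counts=True)
print(fractions)
```

In this toy data, sample s1 has four cells, two of which are AT1, so its AT1 fraction is 0.5.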

in_ignore_list(g: str)[source]#

Given a gene symbol g, return True if this gene is any of the following:

  • A mitochondrial gene

  • A ribosomal gene

  • A hemoglobin gene

  • A lncRNA gene

  • An antisense gene

  • A microRNA

  • An uncharacterized/predicted gene

These genes are often uninformative or not a focus of study in transcriptomic analyses, and their inclusion in differential expression testing results can be distracting to readers.

This function identifies such nuisance genes based on gene symbol alone and, as such, may miss some genes or include others erroneously.

Parameters#

g: str

The gene symbol to test

Returns#

True if g is in any of those categories, else False.

Usage#

>>> in_ignore_list("Gm12941")
True
>>> in_ignore_list("Actb")
False
>>> adata.var["ignore"] = [in_ignore_list(g) for g in adata.var.index]
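
To show what symbol-based matching can look like, here is a small sketch using regular expressions. The patterns, the name looks_like_nuisance_gene, and the choice of mouse-style symbols are assumptions made for illustration; they cover only some of the categories listed above and are not the rules in_ignore_list actually uses.

```python
import re

# Hypothetical patterns for mouse-style gene symbols; not the library's list.
NUISANCE_PATTERNS = [
    re.compile(r"^mt-"),       # mitochondrial, e.g. mt-Co1
    re.compile(r"^Rp[ls]\d"),  # ribosomal proteins, e.g. Rpl13, Rps6
    re.compile(r"^Hb[ab]-"),   # hemoglobin, e.g. Hbb-bs
    re.compile(r"^Mir\d"),     # microRNA, e.g. Mir21a
    re.compile(r"^Gm\d+$"),    # predicted genes, e.g. Gm12941
    re.compile(r"\d+Rik$"),    # uncharacterized RIKEN clones
]


def looks_like_nuisance_gene(symbol):
    """Return True if the symbol matches any of the illustrative patterns."""
    return any(p.search(symbol) for p in NUISANCE_PATTERNS)


for symbol in ["Gm12941", "mt-Co1", "Actb", "Rps6ka1"]:
    print(symbol, looks_like_nuisance_gene(symbol))
```

The last symbol shows the kind of error the caveat above warns about: Rps6ka1 encodes a kinase rather than a ribosomal protein, yet it matches the ribosomal pattern in this sketch.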
make_combined_categorical_column(df: DataFrame, col_a: str, col_b: str, category_order: list[str] = None)[source]#

Given a dataframe df and two column labels, generates a new categorical series as a combination of the two columns.

Roughly a wrapper around the following code: (df[col_a].astype(str) + " " + df[col_b].astype(str)).astype("category")

Parameters#

df: pandas.DataFrame

The input dataframe

col_a, col_b: str

Names of columns in df to combine. Must be str or categorical type columns.

category_order: list of str (default: None)

The order of categories in the new combined column. New column categories are always separated by a single space, i.e. col_a_val + " " + col_b_val. The new category order must include all possible new category values and nothing else.

When None, categories are ordered with the first column's order as parent and the second column's order as child, i.e. ["a0 b0", "a0 b1", …, "a1 b0", …]

Returns#

new_col: pd.Series

The new categorical series

Usage#

Here, df["age"] and df["phase"] are categorical columns that have had their categories manually ordered by the user. We first demonstrate a simple approach to combining the columns that will default to alphabetical category ordering. We then compare to the result of this function, which maintains the original category ordering.

>>> df["age"].cat.categories.tolist()
['E12', 'E15', 'E17', 'E19', 'P3', ...]
>>> df["phase"].cat.categories.tolist()
['G1', 'S', 'G2M']
>>> df["age_phase_unordered"] = df["age"].astype(str) + " " + df["phase"].astype(str)
>>> df["age_phase_unordered"] = df["age_phase_unordered"].astype("category")
>>> df["age_phase_unordered"].cat.categories.tolist()
['E12 G1', 'E12 G2M', 'E12 S', 'E15 G1', 'E15 G2M', ...]
>>> df["age_phase_ordered"] = make_combined_categorical_column(df, "age", "phase") 
>>> df["age_phase_ordered"].cat.categories.tolist()
['E12 G1', 'E12 S', 'E12 G2M', 'E15 G1', 'E15 S', ...]
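
The default parent/child ordering can be sketched with itertools.product over the two columns' categories. This is a minimal illustration with a hypothetical name (combined_categorical_sketch) and toy data; it assumes both columns are categorical and keeps every combination as a category, including ones that never occur, which may differ from what make_combined_categorical_column does.

```python
from itertools import product

import pandas as pd


def combined_categorical_sketch(df, col_a, col_b):
    """Hypothetical stand-in for make_combined_categorical_column."""
    combined = df[col_a].astype(str) + " " + df[col_b].astype(str)
    # col_a categories vary slowest (parent), col_b categories fastest (child)
    order = [
        f"{a} {b}"
        for a, b in product(df[col_a].cat.categories, df[col_b].cat.categories)
    ]
    return combined.astype(pd.CategoricalDtype(categories=order))


df = pd.DataFrame({
    "age": pd.Categorical(["P3", "E12", "E12", "P3"], categories=["E12", "P3"]),
    "phase": pd.Categorical(["S", "G2M", "G1", "G1"], categories=["G1", "S", "G2M"]),
})

age_phase = combined_categorical_sketch(df, "age", "phase")
print(age_phase.cat.categories.tolist())
```

Because the order is built from the existing categories rather than from sorted strings, 'E12 S' precedes 'E12 G2M', matching the user-defined phase order.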