python_bioinformagicks.utilities package
Module contents
- get_grouped_categories(df: DataFrame, groupby: str, column_to_reorder: str, sort: bool = False)
Given a dataframe, subset by the groupby column, identify unique values in the grouped column_to_reorder, and return a flat list of unique categories, ordered first by parent group then by initial category ordering (or natsorted ordering).
Useful for resetting categories of the column_to_reorder.
Parameters
- df: pd.DataFrame
The dataframe. Must contain the categorical columns groupby and column_to_reorder.
- groupby: str
The column name in df to group/subset by. Must be a categorical column.
- column_to_reorder: str
The column name in df to reorder. Must be a categorical column.
- sort: bool (default: False)
If True, natsort the categories in each group before insertion into the final list. If False, leave the original category ordering.
Returns
- new_category_order: list of str
The reordered categories.
Usage
Here we will reorder categories in df["celltype"] by first grouping by df["compartment"]. In this case, since the Endothelial compartment is the first category in df["compartment"], the celltypes belonging to that compartment will be ordered before those of the next compartment (Epithelial), and so on.
>>> df["celltype"].cat.categories.tolist()
['ASM', 'AT1', 'AT1 | AT2', 'AT2', 'Alveolar Macrophage', ...]
>>> df["compartment"].cat.categories.tolist()
['Endothelial', 'Epithelial', 'Immune', 'Mesenchymal']
>>> new_order = get_grouped_categories(df, "compartment", "celltype", sort=True)
>>> df["celltype"] = df["celltype"].cat.reorder_categories(new_order)
>>> df["celltype"].cat.categories.tolist()
['Artery', 'Lymph', 'Proliferating gCap', 'Vein', 'aCap', 'gCap', 'AT1', ...]
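The ordering rule described above can be tried on a toy dataframe. The following is a minimal sketch, not the package's implementation: `grouped_categories_sketch` and `natural_key` are hypothetical names, and `natural_key` is a standard-library stand-in for natsort so the example has no extra dependency.

```python
import re

import pandas as pd


def natural_key(text):
    """Split digit runs from text so that 'c2' sorts before 'c10'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", text)]


def grouped_categories_sketch(df, groupby, column_to_reorder, sort=False):
    """Hypothetical re-implementation of the documented behavior (illustration only)."""
    new_order = []
    for group in df[groupby].cat.categories:
        # Values of column_to_reorder that occur within this parent group.
        present = set(df.loc[df[groupby] == group, column_to_reorder])
        # Keep the original category order, restricted to the values present.
        cats = [c for c in df[column_to_reorder].cat.categories if c in present]
        if sort:
            cats = sorted(cats, key=natural_key)
        # Assumption: a category seen under an earlier group is not repeated.
        new_order.extend(c for c in cats if c not in new_order)
    return new_order


df = pd.DataFrame({
    "compartment": pd.Categorical(
        ["Epithelial", "Endothelial", "Epithelial", "Endothelial"],
        categories=["Endothelial", "Epithelial"],
    ),
    "celltype": pd.Categorical(
        ["AT1", "gCap", "AT2", "aCap"],
        categories=["AT1", "AT2", "aCap", "gCap"],
    ),
})

new_order = grouped_categories_sketch(df, "compartment", "celltype")
print(new_order)  # ['aCap', 'gCap', 'AT1', 'AT2']
```

Because Endothelial is the first compartment category, its celltypes (aCap, gCap) move ahead of the Epithelial ones even though AT1 and AT2 came first in the original category order.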
- get_proportions(df: DataFrame, outer_col: str, inner_col: str, return_counts: bool = False)
For each value of outer_col, calculates the fraction of its items that fall into each value of inner_col.
Parameters
- df: pd.DataFrame
The dataframe containing at least the columns outer_col and inner_col.
- outer_col: str
The column name of the outermost column; often batch, sample, or genotype.
- inner_col: str
The column name of the innermost column; often celltype, leiden.
- return_counts: bool (default: False)
If True, return the number of cells, otherwise return the fraction.
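The calculation described here resembles a row-normalized cross-tabulation. The sketch below uses `pandas.crosstab` on made-up data to show the counts and the per-group fractions. The return type and layout of `get_proportions` are not documented in this reference, so the wide table shown is an assumption, not the function's actual output.

```python
import pandas as pd

# Toy data: three cells from sample s1, two from sample s2.
df = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s2", "s2"],
    "celltype": ["AT1", "AT1", "AT2", "AT1", "AT2"],
})

# Number of cells per (sample, celltype); analogous to return_counts=True.
counts = pd.crosstab(df["sample"], df["celltype"])

# Fraction of each sample's cells per celltype; each row sums to 1.
fractions = pd.crosstab(df["sample"], df["celltype"], normalize="index")

print(counts)
print(fractions)
```

In this toy case, two of the three s1 cells are AT1, so the s1 row of `fractions` holds 2/3 for AT1 and 1/3 for AT2.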
- in_ignore_list(g: str)
Given a gene symbol g, return True if this gene is any of the following:
A mitochondrial gene
A ribosomal gene
A hemoglobin gene
A lncRNA gene
An antisense gene
A microRNA
An uncharacterized/predicted gene
These genes are often uninformative or not a focus of study in transcriptomic analyses, and their inclusion in differential expression testing results can be distracting to readers.
This function identifies such nuisance genes from the gene symbol alone, so it may miss or include genes erroneously.
Parameters
- g: str
The gene symbol to test.
Returns
True if g is in any of those categories, else False.
Usage
>>> in_ignore_list("Gm12941")
True
>>> in_ignore_list("Actb")
False
>>> adata.var["ignore"] = [in_ignore_list(g) for g in adata.var.index]
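To illustrate what symbol-based filtering can look like, and why it can misfire, here is a minimal sketch built on regular expressions. The patterns and the name `looks_like_nuisance_gene` are hypothetical: they follow common mouse gene-symbol conventions and cover only some of the categories listed above. The rules actually used by `in_ignore_list` are not shown in this reference.

```python
import re

# Hypothetical mouse-style symbol patterns, chosen for illustration only.
NUISANCE_PATTERNS = [
    r"^mt-",       # mitochondrial, e.g. mt-Co1
    r"^Rp[sl]\d",  # ribosomal protein, e.g. Rps2, Rpl13
    r"^Hb[ab]-",   # hemoglobin, e.g. Hbb-bs
    r"^Mir\d",     # microRNA, e.g. Mir21a
    r"^Gm\d+$",    # predicted gene, e.g. Gm12941
    r"\d+Rik$",    # uncharacterized RIKEN clone, e.g. 2610005L07Rik
]
_NUISANCE_RE = re.compile("|".join(NUISANCE_PATTERNS))


def looks_like_nuisance_gene(symbol):
    """Return True if the symbol matches any of the illustrative patterns."""
    return _NUISANCE_RE.search(symbol) is not None


for gene in ["Gm12941", "Actb", "mt-Co1", "Rps6ka1"]:
    print(gene, looks_like_nuisance_gene(gene))
```

With these patterns, Rps6ka1 (a kinase, not a ribosomal protein) is flagged because its symbol starts with "Rps6". This is the kind of erroneous inclusion that matching on the symbol alone can produce.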
- make_combined_categorical_column(df: DataFrame, col_a: str, col_b: str, category_order: list[str] = None)
Given a dataframe df and two column labels, generates a new categorical series as a combination of the two columns.
Roughly a wrapper around the following code: (df[col_a].astype(str) + " " + df[col_b].astype(str)).astype("category")
Parameters
- df: pandas.DataFrame
The input dataframe
- col_a, col_b: str
Names of columns in df to combine. Must be str or categorical type columns.
- category_order: list of str (default: None)
The order of categories in the new combined column. New column categories are always separated by a single space, i.e. col_a_val + " " + col_b_val. The new category order must contain every possible combined value and nothing else.
When None, categories are ordered with the first column's order as parent and the second column's order as child, i.e. ["a0 b0", "a0 b1", …, "a1 b0", …]
Returns
- new_col: pd.Series
The new categorical series
Usage
Here, df["age"] and df["phase"] are categorical columns that have had their categories manually ordered by the user. We first demonstrate a simple approach to combining the columns that will default to alphabetical category ordering. We then compare to the result of this function, which maintains the original category ordering.
>>> df["age"].cat.categories.tolist()
['E12', 'E15', 'E17', 'E19', 'P3', ...]
>>> df["phase"].cat.categories.tolist()
['G1', 'S', 'G2M']
>>> df["age_phase_unordered"] = df["age"].astype(str) + " " + df["phase"].astype(str)
>>> df["age_phase_unordered"] = df["age_phase_unordered"].astype("category")
>>> df["age_phase_unordered"].cat.categories.tolist()
['E12 G1', 'E12 G2M', 'E12 S', 'E15 G1', 'E15 G2M', ...]
>>> df["age_phase_ordered"] = make_combined_categorical_column(df, "age", "phase")
>>> df["age_phase_ordered"].cat.categories.tolist()
['E12 G1', 'E12 S', 'E12 G2M', 'E15 G1', 'E15 S', ...]
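The parent/child default ordering can be reproduced with `itertools.product` over the two columns' categories. This is a minimal sketch under stated assumptions, not the package's code: `combined_categorical_sketch` is a hypothetical name, both inputs are assumed to be categorical, and unobserved combinations are kept as categories, which the real function may or may not do.

```python
from itertools import product

import pandas as pd


def combined_categorical_sketch(df, col_a, col_b, category_order=None):
    """Hypothetical illustration; assumes col_a and col_b are categorical."""
    combined = df[col_a].astype(str) + " " + df[col_b].astype(str)
    if category_order is None:
        # Parent/child order: each col_a category paired with every col_b category.
        category_order = [
            f"{a} {b}"
            for a, b in product(df[col_a].cat.categories, df[col_b].cat.categories)
        ]
    return pd.Series(
        pd.Categorical(combined, categories=category_order), index=df.index
    )


df = pd.DataFrame({
    "age": pd.Categorical(["P3", "E12", "E12"], categories=["E12", "P3"]),
    "phase": pd.Categorical(["S", "G2M", "G1"], categories=["G1", "S", "G2M"]),
})

new_col = combined_categorical_sketch(df, "age", "phase")
print(new_col.cat.categories.tolist())
# ['E12 G1', 'E12 S', 'E12 G2M', 'P3 G1', 'P3 S', 'P3 G2M']
```

The G1, S, G2M order within each age is preserved, whereas a plain `.astype("category")` would sort the combined strings alphabetically.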