API

Statistical Analysis

EdgeR

Dimensionality Reduction

Dimensionality reduction functions for Transcriptional Analysis with Python Imported from R (TAPIR) F. Comitani @2021

embedding.get_umap(vals, collinear_thresh=None, var_drop_thresh=None, n_neighbors='sqrt', **kwargs)

Wrapper function for UMAP dimensionality reduction.

Parameters

vals (np.array or pandas dataframe) – expression counts matrix with samples as rows and features (genes) as columns.
var_drop_thresh (float) – cumulative variance cutoff for near_zero_var_drop. If None, skip this step (default None).
collinear_thresh (float) – correlation cutoff for remove_collinear. If None, skip this step (default None).
n_neighbors (int or ‘sqrt’) – number of nearest neighbours for UMAP dimensionality reduction. If ‘sqrt’, take the square root of the total number of samples (default ‘sqrt’).
**kwargs – further keyword arguments passed to UMAP.

Returns

proj: coordinates of the projected points.
trained_map: the trained UMAP object.

Return type

proj (pandas dataframe), trained_map (UMAP)
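
The ‘sqrt’ shortcut for n_neighbors can be illustrated with a minimal sketch. The helper name resolve_n_neighbors is hypothetical and not part of TAPIR, and rounding to the nearest integer is an assumption, since the documentation does not say how the square root is converted to an integer.

```python
import math


def resolve_n_neighbors(n_neighbors, n_samples):
    """Hypothetical helper: turn the 'sqrt' shortcut into an integer."""
    if n_neighbors == 'sqrt':
        # Assumption: round to the nearest integer; TAPIR may truncate instead.
        return round(math.sqrt(n_samples))
    return int(n_neighbors)


# With 400 samples the 'sqrt' shortcut resolves to a neighbourhood of 20.
print(resolve_n_neighbors('sqrt', 400))
# An explicit integer is passed through unchanged.
print(resolve_n_neighbors(15, 400))
```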

embedding.near_zero_var_drop(df, thresh=0.99)

Remove features with near-zero variance.

Parameters

df (pandas dataframe) – expression counts matrix with samples as rows and features (genes) as columns. All features within this matrix are compared.
thresh (float) – cumulative variance cutoff. Features are kept until their variances sum up to this fraction of the total variance (default 0.99).

Returns

reduced expression counts matrix.

Return type

(pandas dataframe)
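
A cumulative variance cutoff of this kind might look like the following sketch. It is not TAPIR's implementation: the function name drop_low_variance is hypothetical, and the choice to also keep the feature that crosses the threshold is an assumption.

```python
import pandas as pd


def drop_low_variance(df, thresh=0.99):
    """Hypothetical sketch of a cumulative variance filter."""
    # Rank features from the most to the least variable.
    variances = df.var().sort_values(ascending=False)
    # Fraction of the total variance accumulated up to each feature.
    cumulative = variances.cumsum() / variances.sum()
    # Assumption: keep a feature if the features ranked before it
    # have not yet reached the cutoff.
    keep = cumulative.shift(fill_value=0.0) < thresh
    return df[variances.index[keep.to_numpy()]]


counts = pd.DataFrame({
    'gene_a': [0.0, 10.0, 20.0, 30.0],  # highly variable
    'gene_b': [0.0, 1.0, 0.0, 1.0],     # nearly constant
    'gene_c': [5.0, 5.0, 5.0, 5.0],     # constant
})
reduced = drop_low_variance(counts, thresh=0.99)
print(list(reduced.columns))
```

In this toy matrix gene_a alone carries more than 99% of the total variance, so the other two features are dropped.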

embedding.remove_collinear(df, thresh=0.75)

Remove collinearity. WARNING: slow!

Parameters

df (pandas dataframe) – expression counts matrix with samples as rows and features (genes) as columns. All features within this matrix are compared.
thresh (float) – correlation cutoff. Features whose correlation is above this value will be candidates for removal (default 0.75).

Returns

reduced expression counts matrix.

Return type

(pandas dataframe)
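
One common way to remove collinear features is sketched below. The function name drop_collinear is hypothetical, and the rule for choosing which member of a correlated pair to drop (the one appearing later in the column order) is an assumption; TAPIR may select candidates differently.

```python
import numpy as np
import pandas as pd


def drop_collinear(df, thresh=0.75):
    """Hypothetical sketch: drop the later feature of each correlated pair."""
    # Absolute pairwise Pearson correlation between features (columns).
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates above the cutoff with an earlier one.
    to_drop = [col for col in upper.columns if (upper[col] > thresh).any()]
    return df.drop(columns=to_drop)


counts = pd.DataFrame({
    'gene_a': [1.0, 2.0, 3.0, 4.0],
    'gene_b': [2.0, 4.0, 6.0, 8.1],  # almost a multiple of gene_a
    'gene_c': [4.0, 1.0, 3.0, 2.0],  # weakly correlated with the others
})
print(list(drop_collinear(counts).columns))
```

Computing the full correlation matrix scales quadratically with the number of features, which is consistent with the warning that this step is slow.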

Gene Sets Enrichment Analysis

Gene sets enrichment analysis functions for Transcriptional Analysis with Python Imported from R (TAPIR) F. Comitani @2021

gsets.connection_matrix_gsets(gset, sets_dict)

Builds a matrix counting the number of times genes are found together in gene sets.

Parameters

gset (list of strings) – the list of genes to search and compare.
sets_dict (dictionary) – a dictionary of the gene sets to be considered for the counting. The format should follow the output of gset_as_dict.

Returns

a square dataframe with the number of common gene sets between each provided gene pair.

Return type

(pandas dataframe)
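
The counting can be sketched as follows. This is an illustration rather than TAPIR's code: the function name count_shared_sets is hypothetical, and filling the diagonal with the number of sets containing each gene is an assumption.

```python
import pandas as pd


def count_shared_sets(genes, sets_dict):
    """Hypothetical sketch: count the gene sets shared by each gene pair."""
    matrix = pd.DataFrame(0, index=genes, columns=genes)
    for members in sets_dict.values():
        member_set = set(members)
        # Genes of interest that belong to the current gene set.
        present = [gene for gene in genes if gene in member_set]
        # Every pair found together in this set gains one count.
        # Assumption: the diagonal counts the sets containing the gene.
        for first in present:
            for second in present:
                matrix.loc[first, second] += 1
    return matrix


toy_sets = {
    'SET_1': ['A', 'B'],
    'SET_2': ['A', 'B', 'C'],
    'SET_3': ['C', 'D'],
}
shared = count_shared_sets(['A', 'B', 'C'], toy_sets)
print(shared)
```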

gsets.gset_as_dict(subsel=None, ref=None)

Reads in a gene sets reference file and transforms it into a dictionary.

Parameters

subsel (list of strings) – gene sets to subselect. All sets containing any of the strings in this list will be kept (default None).
ref (str) – path to the reference files containing the gene sets to be included in the analysis, if None use the provided file (default None).

Returns

a dictionary with gene set names as keys and the lists of member genes as values.

Return type

sets_dict (dictionary)
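
The subsel behaviour, where a set is kept if its name contains any of the given strings, can be sketched on an in-memory dictionary. The function name filter_sets and the toy gene sets are hypothetical; reading the reference file itself is not shown because its format is not described here.

```python
def filter_sets(sets_dict, subsel=None):
    """Hypothetical sketch: keep sets whose name contains any substring."""
    if subsel is None:
        # No subselection requested: keep every gene set.
        return dict(sets_dict)
    return {
        name: genes
        for name, genes in sets_dict.items()
        if any(pattern in name for pattern in subsel)
    }


toy_sets = {
    'HALLMARK_APOPTOSIS': ['CASP3', 'BAX'],
    'HALLMARK_HYPOXIA': ['VEGFA', 'HIF1A'],
    'KEGG_CELL_CYCLE': ['CDK1', 'CCNB1'],
}
print(sorted(filter_sets(toy_sets, subsel=['HALLMARK_'])))
```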

gsets.run_gsea(df, subsel=None, type='gsea', ref=None, tmp_path='./tmp_gsea', **kwargs)

Run gene sets enrichment analysis.

Parameters

df (pandas dataframe) – count matrix or preranked list of genes to be used for enrichment analysis.
subsel (list of strings) – list of pathways to subselect. All pathways containing the provided strings will be selected (e.g. ‘HALLMARK_’ will select all hallmark of cancer pathways, default None).
type (str) – choose the type of analysis to run: single sample ‘ssgsea’, preranked ‘prerank’ or standard ‘gsea’ (default ‘gsea’).
ref (str) – path to the reference files containing the gene sets to be included in the analysis, if None use the provided file (default None).
tmp_path (str) – path where temporary files will be stored. If it doesn’t exist, the function will try to create it (default ‘./tmp_gsea’).
**kwargs – keyword parameters for the enrichment analysis functions.

Returns

dataframe with enrichment scores and p-values, where available.

Return type

(pandas dataframe)

Immune Deconvolution

Plotting

Plotting functions for Transcriptional Analysis with Python Imported from R (TAPIR) F. Comitani @2021

class plotting.Palettes

Bases: object

Container for color palettes.

greypal = ['#FFFFFF', '#333333']

greypalmap = <matplotlib.colors.LinearSegmentedColormap object>

midpal = ['#355C7D', '#6C5B7B', '#C06C84', '#F67280', '#F8B195']

midpalmap = <matplotlib.colors.LinearSegmentedColormap object>

nupal = ['#247ba0', '#70c1b3', '#b2dbbf', '#f3ffbd', '#ff7149']

nupal_bin = ['#247ba0', '#92BDD0', '#ffffff', '#FFB8A4', '#ff7149']

nupalmap = <matplotlib.colors.LinearSegmentedColormap object>

nupalmap_bin = <matplotlib.colors.LinearSegmentedColormap object>

plotting.plot_clusters(proj, groups=None, values=None, clab='log$_2$(TPM+1)', grid=False, palette=None, save_file=None)

Plot samples in the UMAP embedded space, coloured by group or by a continuous value.

Parameters

proj (pandas dataframe) – UMAP embedded space, with samples as rows and x and y coordinates as columns.
groups (pandas series) – list of classes or clusters for the categorical color map. If None, skip. This takes precedence when both groups and values are provided (default None).
values (pandas series) – list of values for the continuous color map. If None, skip. This is ignored when both groups and values are provided (default None).
clab (string) – color bar label (default ‘log$_2$(TPM+1)’).
grid (bool) – if True, show the grid and coordinate values on the axes (default False).
palette (list of strings) – list of colours for the groups to plot. If None, use the default 10 colors matplotlib palette (default None).
save_file (string) – path and name of the png file where the plot will be saved. If None, save in the current folder (default None).

plotting.plot_distribution(df, groups, labs, up_feat, dw_feat=None, save_file=None)

Plot distributions and median values for given features and groups. These can be plotted on two levels (up and dw) for an easy comparison.

Parameters

df (pandas dataframe) – count values matrix containing the information to plot.
groups (list of int or strings) – the groups for which the distributions will be plotted (as rows).
labs (pandas dataframe) – one-hot-encoded classes membership dataframe with samples as rows and classes as columns.
up_feat (list of strings) – the features to plot.
dw_feat (list of strings) – additional features to plot at the bottom of each row, to compare with the up_feat. This argument is optional (default None).
save_file (string) – path and name of the png file where the plot will be saved. If None, save in the current folder (default None).

plotting.plot_genes_network(gset, subsel, ref=None, exp=None, cutoff=0.1, save_file=None)

Plot a gene set as a network of interconnected genes. Each gene is represented by a circle, whose size is proportional to the number of gene sets it appears in. Genes are connected by lines, representing the number of gene sets the connected genes appear in together. The map will attempt to put genes that appear often together in proximity.

Parameters

gset (string) – name of the gene set to plot.
subsel (list of strings) – gene sets to subselect, that will be used for building the connection matrix. All sets containing any of the strings in this list will be kept.
ref (str) – path to the reference files containing the gene sets to be included in the analysis, if None use the provided file (default None).
exp (pandas dataframe) – expression counts. If provided, genes will be colour coded according to their relative expression within the gene set (dark = low, light = high, default None).
cutoff (float) – fractional cutoff for the connection lines. Only lines that reach above this fraction of the maximum connection value reached in the matrix will be shown (default 0.1).
save_file (string) – path and name of the png file where the plot will be saved. If None, save in the current folder (default None).
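
The effect of cutoff on the connection lines can be sketched as below. The function name edges_above_cutoff is hypothetical, and taking the maximum over gene pairs only (ignoring the diagonal) is an assumption, since the documentation does not say whether the diagonal is included.

```python
import numpy as np


def edges_above_cutoff(conn, cutoff=0.1):
    """Hypothetical sketch: list gene pairs whose connection passes the cutoff."""
    conn = np.asarray(conn, dtype=float)
    # Visit each unordered gene pair once (strict upper triangle).
    rows, cols = np.triu_indices_from(conn, k=1)
    weights = conn[rows, cols]
    # Assumption: the reference maximum is taken over gene pairs only.
    limit = cutoff * weights.max()
    return [
        (int(i), int(j), float(w))
        for i, j, w in zip(rows, cols, weights)
        if w > limit
    ]


# Toy connection matrix for three genes (indices 0, 1 and 2).
toy_conn = [
    [3, 10, 1],
    [10, 5, 0],
    [1, 0, 2],
]
print(edges_above_cutoff(toy_conn, cutoff=0.1))
```

With a maximum pairwise value of 10 and a cutoff of 0.1, only connections strictly above 1 are drawn in this sketch.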

plotting.plot_heatmap(df, groups, labs, feats, diverging=False, vmin=None, vmax=None, clab='log$_2$(TPM+1)', save_file=None)

A heatmap for groups and features (genes).

Parameters

df (pandas dataframe) – count values matrix containing the information to plot.
groups (list of int or strings) – the groups to be plotted (as rows).
labs (pandas dataframe) – one-hot-encoded classes membership dataframe with samples as rows and classes as columns.
feats (list of strings) – the features to plot.
diverging (bool) – if True, use a diverging palette, useful if the range to plot is not strictly positive (default False).
vmin (float) – minimum boundary for the color bar, if None it will be inferred from the data (default None).
vmax (float) – maximum boundary for the color bar, if None it will be inferred from the data (default None).
clab (string) – color bar label (default ‘log$_2$(TPM+1)’).
save_file (string) – path and name of the png file where the plot will be saved. If None, save in the current folder (default None).

plotting.plot_survival(curves, xlab='years', ylab='OST', palette=None, save_file=None)

Plot survival curves.

Parameters

curves (pandas dataframe) – dataframe containing survival values by group (columns) and time (index). The format should correspond to the output of st_curves.
xlab (string) – x-axis label (time, default ‘years’).
ylab (string) – y-axis label (survival counts, default ‘OST’).
palette (list of strings) – list of colours for the curves to plot. If None, use the default 10 colors matplotlib palette (default None).
save_file (string) – path and name of the png file where the plot will be saved. If None, save in the current folder (default None).

Utils

Auxiliary functions for Transcriptional Analysis with Python Imported from R (TAPIR) F. Comitani @2021

auxiliary.invert_dict(dictio)

Invert a dictionary whose values are lists with non-unique elements.

Parameters

dictio (dictionary) – the dictionary to invert.

Returns

a dictionary where the old values are now keys and the old keys are now in the value lists.

Return type

inv_dict (dictionary)
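
A minimal sketch of this kind of inversion, using only the standard library. The function name invert is hypothetical and the code is not taken from TAPIR.

```python
from collections import defaultdict


def invert(dictio):
    """Hypothetical sketch: map each list element back to the keys holding it."""
    inverted = defaultdict(list)
    for key, values in dictio.items():
        for value in values:
            # A value shared by several keys collects all of them.
            inverted[value].append(key)
    return dict(inverted)


# Gene sets to genes becomes genes to gene sets.
toy_sets = {'SET_1': ['A', 'B'], 'SET_2': ['B', 'C']}
print(invert(toy_sets))
```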

auxiliary.smart_selection(labs, to_select, how='any', val=1)

Simplify selection of a single or multiple groups in a pandas dataframe.

Parameters

labs (pandas dataframe) – one-hot-encoded classes membership dataframe with samples as rows and classes as columns.
to_select (list of int or strings) – the groups to be selected.
how (string) – selection type, samples are chosen if they belong to ‘any’ or ‘all’ classes (default ‘any’).
val (int) – selection value, samples are chosen if their value in the classes membership dataframe corresponds to this (default 1).

Returns

boolean series with the corresponding selection.

Return type

(pandas series)
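
The ‘any’ and ‘all’ selection modes can be sketched on a toy one-hot-encoded dataframe. The function name select_samples is hypothetical and the code is an illustration under the parameter descriptions above, not TAPIR's implementation.

```python
import pandas as pd


def select_samples(labs, to_select, how='any', val=1):
    """Hypothetical sketch of a boolean selection over one-hot classes."""
    # Accept a single group as well as a list of groups.
    if not isinstance(to_select, (list, tuple)):
        to_select = [to_select]
    matches = labs[list(to_select)] == val
    if how == 'all':
        # The sample must belong to every requested class.
        return matches.all(axis=1)
    if how == 'any':
        # The sample must belong to at least one requested class.
        return matches.any(axis=1)
    raise ValueError("how must be 'any' or 'all'")


labs = pd.DataFrame(
    {'class_1': [1, 0, 1], 'class_2': [1, 1, 0]},
    index=['sample_1', 'sample_2', 'sample_3'],
)
print(select_samples(labs, ['class_1', 'class_2'], how='all'))
```

The resulting boolean series can be used directly to index the rows of a counts matrix that shares the same sample index.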