API
Statistical Analysis
EdgeR
Dimensionality Reduction
Dimensionality reduction functions for Transcriptional Analysis with Python Imported from R (TAPIR) F. Comitani @2021
- embedding.get_umap(vals, collinear_thresh=None, var_drop_thresh=None, n_neighbors='sqrt', **kwargs)
Wrapper function for UMAP dimensionality reduction.
- Parameters
vals (np.array or pandas dataframe) – expression counts matrix with samples as rows and features (genes) as columns.
var_drop_thresh (float) – cumulative variance cutoff for near_zero_var_drop. If None, skip this step (default None).
collinear_thresh (float) – correlation cutoff for remove_collinear. If None, skip this step (default None).
n_neighbors (int or str) – number of nearest neighbours for UMAP dimensionality reduction. If ‘sqrt’, use the square root of the total number of samples (default ‘sqrt’).
kwargs (dict) – dictionary containing further arguments for UMAP.
- Returns
proj (pandas dataframe): coordinates of the projected points.
trained_map (UMAP): the trained UMAP object.
- Return type
(pandas dataframe, UMAP)
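The ‘sqrt’ default for n_neighbors can be illustrated with a minimal sketch. The helper name resolve_n_neighbors is hypothetical and not part of TAPIR, and truncating the square root to an integer is an assumption, since the documentation does not state how the value is rounded.

```python
import math


def resolve_n_neighbors(n_neighbors, n_samples):
    """Hypothetical helper: turn the n_neighbors argument into an integer."""
    if n_neighbors == "sqrt":
        # Assumption: the square root of the sample count is truncated.
        return int(math.sqrt(n_samples))
    return int(n_neighbors)


print(resolve_n_neighbors("sqrt", 400))  # 20
print(resolve_n_neighbors(15, 400))      # 15
```

With this rule the neighbourhood size grows slowly with the cohort, so larger datasets get a somewhat more global view without an explicit setting.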
- embedding.near_zero_var_drop(df, thresh=0.99)
Remove features with near-zero variance.
- Parameters
df (pandas dataframe) – expression counts matrix with samples as rows and features (genes) as columns. All features within this matrix are compared.
thresh (float) – cumulative variance cutoff. Features whose variances sum up to this fraction of the total variance will be kept (default 0.99).
- Returns
reduced expression counts matrix.
- Return type
(pandas dataframe)
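The cumulative variance cutoff described above can be sketched in plain Python. This is an illustration of the idea, not TAPIR's implementation: the function name near_zero_var_keep, the dict-of-lists input, and the choice to rank features from highest to lowest variance before accumulating are all assumptions.

```python
from statistics import pvariance


def near_zero_var_keep(columns, thresh=0.99):
    """Return the feature names to keep (illustrative only).

    columns maps each feature name to its list of values across samples.
    """
    variances = {name: pvariance(vals) for name, vals in columns.items()}
    total = sum(variances.values())
    if total == 0:
        return []
    kept, cumulative = [], 0.0
    # Assumption: walk features from highest to lowest variance and stop
    # once the kept features already explain `thresh` of the total.
    for name in sorted(variances, key=variances.get, reverse=True):
        if cumulative / total >= thresh:
            break
        kept.append(name)
        cumulative += variances[name]
    return kept


# Toy data, not real expression values.
counts = {
    "GENE_A": [0.0, 10.0, 0.0, 10.0],  # variance 25
    "GENE_B": [1.0, 3.0, 1.0, 3.0],    # variance 1
    "GENE_C": [5.0, 5.0, 5.0, 5.0],    # variance 0
}
print(near_zero_var_keep(counts, thresh=0.99))  # ['GENE_A', 'GENE_B']
print(near_zero_var_keep(counts, thresh=0.95))  # ['GENE_A']
```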
- embedding.remove_collinear(df, thresh=0.75)
Remove collinearity. WARNING: slow!
- Parameters
df (pandas dataframe) – expression counts matrix with samples as rows and features (genes) as columns. All features within this matrix are compared.
thresh (float) – correlation cutoff. Features whose correlation is above this value will be candidates for removal (default 0.75).
- Returns
reduced expression counts matrix.
- Return type
(pandas dataframe)
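A rough sketch of correlation-based pruning follows. It is not the TAPIR algorithm: the greedy keep-first strategy, the use of the absolute Pearson correlation, and the names pearson and collinear_keep are assumptions made for illustration. The pairwise comparison is quadratic in the number of features, which is consistent with the speed warning above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def collinear_keep(columns, thresh=0.75):
    """Greedily keep features not strongly correlated with any kept one."""
    kept = []
    for name, vals in columns.items():
        # Assumption: compare the absolute correlation with kept features.
        if all(abs(pearson(vals, columns[k])) <= thresh for k in kept):
            kept.append(name)
    return kept


# Toy data, not real expression values.
counts = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.5],  # nearly proportional to GENE_A
    "GENE_C": [4.0, 1.0, 3.0, 2.0],
}
print(collinear_keep(counts))  # ['GENE_A', 'GENE_C']
```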
Gene Sets Enrichment Analysis
Gene sets enrichment analysis functions for Transcriptional Analysis with Python Imported from R (TAPIR) F. Comitani @2021
- gsets.connection_matrix_gsets(gset, sets_dict)
Build a matrix counting the number of times genes are found together in gene sets.
- Parameters
gset (list of strings) – the list of genes to search and compare.
sets_dict (dictionary) – a dictionary of the gene sets to be considered for the counting. The format should follow the output of gset_as_dict.
- Returns
a square dataframe with the number of common gene sets between each provided gene pair.
- Return type
(pandas dataframe)
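The counting can be sketched with nested dictionaries standing in for the square dataframe. The function name connection_counts and the toy gene sets are hypothetical, and letting the diagonal hold the number of sets each gene appears in is an assumption.

```python
def connection_counts(gset, sets_dict):
    """Count shared gene sets for every pair of genes in gset (illustrative)."""
    matrix = {a: {b: 0 for b in gset} for a in gset}
    for members in sets_dict.values():
        member_set = set(members)
        present = [g for g in gset if g in member_set]
        # Every pair of genes found in this set shares one more gene set.
        for a in present:
            for b in present:
                matrix[a][b] += 1
    return matrix


# Toy gene sets in the {set name: [genes]} layout described for gset_as_dict.
sets_dict = {
    "SET_1": ["TP53", "MYC", "EGFR"],
    "SET_2": ["TP53", "MYC"],
    "SET_3": ["EGFR"],
}
matrix = connection_counts(["TP53", "MYC", "EGFR"], sets_dict)
print(matrix["TP53"]["MYC"])   # 2
print(matrix["TP53"]["EGFR"])  # 1
```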
- gsets.gset_as_dict(subsel=None, ref=None)
Read in a gene sets reference file and transform it into a dictionary.
- Parameters
subsel (list of strings) – gene sets to subselect. All sets containing any of the strings in this list will be kept (default None).
ref (str) – path to the reference file containing the gene sets to be included in the analysis. If None, use the provided file (default None).
- Returns
a dictionary with gene set names as keys and the list of member genes as values.
- Return type
sets_dict (dictionary)
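The subsel matching rule (keep every set whose name contains any of the given strings) can be sketched on an in-memory dictionary. The helper subselect_sets is hypothetical and the memberships below are toy values; reading the reference file is deliberately left out because its format is not described here.

```python
def subselect_sets(sets_dict, subsel=None):
    """Keep gene sets whose name contains any string in subsel (illustrative)."""
    if subsel is None:
        return dict(sets_dict)
    return {
        name: genes
        for name, genes in sets_dict.items()
        if any(pattern in name for pattern in subsel)
    }


# Toy memberships, for illustration only.
all_sets = {
    "HALLMARK_APOPTOSIS": ["GENE_A", "GENE_B"],
    "HALLMARK_HYPOXIA": ["GENE_C"],
    "KEGG_CELL_CYCLE": ["GENE_A", "GENE_D"],
}
print(sorted(subselect_sets(all_sets, ["HALLMARK_"])))
# ['HALLMARK_APOPTOSIS', 'HALLMARK_HYPOXIA']
```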
- gsets.run_gsea(df, subsel=None, type='gsea', ref=None, tmp_path='./tmp_gsea', **kwargs)
Run gene sets enrichment analysis.
- Parameters
df (pandas dataframe) – count matrix or preranked list of genes to be used for enrichment analysis.
subsel (list of strings) – list of pathways to subselect. All pathways containing the provided strings will be selected (e.g. ‘HALLMARK_’ will select all hallmark of cancer pathways, default None).
type (str) – choose the type of analysis to run: single sample ‘ssgsea’, preranked ‘prerank’ or standard ‘gsea’ (default ‘gsea’).
ref (str) – path to the reference file containing the gene sets to be included in the analysis. If None, use the provided file (default None).
tmp_path (str) – path where temporary files will be stored. If it doesn’t exist, the function will try to create it (default ‘./tmp_gsea’).
**kwargs – keyword parameters for the enrichment analysis functions.
- Returns
dataframe with enrichment scores and p-values, where available.
- Return type
(pandas dataframe)
Immune Deconvolution
Plotting
Plotting functions for Transcriptional Analysis with Python Imported from R (TAPIR) F. Comitani @2021
- class plotting.Palettes
Bases: object
Container for color palettes.
- greypal = ['#FFFFFF', '#333333']
- greypalmap = <matplotlib.colors.LinearSegmentedColormap object>
- midpal = ['#355C7D', '#6C5B7B', '#C06C84', '#F67280', '#F8B195']
- midpalmap = <matplotlib.colors.LinearSegmentedColormap object>
- nupal = ['#247ba0', '#70c1b3', '#b2dbbf', '#f3ffbd', '#ff7149']
- nupal_bin = ['#247ba0', '#92BDD0', '#ffffff', '#FFB8A4', '#ff7149']
- nupalmap = <matplotlib.colors.LinearSegmentedColormap object>
- nupalmap_bin = <matplotlib.colors.LinearSegmentedColormap object>
- plotting.plot_clusters(proj, groups=None, values=None, clab='log$_2$(TPM+1)', grid=False, palette=None, save_file=None)
Plot samples in the embedded space, coloured by group or by a continuous value.
- Parameters
proj (pandas dataframe) – UMAP embedded space, with samples as rows and x and y coordinates as columns.
groups (pandas series) – list of classes or clusters for the categorical color map. If None, skip. This takes precedence when both groups and values are provided (default None).
values (pandas series) – list of values for the continuous color map. If None, skip. This is ignored when both groups and values are provided (default None).
clab (string) – color bar label (default ‘log$_2$(TPM+1)’).
grid (bool) – if True, show the grid and coordinate values on the axes (default False).
palette (list of strings) – list of colours for the groups to plot. If None, use the default 10 colors matplotlib palette (default None).
save_file (string) – path and name to png file where plot will be saved. If None, save in the current folder (default None).
- plotting.plot_distribution(df, groups, labs, up_feat, dw_feat=None, save_file=None)
Plot distributions and median values for given features and groups. These can be plotted on two levels (up and dw) for an easy comparison.
- Parameters
df (pandas dataframe) – count values matrix containing the information to plot.
groups (list of int or strings) – the groups for which the distributions will be plotted (as rows).
labs (pandas dataframe) – one-hot-encoded class membership dataframe with samples as rows and classes as columns.
up_feat (list of strings) – the features to plot.
dw_feat (list of strings) – additional features to plot at the bottom of each row, to compare with the up_feat. This argument is optional (default None).
save_file (string) – path and name to png file where plot will be saved. If None, save in the current folder (default None).
- plotting.plot_genes_network(gset, subsel, ref=None, exp=None, cutoff=0.1, save_file=None)
Plot a gene set as a network of interconnected genes. Each gene is represented by a circle, whose size is proportional to the number of gene sets it appears in. Genes are connected by lines, representing the number of gene sets the connected genes appear in together. The map will attempt to put genes that appear often together in proximity.
- Parameters
gset (string) – name of the gene set to plot.
subsel (list of strings) – gene sets to subselect, that will be used for building the connection matrix. All sets containing any of the strings in this list will be kept (default None).
ref (str) – path to the reference file containing the gene sets to be included in the analysis. If None, use the provided file (default None).
exp (pandas dataframe) – expression counts. If provided, genes will be colour coded according to their relative expression within the gene set (dark = low, light = high, default None).
cutoff (float) – fractional cutoff for the connection lines. Only lines whose value exceeds this fraction of the maximum connection value reached in the matrix will be shown (default 0.1).
save_file (string) – path and name to png file where plot will be saved. If None, save in the current folder (default None).
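The relative cutoff on connection lines can be sketched as a filter over a connection matrix. The function name edges_above_cutoff, the nested-dict matrix, the strict comparison, and taking the maximum over gene pairs only (ignoring the diagonal) are assumptions, not documented behaviour.

```python
def edges_above_cutoff(matrix, cutoff=0.1):
    """Return (gene_a, gene_b, weight) for pairs above the relative cutoff."""
    genes = list(matrix)
    # Visit each unordered pair once; the diagonal is skipped.
    pairs = [
        (a, b, matrix[a][b])
        for i, a in enumerate(genes)
        for b in genes[i + 1:]
    ]
    max_val = max((w for _, _, w in pairs), default=0)
    if max_val == 0:
        return []
    # Assumption: a line is drawn only if it is strictly above the cutoff.
    return [(a, b, w) for a, b, w in pairs if w / max_val > cutoff]


# Toy connection counts, for illustration only.
matrix = {
    "A": {"A": 12, "B": 10, "C": 1},
    "B": {"A": 10, "B": 11, "C": 4},
    "C": {"A": 1, "B": 4, "C": 5},
}
print(edges_above_cutoff(matrix, cutoff=0.1))
# [('A', 'B', 10), ('B', 'C', 4)]
```

Here the A-C connection (1 out of a maximum of 10) sits exactly at the cutoff and is hidden, which keeps weak links from cluttering the network.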
- plotting.plot_heatmap(df, groups, labs, feats, diverging=False, vmin=None, vmax=None, clab='log$_2$(TPM+1)', save_file=None)
A heatmap for groups and features (genes).
- Parameters
df (pandas dataframe) – count values matrix containing the information to plot.
groups (list of int or strings) – the groups to be plotted (as rows).
labs (pandas dataframe) – one-hot-encoded class membership dataframe with samples as rows and classes as columns.
feats (list of strings) – the features to plot.
diverging (bool) – if True, use a diverging palette, useful if the range to plot is not strictly positive (default False).
vmin (float) – minimum boundary for the color bar, if None it will be inferred from the data (default None).
vmax (float) – maximum boundary for the color bar, if None it will be inferred from the data (default None).
clab (string) – color bar label (default ‘log$_2$(TPM+1)’).
save_file (string) – path and name to png file where plot will be saved. If None, save in the current folder (default None).
- plotting.plot_survival(curves, xlab='years', ylab='OST', palette=None, save_file=None)
Plot survival curves.
- Parameters
curves (pandas dataframe) – dataframe containing survival values by group (columns) and time (index). The format should correspond to the output of st_curves.
xlab (string) – x-axis label (time, default ‘years’).
ylab (string) – y-axis label (survival counts, default ‘OST’).
palette (list of strings) – list of colours for the curves to plot. If None, use the default 10 colors matplotlib palette (default None).
save_file (string) – path and name to png file where plot will be saved. If None, save in the current folder (default None).
Utils
Auxiliary functions for Transcriptional Analysis with Python Imported from R (TAPIR) F. Comitani @2021
- auxiliary.invert_dict(dictio)
Invert a dictionary whose values are lists and are not necessarily unique.
- Parameters
dictio (dictionary) – the dictionary to invert.
- Returns
a dictionary where the old values are now keys and the old keys are now in the value lists.
- Return type
inv_dict (dictionary)
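The inversion of a dictionary of lists can be sketched as follows. The function name invert and the toy data are illustrative; the actual auxiliary.invert_dict may differ in details such as the ordering of the resulting lists.

```python
def invert(dictio):
    """Map each list element back to the keys whose lists contain it."""
    inv_dict = {}
    for key, values in dictio.items():
        for value in values:
            # A value can appear under several keys, so collect them all.
            inv_dict.setdefault(value, []).append(key)
    return inv_dict


# Toy gene sets, for illustration only.
sets_dict = {"SET_1": ["TP53", "MYC"], "SET_2": ["TP53"]}
print(invert(sets_dict))
# {'TP53': ['SET_1', 'SET_2'], 'MYC': ['SET_1']}
```

Applied to a gene sets dictionary, this turns a set-to-genes mapping into a gene-to-sets mapping.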
- auxiliary.smart_selection(labs, to_select, how='any', val=1)
Simplify selection of a single or multiple groups in a pandas dataframe.
- Parameters
labs (pandas dataframe) – one-hot-encoded class membership dataframe with samples as rows and classes as columns.
to_select (list of int or strings) – the groups to be selected.
how (string) – selection type, samples are chosen if they belong to ‘any’ or ‘all’ classes (default ‘any’).
val (int) – selection value, samples are chosen if their value in the class membership dataframe corresponds to this (default 1).
- Returns
boolean series with the corresponding selection.
- Return type
(pandas series)
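The ‘any’ versus ‘all’ selection logic can be sketched without pandas, using a dictionary of rows in place of the one-hot-encoded dataframe. The function name select_samples and the input layout are assumptions made for illustration.

```python
def select_samples(labs, to_select, how="any", val=1):
    """Return {sample: bool} marking samples that match the selection.

    labs maps each sample name to a dict of class -> membership value.
    """
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    combine = any if how == "any" else all
    return {
        sample: combine(row[cls] == val for cls in to_select)
        for sample, row in labs.items()
    }


# Toy one-hot membership table, for illustration only.
labs = {
    "s1": {"c0": 1, "c1": 0},
    "s2": {"c0": 1, "c1": 1},
    "s3": {"c0": 0, "c1": 0},
}
print(select_samples(labs, ["c0", "c1"], how="any"))
# {'s1': True, 's2': True, 's3': False}
print(select_samples(labs, ["c0", "c1"], how="all"))
# {'s1': False, 's2': True, 's3': False}
```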