Statistical Tests

The multicompare function runs repeated statistical tests across features (e.g. genes). It requires a list of groups to be compared and a one-hot-encoded matrix of group membership. If only two groups are provided, a Mann-Whitney U test [Mann1947] is run; if more are provided, a Kruskal-Wallis H test [Kruskal1952] is used instead. A dataframe containing the values to compare (e.g. TMM-normalized counts obtained with edger.build_dgelist) must also be provided.

The output genes can be filtered with a p-value cutoff (cutoff), while further options such as the correction method (multi_method) and the significance level \({\alpha}\) (multi_alpha) can be provided for the multiple testing correction.

from tapir.stats import multicompare

stats, dunn = multicompare(groups, membership, data,
      cutoff=1, multi_method='fdr_tsbh', multi_alpha=0.05)

The output is a dataframe with median values, ratio, difference and p-values for each provided gene. If the Kruskal-Wallis test was used, the results of the Dunn post-hoc test [Dunn1961] are also provided.
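To make the test selection and the multiple testing correction more concrete, the following is a minimal sketch written directly with scipy and numpy. It is not TAPIR's implementation: compare_groups and benjamini_hochberg are hypothetical helper names, the data is assumed to have samples as rows and genes as columns, and the correction shown is the plain Benjamini-Hochberg procedure rather than the two-stage fdr_tsbh variant used in the call above.

```python
import numpy as np
import pandas as pd
from scipy import stats as sps


def compare_groups(groups, membership, data):
    """Per-gene Mann-Whitney U (two groups) or Kruskal-Wallis H (three or more)."""
    rows = {}
    for gene in data.columns:
        # Values of this gene for the samples belonging to each group.
        values = [data.loc[membership[g] == 1, gene].to_numpy() for g in groups]
        if len(groups) == 2:
            _, pval = sps.mannwhitneyu(*values, alternative="two-sided")
        else:
            _, pval = sps.kruskal(*values)
        row = {f"median_{g}": np.median(v) for g, v in zip(groups, values)}
        row["pval"] = pval
        rows[gene] = row
    return pd.DataFrame.from_dict(rows, orient="index")


def benjamini_hochberg(pvals):
    """Plain Benjamini-Hochberg adjusted p-values (not the two-stage variant)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


# Toy data (hypothetical): 12 samples in three groups, two genes.
samples = [f"s{i}" for i in range(12)]
labels = pd.Series(["A"] * 4 + ["B"] * 4 + ["C"] * 4, index=samples)
membership = pd.get_dummies(labels).astype(int)  # one-hot: samples x groups
data = pd.DataFrame(
    {"g1": [1, 2, 3, 4, 2.5, 3.5, 4.5, 5.5, 10, 11, 12, 13],  # group C shifted
     "g2": [1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12]},          # groups interleaved
    index=samples,
)

result = compare_groups(["A", "B", "C"], membership, data)
result["padj"] = benjamini_hochberg(result["pval"])
pair = compare_groups(["A", "C"], membership, data)  # two groups: Mann-Whitney U
```

Here g1 separates group C from the others and receives a small p-value, while g2 does not. Passing only two groups switches the same helper to the Mann-Whitney U test.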

Contingency tables

Contingency tables can be built and related tests can be run with TAPIR.

from tapir.stats import get_contingency, test_contingency

contab = get_contingency(series, groups, membership)
stats  = test_contingency(contab, method='auto')

The get_contingency function requires a series containing the variable to evaluate (e.g. sex, therapy status, mutation), the groups to compare and the one-hot-encoded membership table. The significance of the resulting table can then be assessed with test_contingency. With method='auto', this selects Fisher's exact test [Fisher1992] if a 2x2 table is provided and a \({\chi^2}\) test [Pearson1900] otherwise, but either method can also be chosen manually. If the \({\chi^2}\) test is selected and the populations are insufficient, the function will issue a warning and return None.
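The logic described above can be sketched with pandas and scipy as follows. This is an illustration under stated assumptions, not TAPIR's code: contingency_sketch and auto_significance are hypothetical names, and the rule used for "insufficient populations" (any expected count below 5) is a common rule of thumb assumed here, not necessarily the criterion TAPIR applies.

```python
import warnings

import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact


def contingency_sketch(series, groups, membership):
    """Count the categories of `series` within each group (categories x groups)."""
    counts = {g: series[membership[g] == 1].value_counts() for g in groups}
    return pd.DataFrame(counts).fillna(0).astype(int)


def auto_significance(table, min_expected=5):
    """Fisher's exact test for 2x2 tables, chi-squared test otherwise."""
    if table.shape == (2, 2):
        _, pval = fisher_exact(table.to_numpy())
        return "fisher", pval
    _, pval, _, expected = chi2_contingency(table.to_numpy())
    if expected.min() < min_expected:
        # Assumed rule of thumb: expected counts below 5 are too small.
        warnings.warn("Expected counts too small for a chi-squared test.")
        return "chi2", None
    return "chi2", pval


# Toy data (hypothetical): eight samples, two groups, one categorical variable.
samples = [f"s{i}" for i in range(8)]
sex = pd.Series(["M", "M", "M", "F", "F", "F", "F", "M"], index=samples)
labels = pd.Series(["A"] * 4 + ["B"] * 4, index=samples)
membership = pd.get_dummies(labels).astype(int)

table = contingency_sketch(sex, ["A", "B"], membership)
method, pval = auto_significance(table)  # 2x2 table, so Fisher's exact test
```

For a table with more than two rows or columns the same helper falls back to the \({\chi^2}\) test, returning None for the p-value when the expected counts are too low.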

Survival lines

Survival analysis is available through lifelines. For now, only Kaplan-Meier fitted curves and the log-rank test are available.

The survival data (st_stats) needs to be formatted as a pandas dataframe with the time value in survival_times and a binary death event observation in event_observed.

As for the other functions, a list of groups to be compared needs to be provided, together with a one-hot-encoded matrix of group membership.
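As an illustration of the expected inputs, the following sketch builds a small survival table and a matching membership matrix with pandas. The sample names, values and group labels are invented, and indexing both tables by sample name is an assumption made here so that the rows can be aligned.

```python
import pandas as pd

samples = ["s1", "s2", "s3", "s4", "s5", "s6"]

# Hypothetical follow-up data: time in years and whether the death
# event was observed (1) or the sample was censored (0).
st_stats = pd.DataFrame(
    {"survival_times": [1.2, 3.4, 0.8, 5.0, 2.1, 4.3],
     "event_observed": [1, 0, 1, 0, 1, 0]},
    index=samples,
)

# One label per sample, turned into a one-hot membership matrix
# with samples as rows and groups as columns.
labels = pd.Series(["high", "high", "high", "low", "low", "low"], index=samples)
membership = pd.get_dummies(labels).astype(int)
groups = ["high", "low"]
```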

from tapir.stats import st_curves
from tapir.plotting import plot_survival

stats, curves = st_curves(st_stats, groups, membership)
plot_survival(curves, xlab='Years', ylab='OST', save_file='./plot.png')

The resulting p-value can be found in stats, while curves can be plotted with plot_survival.

_images/surv.png
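For readers unfamiliar with the Kaplan-Meier estimator behind these curves, here is a minimal hand-written version in numpy. It is only a sketch of the estimator itself, with a hypothetical function name; TAPIR relies on lifelines for the actual fitting, and this code does not reproduce the lifelines API.

```python
import numpy as np


def kaplan_meier(times, events):
    """Return the event times and the Kaplan-Meier survival estimate at each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                    # still under observation at t
        deaths = np.sum((times == t) & (events == 1))   # events observed exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Toy data (hypothetical): five samples, two of them censored.
km_times, km_surv = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
```

Censored samples (event 0) never lower the curve directly, but they leave the at-risk set once their follow-up ends, which is why the later steps are larger.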

Dimensionality reduction

TAPIR provides a quick interface for dimensionality reduction with umap and for plotting its results. get_umap only requires the data to be mapped (e.g. expression counts), with samples as rows and features as columns. The var_drop_thresh cutoff can be provided to remove low-variance features: the most variable genes whose variances sum up to the given threshold (e.g. .99 of the total) will be kept. Alternatively, collinear features can be removed by providing a correlation threshold (collinear_thresh), although the current implementation is particularly slow and not recommended. UMAP is run with preselected settings, but these can be adjusted by providing the appropriate UMAP object keywords.

from tapir.embedding import get_umap
from tapir.plotting import plot_clusters

proj, mappa = get_umap(data, collinear_thresh=None, var_drop_thresh=.99)
proj.index  = data.index

plot_clusters(proj, groups=None, values=data['MYCN'], clab='log$_2$(TPM+1)', save_file='./map.png')

Continuous values can be provided to colour the datapoints with a colormap when plotting. Alternatively, if a list of groups is provided, the datapoints will be coloured according to their group.

_images/map.png
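The low-variance filter controlled by var_drop_thresh can be pictured with the following pandas sketch. It reflects one reading of the description above (rank genes by variance and keep the top ones until their cumulative share of the total variance reaches the threshold); the helper name keep_top_variance is hypothetical and TAPIR's actual implementation may differ in its details.

```python
import numpy as np
import pandas as pd


def keep_top_variance(data, var_drop_thresh=0.99):
    """Keep the most variable columns whose variances sum up to the threshold."""
    variances = data.var(axis=0).sort_values(ascending=False)
    cumulative = (variances.cumsum() / variances.sum()).to_numpy()
    # Index of the first gene at which the cumulative share reaches the threshold.
    n_keep = int(np.searchsorted(cumulative, var_drop_thresh)) + 1
    n_keep = min(n_keep, len(variances))
    return data[variances.index[:n_keep]]


# Toy data (hypothetical): samples as rows, genes as columns.
toy = pd.DataFrame(
    {"a": [0.0, 4.0, 0.0, 4.0],   # high variance
     "b": [0.0, 2.0, 0.0, 2.0],   # moderate variance
     "c": [0.0, 0.2, 0.0, 0.2]},  # almost constant
    index=["s1", "s2", "s3", "s4"],
)
filtered = keep_top_variance(toy, var_drop_thresh=0.99)
```

With this toy table the almost constant gene c is dropped at a threshold of .99, since a and b already account for more than 99% of the total variance.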

Other plots

The expression values or gene set enrichment scores can be plotted as distributions using plot_distribution. The groups and the membership table need to be provided. This function can plot on one level (genes_up) or on two levels (if genes_dw is also provided) for easy comparison.

from tapir.plotting import plot_distribution

plot_distribution(data, groups, membership,
  genes_up, genes_dw,
  save_file='./distribution.png')
_images/dist.png

Similarly, the median values can be plotted as a heatmap with plot_heatmap:

from tapir.plotting import plot_heatmap

plot_heatmap(data, groups, membership, genes,
  clab='log$_2$(TPM+1)',
  save_file='./heatmap.png')
_images/heatmap.png
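The per-group medians shown in such a heatmap can be computed by hand as in the sketch below. This only illustrates the aggregation step, assuming samples as rows and genes as columns; group_medians is a hypothetical helper and not part of TAPIR.

```python
import pandas as pd


def group_medians(data, groups, membership, genes):
    """Median of each gene within each group (genes as rows, groups as columns)."""
    medians = {g: data.loc[membership[g] == 1, genes].median(axis=0) for g in groups}
    return pd.DataFrame(medians)


# Toy data (hypothetical): four samples, two groups, two genes.
samples = ["s1", "s2", "s3", "s4"]
expr = pd.DataFrame({"MYCN": [1.0, 3.0, 10.0, 20.0],
                     "ALK": [2.0, 2.0, 4.0, 8.0]}, index=samples)
labels = pd.Series(["A", "A", "B", "B"], index=samples)
one_hot = pd.get_dummies(labels).astype(int)

med = group_medians(expr, ["A", "B"], one_hot, ["MYCN", "ALK"])
```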

Labels and color map range can be customized to a degree. For the full list of available options and their use, see API.

References

Mann1947

Mann, H. B., Whitney, D. R. (1947). “On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other”, Annals of Mathematical Statistics. 18 (1): 50–60.

Kruskal1952

Kruskal, W. H., Wallis, W. A. (1952). “Use of ranks in one-criterion variance analysis”, Journal of the American Statistical Association. 47 (260): 583–621.

Dunn1961

Dunn, O. J. (1961). “Multiple Comparisons among Means”, Journal of the American Statistical Association. 56 (293): 52–64.

Fisher1992

Fisher, R. A. (1992). “Statistical Methods for Research Workers”, In: Kotz, S., Johnson, N. L. (eds) “Breakthroughs in Statistics”. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY.

Pearson1900

Pearson, K. (1900). “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling”, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175.