Multiple Hypothesis Testing with Persistent Homology

Mikael Vejdemo-Johansson and Sayan Mukherjee

5 June 2022

Multiple Hypothesis Testing with Persistent Homology

Our contributions:

  1. New method for testing whether a persistence diagram is consistent with a null model
  2. New method for controlling Family-Wise Error Rate in persistent homology
  3. New method for controlling False Discovery Rate in persistent homology

Multiple Hypothesis Testing with Persistent Homology

This talk:

  1. Survey hypothesis testing in persistent homology
  2. Introduce concepts from theoretical statistics
  3. Testing for consistency with a null model
  4. Multiple testing for consistency with a null model
  5. Power and level estimates
  6. False discovery rate control
  7. FDR control for Robinson-Turner style testing

Hypothesis Testing in Persistent Homology

Question Does this persistence diagram show any structure?

Question Which of these persistence diagrams show structure?

Question Are these persistence diagrams different from those persistence diagrams?

Hypothesis Testing in Persistent Homology

Bubenik
Persistence Landscapes

Each persistence diagram converts to a function \(\mathbb{R}\times\mathbb{N}\to\mathbb{R}\).

Pointwise, the central limit theorem holds. Families of persistence diagrams can therefore be analyzed and averaged using classical statistical methods and pointwise asymptotically normal distributions.

Hypothesis Testing in Persistent Homology

Fasy, Lecci, Rinaldo, Wasserman, Balakrishnan, Singh
Stability and subsampling

Confidence regions for persistence diagrams can be generated by:

  1. Subsample the data and measure the Hausdorff distance from each subsample to the full dataset, yielding a distribution function of distances
  2. Pick cutoff \(\epsilon_\alpha\) from distribution function
  3. Stability theorem: Hausdorff distance bounds bottleneck distance
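The subsampling step above can be sketched in a few lines of numpy. The function names, subsample size, and repetition count here are our own illustrative choices, not the authors' implementation; the cutoff \(\epsilon_\alpha\) is the \((1-\alpha)\) quantile of the distance distribution, which the stability theorem transfers to a bottleneck-distance confidence region.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point clouds A and B (rows = points)."""
    D = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def subsample_cutoff(X, alpha=0.05, n_sub=None, n_rep=200, seed=None):
    """Estimate the (1 - alpha) quantile of Hausdorff distances between
    random subsamples of X and the full data set."""
    rng = np.random.default_rng(seed)
    n_sub = n_sub or len(X) // 2
    dists = [hausdorff(X[rng.choice(len(X), n_sub, replace=False)], X)
             for _ in range(n_rep)]
    return np.quantile(dists, 1 - alpha)

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))          # toy point cloud in the unit square
eps_alpha = subsample_cutoff(X, alpha=0.05, seed=1)
```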

Hypothesis Testing in Persistent Homology

Fasy, Lecci, Rinaldo, Wasserman, Balakrishnan, Singh
Stability and distance to measure

Confidence regions for persistence diagrams can be generated by:

  1. Compare \(L_\infty\) distances between empirical distance to measure functions from the data with those from a bootstrap sample
  2. Pick cutoff \(\epsilon_\alpha\) from distribution function
  3. Stability theorem: \(L_\infty\) bounds bottleneck distance

Hypothesis Testing in Persistent Homology

Robinson and Turner
Two-sample testing

To compare point clouds \(X_1,\dots,X_n\) to point clouds \(Y_1,\dots,Y_m\):

  1. Compute a bottleneck distance matrix for all the point clouds
  2. Use an in-group distance cost function
  3. Compare cost of observed partition with costs of permuted partitions

Hypothesis Testing in Persistent Homology

Cericola, Johnson, Kiers, Krock, Purdy, Torrence
Multiple sample (ANOVA style) testing

Extend the Robinson-Turner cost function to measure differences between three or more groups of persistence diagrams

If such differences do exist, we then propose using a number of post-hoc (i.e. two-sample) permutation tests to identify the specific pairwise differences.

Error Types

             Null Accepted   Null Rejected
Null True        \(U\)           \(V\)
Null False       \(T\)           \(S\)

\(U\) and \(S\) count correct behaviors.

\(V\) counts false discoveries (false positives, Type I errors)

\(T\) counts missed discoveries (false negatives, Type II errors)

\(\alpha = V/N\) – probability (rate) of false discoveries, where \(N = U+V+T+S\) is the total number of tests

\(\beta = T/N\) – probability of missed discoveries

The Hazard of Repeated Testing

Suppose \(k\) different post-hoc tests are performed at level \(\alpha\). We should expect on average \(\alpha\cdot k\) of these to come out significant even when every null hypothesis is true. This number of false positives can become dramatically large.

To achieve the chosen level \(\alpha\) overall, the multiple tests need to be modified to not inflate the discoveries.
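The inflation is easy to quantify: with \(k\) independent tests the chance of at least one false positive under the global null is \(1-(1-\alpha)^k\). A quick arithmetic check with illustrative values:

```python
# With k independent tests, each at level alpha, the probability of at
# least one false positive under the global null is 1 - (1 - alpha)**k.
alpha, k = 0.05, 20
expected_false_positives = alpha * k      # on average one spurious discovery
fwer = 1 - (1 - alpha) ** k               # ~ 0.64, far above the nominal 0.05
```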

Multiple Testing Correction

Bonferroni To achieve overall level \(\alpha\) across \(k\) tests, replace the threshold for each individual test by \(\alpha/k\) (or, equivalently, replace each p-value \(p\) with \(p\cdot k\)).

Bonferroni is known to be conservative – because multiple tests can be simultaneously significant.

Other methods (Holm, Hochberg) exist that rank p-values and then test against stepwise thresholds \(\alpha/(k-m+1)\) as the rank \(m\) varies.
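A small sketch of both corrections, rejecting for small p-values. The helper names `bonferroni_reject` and `holm_reject` are our own; Holm compares the \(m\)-th smallest p-value to \(\alpha/(k-m+1)\) and stops at the first failure, which never rejects fewer hypotheses than Bonferroni.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject every hypothesis whose p-value is at most alpha / k."""
    p = np.asarray(pvals)
    return p <= alpha / len(p)

def holm_reject(pvals, alpha=0.05):
    """Holm's step-down procedure: visit p-values in increasing order,
    compare the m-th smallest to alpha / (k - m + 1), stop at first failure."""
    p = np.asarray(pvals)
    k = len(p)
    reject = np.zeros(k, dtype=bool)
    for m, idx in enumerate(np.argsort(p)):   # m = 0 corresponds to rank 1
        if p[idx] <= alpha / (k - m):
            reject[idx] = True
        else:
            break
    return reject

pvals = [0.001, 0.012, 0.03, 0.2]
bonf = bonferroni_reject(pvals)   # threshold alpha/k = 0.0125
holm = holm_reject(pvals)
```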

Multiple Testing Correction

In all of these, \(\alpha\) is replaced by \(\alpha/k\) (with variations).

In permutation- and simulation testing contexts, this lowered threshold can dramatically increase computational load.

Hypothesis Testing

Basic task: choose between two competing hypotheses \(H_0\) and \(H_1\).

Basic method: pick a statistic \(T\) and define a critical region \(S\). Reject \(H_0\) when \(T\in S\), and accept \(H_0\) when \(T\not\in S\).

Level of the test is chosen by picking \(S\) so that
\(\mathbb{P}\)(reject \(H_0\) erroneously) \(=\alpha\).

Commonly: \(S = \{T : T > T_0\}\).

  • Hausdorff or \(L_\infty\) distances (Fasy et al)
  • Loss function (Robinson-Turner; Cericola et al)

Testing for Acyclicity

Acyclicity: trivial \(H_*\)

Rejecting acyclicity = this persistence diagram demonstrates structure

Fasy et al: check if confidence regions intersect the diagonal

Since the confidence regions are produced from the stability theorem, they are not tight.

Alternative plan, extensible to other applications: use a null model for point clouds.

Explicit null model

  • Bobrowski - Kahle - Skraba: uniformly distributed points in a region have short persistence bars
  • Spatial statistics tradition: uniformly distributed points in a region models Complete Spatial Randomness

We propose as a null model for acyclicity:
uniform point clouds

Simulation Testing vs. Null Model

Given: potentially acyclic point cloud \(X\)

Option:

  1. Simulate \(N-1\) null point clouds \(Y_*\)
  2. Compute all distances
  3. Robinson-Turner testing of \(\{X\}\) vs \(\{Y_*\}\).

Requires \(N(N-1)/2\) bottleneck distance computations

Simulation Testing vs. Null Model

Given: potentially acyclic point cloud \(X\)

Alternative:

Pick a test statistic that measures acyclicity of a single point cloud.

  1. Simulate \(N-1\) null point clouds \(Y_*\)
  2. Compute \(T_X\) and all \(T_{Y_j}\)
  3. Rank of \(T_X\) among the \(T_{Y_j}\).

Attractive statistic: the bottleneck norm – the bottleneck distance to the empty diagram (i.e. the length of the longest bar)

Requires no bottleneck distance computations.
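The rank-based simulation test can be sketched as follows. The statistic used here (largest nearest-neighbour gap) is only a runnable stand-in; in the talk's setting it would be the bottleneck norm, i.e. the longest persistence bar, computed with a persistent-homology library such as Ripser. The function names are our own.

```python
import numpy as np

def stand_in_statistic(X):
    """Placeholder for the bottleneck norm of the persistence diagram of X:
    here, the largest nearest-neighbour distance in the point cloud."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    return D.min(axis=1).max()

def simulation_test(X, null_sampler, statistic, N=100, seed=None):
    """p-value = rank of T_X among N-1 null draws (Monte Carlo test).
    Rejecting for large T; the +1 keeps the p-value valid for finite N."""
    rng = np.random.default_rng(seed)
    t_x = statistic(X)
    t_null = np.array([statistic(null_sampler(rng)) for _ in range(N - 1)])
    return (1 + np.sum(t_null >= t_x)) / N

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))                  # a null-like point cloud
sampler = lambda g: g.uniform(size=(50, 2))    # the uniform null model
p = simulation_test(X, sampler, stand_in_statistic, N=100, seed=1)
```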

Sidenote on bottleneck norm

Because bottleneck distance is a metric, \[ \|X\|_{\mathfrak{B}} + \|Y\|_{\mathfrak{B}} \geq d_{\mathfrak{B}}(X,Y) \]

In our work, we have not addressed any questions of stability for the bottleneck norm.

Multiple Testing and Null Model - Motivation

Motivating Question: Given a Mapper cover, is the cover good in the sense of the Nerve Lemma?

If using persistence hypothesis testing, each simplex in the Mapper complex requires a separate test. To reject the hypothesis of a good cover, it is enough for a single test to reject the null model.

This is the exact setup for Family-Wise Error Rate correction (i.e. Bonferroni, Holm, Hochberg, …):

Reduce \(\mathbb{P}\)(rejecting any one hypothesis in error) to \(\alpha\).

Cutting out \(\alpha/k\): standardization

Idea:

  1. For each \(X_i\), simulate \(N-1\) null point clouds \(Y^j_i\).
  2. Compute persistent homology and a statistic \(T_{X_i}, T_{Y^j_i}\) for each.
  3. Use \(T_{Y^*_i}\) to estimate mean \(\mu_i\) and standard deviation \(s_i\), and then standardize: \[ x_i = \frac{T_{X_i}-\mu_i}{s_i} \qquad y_i^j = \frac{T_{Y_i^j}-\mu_i}{s_i} \]

Cutting out \(\alpha/k\): standardization

Next, treat each “column” (one standardized null value per row) as a single joint draw from the null model.

  • If the standardized statistics follow the same sampling distributions, and
  • If we reject the null for large values of \(T\), and
  • If we want to control the family-wise error rate

Then take column maxima, and treat as a simulation test: evaluate the rank of \(\max_i x_i\) among all the \(\max_i y_i^j\).

Why would this work?

Is it actually the case that the sampling distributions of the standardized statistics are equal across different densities and shapes of uniformly sampled points?

Sufficient: check that sampling distributions of the un-standardized statistics are equal up to a linear transformation.

Multiple Testing vs Null Model

The resulting test procedure is:
given point clouds \(X_1,\dots,X_m\),

  1. Sample \(Y_1^1,\dots,Y_1^{N-1},Y_2^1,\dots,Y_m^{N-1}\) from the null model, fitting density and shape for each \(Y_i^j\) to match \(X_i\).
  2. Compute persistent homology and test statistics \(T_{X_i}, T_{Y_i^j}\)
  3. Estimate row-wise mean and standard deviations from \(Y_i^*\)
  4. Standardize each row to produce \(x_i, y_i^j\)
  5. Compute column-wise \(x = \max_i x_i\) and \(y^j = \max_i y_i^j\)
  6. The rank of \(x\) among the \(y^j\) provides p-value
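Steps 3–6 of the procedure above can be sketched as follows, with toy normal statistics standing in for the persistent-homology computations of steps 1–2 (all names are our own illustration):

```python
import numpy as np

def fwer_max_test(T_obs, T_null):
    """Row-wise standardization, column maxima, and the Monte Carlo
    p-value of the observed maximum.

    T_obs  : shape (m,)      statistics of the observed point clouds
    T_null : shape (m, N-1)  statistics of the matched null simulations
    """
    mu = T_null.mean(axis=1, keepdims=True)        # row-wise null mean
    s = T_null.std(axis=1, ddof=1, keepdims=True)  # row-wise null std dev
    x = (T_obs[:, None] - mu) / s                  # standardized observations
    y = (T_null - mu) / s                          # standardized null draws
    x_max = x.max()                                # observed column maximum
    y_max = y.max(axis=0)                          # one maximum per null column
    N = T_null.shape[1] + 1
    return (1 + np.sum(y_max >= x_max)) / N        # rank-based p-value

rng = np.random.default_rng(0)
T_null = rng.normal(size=(5, 99))   # toy statistics: 5 rows x 99 null draws
T_obs = rng.normal(size=5)
p = fwer_max_test(T_obs, T_null)
```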

False Discovery Rate Control

False Discovery Rate

Suppose we are performing \(k\) hypothesis tests.

The Family-Wise Error Rate is the probability that at least one false discovery is made among these tests. Controlling the FWER is often too harsh a requirement: we know that false discoveries will happen among repeated tests.

More desirable: bounding how many of the discoveries are false.

Write \(q_{FDR} = \mathbb{E}[V/R]\) (with the convention \(V/R = 0\) when \(R = 0\)), where as above

  • \(V\) is the number of false discoveries
  • \(R\) is the total number of discoveries

False Discovery Rate Control

Since \[ \frac{V}{R} = \frac{V/N}{R/N} = \frac{\% V}{\% R} \] we can estimate \(q_{FDR}\) by estimating the rate of discoveries and the rate of false discoveries separately.

We can use the test statistics of the null model draws to estimate \(\% V\), and the test statistics from the observations to estimate \(\% R\).

False Discovery Rate Control

Given a family \(X_1,\dots,X_k\) of point clouds, and a null model \(\mathcal{M}\):

  1. Draw null model point clouds, and standardize everything - just like in the previous algorithm.
  2. For each critical cutoff \(c_i = x_{i}\), calculate \[ \%V(c_i) = \frac{\#\{(l,j) : y_l^j \geq c_i\}}{k(N-1)} \qquad \%R(c_i) = \frac{\#\{l : x_l\geq c_i\}}{k} \]
  3. With \(\hat q_{FDR}(c_i) = {\%V(c_i)}\,/\,{\%R(c_i)}\), pick the smallest \(c_i\) such that \(\hat q_{FDR}(c_i)\leq\alpha\).
  4. Reject all null hypotheses with \(x_l\geq c_i\).

\(\min_{c_i} \hat q_{FDR}(c_i)\) is the smallest achievable FDR.
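The cutoff selection in steps 2–4 can be sketched directly, again with toy standardized statistics in place of the persistent-homology pipeline (the function name and demo data are our own):

```python
import numpy as np

def fdr_cutoff(x, y, alpha=0.05):
    """Pick the smallest observed cutoff c with estimated FDR <= alpha,
    rejecting rows whose standardized statistic is >= c.

    x : shape (k,)      standardized observed statistics
    y : shape (k, N-1)  standardized null statistics, row-matched to x
    """
    k, n_null = y.shape

    def q_hat(c):
        pct_V = np.sum(y >= c) / (k * n_null)   # estimated false-discovery rate
        pct_R = np.sum(x >= c) / k              # observed discovery rate
        return pct_V / pct_R if pct_R > 0 else np.inf

    admissible = [c for c in x if q_hat(c) <= alpha]
    if not admissible:
        return None, np.zeros(k, dtype=bool)
    c = min(admissible)       # smallest admissible cutoff -> most rejections
    return c, x >= c

rng = np.random.default_rng(0)
y = rng.normal(size=(20, 99))   # toy standardized null statistics
x = rng.normal(size=20)
x[:5] += 4                      # five rows with genuine signal
c, reject = fdr_cutoff(x, y, alpha=0.05)
```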

False Discovery Rate Control

The same approach can be used to control the FDR in Robinson-Turner style two-sample testing.

Here, the permuted costs form the basis to estimate \(\%V\), and the observed costs estimate \(\%R\).

To write out the process in detail we will need fairly many indices:

  • Given: \(h\) pairs of groups of persistence diagrams \(\mathcal{D}_1^k = \{D_{1,i_1}^k\}\) and \(\mathcal{D}_2^k = \{D_{2,i_2}^k\}\), where \(1\leq k\leq h\), \(1\leq i_1\leq N_1\) and \(1\leq i_2\leq N_2\).
  • To be tested: \(H_k:\) the groups \(\mathcal{D}_1^k\) and \(\mathcal{D}_2^k\) come from the same population.

False Discovery Rate Control

  1. Calculate \(x_k\), the cost of the group pair \(\mathcal{D}_1^k, \mathcal{D}_2^k\)
  2. Calculate \(y_k^j\), \(N\) costs of permuted group pairings
  3. Estimate \[ \%V(x_{k}) = \frac{\#\{(l,j) : y_l^j\leq x_k\}}{h\cdot N} \qquad \%R(x_{k}) = \frac{\#\{l : x_l\leq x_k\}}{h} \]
  4. With \(\hat q_{FDR}(x_k) = {\%V(x_k)}\,/\,{\%R(x_k)}\), pick the largest \(c = x_k\) such that \(\hat q_{FDR}(x_k)\leq q\).
  5. Reject all null hypotheses with \(x_l\leq c\).

\(\min_{x_k} \hat q_{FDR}(x_k)\) is the smallest achievable FDR.
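The lower-tail direction flips the inequalities: small observed costs are the significant ones, so we count permuted costs below the cutoff and pick the largest admissible cutoff. A sketch with toy costs (names and demo data are our own):

```python
import numpy as np

def fdr_cutoff_lower(x, y, q=0.05):
    """Robinson-Turner direction: reject for x <= c and pick the LARGEST
    admissible cutoff.

    x : shape (h,)   observed pairing costs
    y : shape (h, N) permuted pairing costs, row-matched to x
    """
    h, n_perm = y.shape

    def q_hat(c):
        pct_V = np.sum(y <= c) / (h * n_perm)   # estimated false-discovery rate
        pct_R = np.sum(x <= c) / h              # observed discovery rate
        return pct_V / pct_R if pct_R > 0 else np.inf

    admissible = [c for c in x if q_hat(c) <= q]
    if not admissible:
        return None, np.zeros(h, dtype=bool)
    c = max(admissible)        # largest admissible cutoff -> most rejections
    return c, x <= c

rng = np.random.default_rng(0)
y = rng.normal(size=(15, 99))   # toy permuted costs
x = rng.normal(size=15)
x[:4] -= 4                      # four pairs of groups that genuinely differ
c, reject = fdr_cutoff_lower(x, y, q=0.05)
```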

Well, does it work?

False Discovery Rate Control

The FDR control methods described here work by construction: they use null model simulations (or permutations) to estimate \(V\), and estimate \(R\) directly from the observed data.

Power and level estimates

Based on simulated uniformly distributed point clouds in the plane, with different point counts and different bounding boxes, we estimate the achieved level of the multiple testing vs null model method.

To estimate power, we use points on unit circles in the plane with varying degrees of Gaussian noise, parametrized by the standard deviation \(\sigma\).

Power and level estimates for Testing vs null model

Attempted Level 0.01 0.05 0.10
 
Estimated Level (\(\|\cdot\|_{\mathfrak{B}}\)) 0.01 0.07 0.12
Estimated Level (\(\log\|\cdot\|_{\mathfrak{B}}\)) 0.01 0.07 0.12
 
Estimated Power \(\sigma=0.1\), \(\|\cdot\|_{\mathfrak{B}}\) 0.94 0.95 0.96
Estimated Power \(\sigma=0.1\), \(\log\|\cdot\|_{\mathfrak{B}}\) 0.94 0.95 0.96
 
Estimated Power \(\sigma=0.25\), \(\|\cdot\|_{\mathfrak{B}}\) 0.51 0.64 0.72
Estimated Power \(\sigma=0.25\), \(\log\|\cdot\|_{\mathfrak{B}}\) 0.51 0.64 0.72

Power and level estimates for Multiple testing vs null model

Attempted Level 0.01 0.05 0.10
 
Estimated Level (\(\|\cdot\|_{\mathfrak{B}}\)) 0.04 0.11 0.17
Estimated Level (\(\log\|\cdot\|_{\mathfrak{B}}\)) 0.03 0.07 0.13
 
Estimated Power \(\sigma=0.1\), \(\|\cdot\|_{\mathfrak{B}}\) 0.89 0.95 0.96
Estimated Power \(\sigma=0.1\), \(\log\|\cdot\|_{\mathfrak{B}}\) 0.80 0.83 0.87
 
Estimated Power \(\sigma=0.25\), \(\|\cdot\|_{\mathfrak{B}}\) 0.36 0.52 0.57
Estimated Power \(\sigma=0.25\), \(\log\|\cdot\|_{\mathfrak{B}}\) 0.36 0.41 0.51

Thank you for your attention

We have:

  1. Proposed a null model for testing acyclicity
  2. Proposed a one-sample hypothesis test for testing acyclicity
  3. Proposed FWER and FDR control methods for null model based testing
  4. Proposed FDR control methods for Robinson-Turner two-sample testing (producing a framework for Cericola et al post-hoc tests)

https://arxiv.org/abs/1812.06491 M Vejdemo-Johansson and S Mukherjee: Multiple testing with persistent homology

http://www.math.csi.cuny.edu/~mvj/talks/union-22/Union-22.html Talk slides