WMWprob is identical to the area under the curve (AUC) when using a receiver operating characteristic (ROC) analysis to summarize sensitivity vs. specificity in diagnostic testing in medicine or signal detection in human perception research and other fields. Using the oft-used example of Hanley and McNeil (1982), we cover these analyses and concepts.
1. Introduction
​
This website promotes the use of the Wilcoxon-Mann-Whitney parameter (WMWprob) for comparing measurements from two groups. Unquestionably, the most popular method for comparing two independent samples is the two-sample t-test to compare the group means. The t-test relies on often untenable assumptions (normality, equal variances, interval or ratio level of measurement). A popular nonparametric alternative to the t-test is the Wilcoxon-Mann-Whitney test. This test requires only ordinal level of measurement. It is often presented as comparing group medians, but this is true only by again making an often untenable assumption: that the two groups have identical distributions (not necessarily normal) except for a shift in location. This assumption can be avoided by focusing on the Wilcoxon-Mann-Whitney parameter. Letting Y1 and Y2 represent a randomly chosen measurement from each group, the Wilcoxon-Mann-Whitney parameter is defined as
​
WMWprob = Prob[Y1 > Y2] + (1/2) Prob[Y1 = Y2].
​
WMWprob can be expressed equivalently using odds:
​
WMWodds = WMWprob / (1 - WMWprob).
​
The WMW test is best thought of as testing a hypothesis about the WMW parameter (WMWprob or WMWodds). A one-tailed version is
​
H0: WMWprob <= WMWprob0 versus H1: WMWprob > WMWprob0
​
or equivalently
​
H0: WMWodds <= WMWodds0 versus H1: WMWodds > WMWodds0.
​
A conventional null value is WMWprob0 = 0.5 (WMWodds0 = 1.0) though, in practice, an "unconventional" value is generally more appropriate (see the Examples). More importantly, inference about WMWprob should, in most cases, use confidence intervals rather than (or perhaps in addition to) hypothesis testing.
2. Estimation of WMWprob
​
Estimation proceeds by using relative frequencies to express probabilities. Consider the following simple (and completely artificial) example.
​
Y1: 57, 58, 59, 63, 64
Y2: 49, 53, 58, 63
There are n1 * n2 = 5 * 4 = 20 (Y1, Y2) pairs. Estimation involves counting the number of pairs for which Y1 > Y2 and Y1 = Y2. The estimate of WMWprob (call it eWMWprob) is
​
eWMWprob = [#(Y1 > Y2) + 0.5 #(Y1 = Y2)] / #pairs
where #() counts the number of occurrences.
​
Here is a list of relevant Y1 vs Y2 comparisons:
57 > 49, 57 > 53
58 > 49, 58 > 53, 58 = 58
59 > 49, 59 > 53, 59 > 58
63 > 49, 63 > 53, 63 > 58, 63 = 63
64 > 49, 64 > 53, 64 > 58, 64 > 63
There are 14 pairs where Y1 > Y2 and 2 pairs where Y1 = Y2. That is,
​
eWMWprob = [14 + 0.5*2] / 20 = (15 / 20) = 0.75.
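To make the count concrete, here is a minimal R sketch of the pairwise-comparison estimate (the variable names are ours, chosen for illustration):
y1 <- c(57, 58, 59, 63, 64)
y2 <- c(49, 53, 58, 63)
# outer() forms all n1*n2 = 20 pairwise comparisons at once
wins <- sum(outer(y1, y2, ">"))    # 14 pairs with Y1 > Y2
ties <- sum(outer(y1, y2, "=="))   # 2 pairs with Y1 = Y2
eWMWprob <- (wins + 0.5 * ties) / (length(y1) * length(y2))
eWMWprob                           # 0.75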
The WMW methodology is usually described as being based on ranks. That is, if the data values are replaced by their ranks in the combined samples, the results are the same. R1 and R2 below show the ranks corresponding to the Y1 and Y2 data values above. When two (or more) data values are the same (that is, when they are tied), each data value receives the average of the corresponding ranks.
​
R1: 3, 4.5, 6, 7.5, 9
R2: 1, 2, 4.5, 7.5
Doing the pairwise comparisons with these ranks produces the same eWMWprob of 0.75. More interestingly, eWMWprob can be obtained from the ranks without making the pairwise comparisons. As it turns out,
​
eWMWprob = 0.5 + {[Avg(R1) - Avg(R2)] / N}
where N = n1 + n2 is the total sample size. For the example, Avg(R1) = 6.00 and Avg(R2) = 3.75, so that
eWMWprob = 0.5 + [(6.00 - 3.75) / 9] = 0.75.
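Continuing the R sketch above, the rank formula gives the same answer without any pairwise comparisons:
r <- rank(c(y1, y2))          # midranks in the combined sample (ties averaged)
r1 <- r[seq_along(y1)]        # 3, 4.5, 6, 7.5, 9
r2 <- r[-seq_along(y1)]       # 1, 2, 4.5, 7.5
0.5 + (mean(r1) - mean(r2)) / length(r)   # 0.75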
3. Confidence Intervals for WMWprob
​
We will present three approaches to constructing confidence intervals, though one of them is included for purely expository reasons and is not recommended.
3a. The Wald Interval: The Familiar Approach
The interval is
eWMWprob ± k * StdErr .
Since eWMWprob is asymptotically normally distributed, "k" can be obtained from normal distribution tables or software functions. For a conventional, symmetric, two-sided, 95% confidence interval, "k" is approximately 2 (more precisely, 1.96).
The theoretical formula for the standard error of eWMWprob has been known since 1951 (see Birnbaum and Klose, 1957). Calculating it is another matter. Sen (1967) provided a convenient and accurate approximation, which has been rediscovered by several subsequent investigators. It makes use of quantities called placements, which are simply ranks of a particular kind. For example, the placement for the Y1 value of 57 is its rank in the Y2 sample. Computationally, P1[i], the placement of the i-th element of Y1, is
​
P1[i] = #(Y2 < Y1[i]) + 0.5 #(Y2 = Y1[i]).
​
Since there are two values in Y2 that are less than Y1[1] = 57, and none equal, then P1[1] = 2. Here are all the placements.
​
P1: 2, 2.5, 3, 3.5, 4
P2: 0, 0, 1.5, 3.5
Now let
​
V1 = Var(P1 / n2) = 0.039
V2 = Var(P2 / n1) = 0.110
where Var() is the conventional sample variance with denominator (n-1). (P1 / n2) and (P2 / n1) are called relative placements. Note that eWMWprob = Avg(P1 / n2).
Finally,
StdErr = sqrt[(V1 / n1) + (V2 / n2)] = 0.1879.
The familiar 95% Wald confidence interval is
[0.75 - 1.96*0.1879, 0.75 + 1.96*0.1879] = [0.382, 1.12].
This interval is not recommended. For one thing, the upper limit of 1.12 is not a possible value for WMWprob. More generally, its coverage tends to be less than advertised; that is, the probability that the interval will contain the true value of WMWprob tends to be less than 95%. The following two intervals completely avoid the problem of invalid values while providing good coverage in most cases.
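Continuing the R sketch, the placements, Sen's standard error, and the (not recommended) Wald interval can be computed as follows; again, the names are ours:
n1 <- length(y1); n2 <- length(y2)
# placement of each Y1 value: count Y2 values below it, ties count half
p1 <- sapply(y1, function(y) sum(y2 < y) + 0.5 * sum(y2 == y))   # 2, 2.5, 3, 3.5, 4
p2 <- sapply(y2, function(y) sum(y1 < y) + 0.5 * sum(y1 == y))   # 0, 0, 1.5, 3.5
v1 <- var(p1 / n2)                   # 0.039; var() uses denominator (n - 1)
v2 <- var(p2 / n1)                   # 0.110
std_err <- sqrt(v1 / n1 + v2 / n2)   # 0.1879
eWMWprob + c(-1, 1) * qnorm(0.975) * std_err   # Wald interval: 0.382, 1.118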
3b. A Quasi-Score Interval
Mee (1990) made the clever suggestion to express the StdErr as if eWMWprob were a binomial proportion, that is,
StdErr = sqrt[WMWprob * (1 - WMWprob) / N*].
N* is a "pseudo-N" that gives an appropriate standard error, such as that of Sen given above. Using Sen's standard error and substituting eWMWprob for WMWprob gives
N* = (eWMWprob) * (1 - eWMWprob) / StdErr^2.
​
For the example,
N* = 0.75 * 0.25 / (0.1879^2) = 5.31.
In general min(n1,n2) <= N* <= n1*n2.
Wilson (1927) provided a confidence interval for a binomial proportion based on a score test (see Hypothesis Testing). Mee (1990) applied Wilson's formula by treating eWMWprob as a binomial proportion with sample size N*. The formula as usually written looks complex (see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval) though that is no impediment to its use. However, the formula can also be written in a more familiar form by introducing a quasi-estimate and quasi-standard-error for WMWprob (call them qWMWprob and qStdErr). First define a constant C = k^2 / N* using the same "k" (e.g., 1.96) as in the familiar interval above.
qWMWprob = (eWMWprob + 0.5 * C) / (1 + C)
qStdErr = sqrt{[eWMWprob*(1-eWMWprob) + 0.25*C] / N*} / (1+C)
The quasi-score interval is
qWMWprob +/- k * qStdErr.
For the example, C = 1.96^2 / 5.31 = 0.7235, qWMWprob = 0.6451, qStdErr = 0.1528, giving the 95% confidence interval,
[0.346, 0.945]
The limits of the quasi-score interval are always between 0 and 1 (unlike the Wald interval), and the coverage is generally close to the desired value (e.g., 95%) unless sample sizes are small and/or WMWprob is near zero or one.
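A short R sketch of the quasi-score interval, continuing from std_err above:
n_star <- eWMWprob * (1 - eWMWprob) / std_err^2       # Mee's pseudo-N: 5.31
k <- qnorm(0.975)                                     # 1.96
C <- k^2 / n_star                                     # 0.7235
qWMW <- (eWMWprob + 0.5 * C) / (1 + C)                # 0.6451
qSE <- sqrt((eWMWprob * (1 - eWMWprob) + 0.25 * C) / n_star) / (1 + C)   # 0.1528
qWMW + c(-1, 1) * k * qSE                             # 0.346, 0.945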
3c. A Bayes Interval, Again Using Mee's N*
Another interval for a binomial proportion can be adapted to the WMW scenario by using N*. It is the Bayesian interval that assumes a beta(a,b) distribution for the (conjugate) prior. In the binomial case with X "successes" in N trials, the posterior distribution of the binomial probability parameter is beta(X+a, N-X+b). For the WMW scenario, replace N by N* and X by X* = N* * eWMWprob. Two popular choices for (a,b) are the "flat" prior, beta(1,1), which is a uniform distribution, and the Jeffreys prior, beta(0.5,0.5). In practice, the prior will often be dictated by the specific problem of interest. See the Examples.
The lower and upper confidence limits are obtained as quantiles of the posterior beta distribution. These are easily obtained from many software packages, for example,
qbeta(p, X*+a, N*-X*+b) in R,
quantile('beta', p, X*+a, N*-X*+b) in SAS,
beta.inv(p, X*+a, N*-X*+b) in Excel.
'p' is a probability related to the confidence level. For example, for a two-sided 95% confidence interval, p=0.025 for the lower limit and p=0.975 for the upper limit. For the example,
X* = 5.309735 * 0.75 = 3.982301
Flat prior: [0.332, 0.939]
Jeffreys prior: [0.337, 0.962]
The limits of the Bayesian interval are always between 0 and 1, and coverage is comparable to that of the quasi-score interval.
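In R, continuing the sketch, the Bayes interval is just a pair of posterior beta quantiles:
x_star <- n_star * eWMWprob   # 3.982
a <- 1; b <- 1                # flat prior; use a <- 0.5; b <- 0.5 for Jeffreys
qbeta(c(0.025, 0.975), x_star + a, n_star - x_star + b)   # 0.332, 0.939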
4. Hypothesis Testing
Hypothesis testing and constructing confidence intervals are complementary activities. That is, if the hypothesized null value of WMWprob (i.e., WMWprob0) falls outside the confidence interval, that is equivalent, in the testing scenario, to rejecting the null hypothesis. Conversely, the p-value for, say, a one-tailed hypothesis test is one minus the confidence level that causes WMWprob0 to fall exactly on the boundary of the one-sided confidence interval. In fact, confidence intervals are often obtained by "inverting" a hypothesis test.
​
Two common testing methods are the Wald test and the score test (or Lagrange multiplier test): see https://en.wikipedia.org/wiki/Wald_test and https://en.wikipedia.org/wiki/Score_test . Applied to the WMW problem, they are as follows.
Wald test: z = (eWMWprob - WMWprob0) / StdErr.
​
For the example with WMWprob0 = 0.50,
z = 1.330, two-tailed p-value = 0.1834.
Quasi-score test: z = (eWMWprob - WMWprob0) / StdErr0
where StdErr0 is the standard error under the null hypothesis,
StdErr0 = sqrt[WMWprob0 * (1 - WMWprob0) / N*].
For the example,
StdErr0 = 0.2170 , z = 1.152, two-tailed p-value = 0.2493.
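Both tests in R, continuing the sketch above:
p0 <- 0.5
z_wald <- (eWMWprob - p0) / std_err     # 1.330
2 * pnorm(-abs(z_wald))                 # two-tailed p-value: 0.1834
se0 <- sqrt(p0 * (1 - p0) / n_star)     # 0.2170
z_qs <- (eWMWprob - p0) / se0           # 1.152
2 * pnorm(-abs(z_qs))                   # two-tailed p-value: 0.2493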
​
A third option is a Bayesian test. In this case there is no z test statistic. Instead, tail probabilities are calculated directly from the posterior beta distribution. Again, this is straightforward in many software packages:
pbeta(WMWprob0, X*+a, N*-X*+b) in R,
cdf('beta', WMWprob0, X*+a, N*-X*+b) in SAS,
beta.dist(WMWprob0, X*+a, N*-X*+b, TRUE) in Excel.
These formulas give the *lower* tail p-value (the posterior probability that WMWprob <= WMWprob0). One minus the lower-tail p-value is the upper-tail p-value. The two-tailed p-value is two times the smaller of the two one-tailed p-values.
For the example, using the flat prior (a=1,b=1), the lower tail p-value equals 0.1456, giving a two-tailed p-value of 0.2912. Using Jeffreys' prior gives a two-tailed p-value of 0.2487, very close to that of the quasi-score test.
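The corresponding R calculation under the flat prior, continuing the sketch:
lower <- pbeta(p0, x_star + a, n_star - x_star + b)   # 0.1456
2 * min(lower, 1 - lower)                             # two-tailed p-value: 0.2912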
To emphasize the point that an interval is the inversion of a test, here is an example of inverting the Wald test. Specifically, consider the following one-tailed test with significance level 0.05, noting that the critical z-value for rejecting the null hypothesis is (in R) qnorm(0.05) = -1.645.
Reject H0: WMWprob >= WMWprob0 in favor of H1: WMWprob < WMWprob0 if z < -1.645. Rewriting the rejection rule:
z = (eWMWprob - WMWprob0) / StdErr < -1.645
eWMWprob - WMWprob0 < -1.645 * StdErr
eWMWprob + 1.645 * StdErr < WMWprob0
H0 is rejected, with the conclusion that WMWprob < WMWprob0, if the entire one-sided interval (with lower limit set to zero) is below WMWprob0; that is, WMWprob0 is outside the interval, above the upper confidence limit.
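As a one-line check in R, the upper limit of that one-sided 95% Wald interval is
eWMWprob + qnorm(0.95) * std_err   # 1.059; again exceeding 1, illustrating the Wald interval's weakness
and H0 is rejected exactly when WMWprob0 exceeds this value.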
more to come