Trends towards non-significance
Evening up
Like many an applied statistician, I have encountered many forms of the "trend towards significance" description that tends to get attached to P-values that are not below the magic threshold of 0.05. "The result showed a strong trend towards significance (P=0.058)." I have occasionally retaliated sarcastically by suggesting that values such as (say) P=0.042 should be described as exhibiting a trend towards non-significance.
The fragility index[1] could be regarded as a contribution towards this negative point of view: what would have to happen for the significance to be lost? For binary outcomes it poses the question "how many results would have to change for significance to turn into non-significance?" I shall explain how it does this in due course, but first I shall make a couple of remarks about binary outcomes in clinical trials as follows:
Having got that out of the way, I shall accept point 1, ignore point 2 and get on with explaining what the fragility index does.
Fisher's inexact test
The fragility index was proposed by Walsh et al[1] in 2014 and can be explained in terms of two tables based on a Figure 1 of theirs. The first is a generic 2x2 table analysed using Fisher's exact test and given in Table 1.
Table 1. Generic representation of a 2x2 table analysed using Fisher's exact test. The observed frequencies are represented by the letters a, b, c, d. It is supposed that the results have been analysed and the P-value is less than 0.05.
At this point, for the purpose of further discussion, I shall assume that we are interested in results in a particular direction, have calculated the one-tailed P-value using Fisher's exact test and doubled it. There are other ways to proceed that I shall not consider, and if you don't like my proposal, suppose instead that we are talking about one-sided significance with P<0.025. I shall also assume that events are good and not bad so that, other things being equal, if there were more events in the control group the result would be less impressive.
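For concreteness, here is a minimal Python sketch of that convention: take the one-tailed Fisher exact P-value in the direction favouring the intervention and double it. The table layout, the function name and the use of scipy are illustrative choices of mine, not anything prescribed by Walsh et al.

```python
# A minimal sketch (mine, not Walsh et al's) of the convention just described:
# take the one-tailed Fisher exact P-value in the direction favouring the
# intervention and double it. Table layout and counts are illustrative.
from scipy.stats import fisher_exact

def doubled_one_tailed_p(events_int, nonevents_int, events_ctl, nonevents_ctl):
    """Twice the one-tailed Fisher exact P-value, capped at 1."""
    table = [[events_int, nonevents_int],
             [events_ctl, nonevents_ctl]]
    # alternative="greater" tests for an excess of events under intervention
    _, p_one_tailed = fisher_exact(table, alternative="greater")
    return min(1.0, 2 * p_one_tailed)

# Example: 6 events out of 10 on intervention versus 0 out of 10 on control
print(doubled_one_tailed_p(6, 4, 0, 10))  # one-tailed 0.0054, doubled about 0.011
```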
Now Walsh et al consider another, similar, generic table such as is represented by Table 2 below. We suppose that the numbers of patients are as before but that we had observed f more events in the control group.
Table 2. Generic representation of a 2x2 table analysed using Fisher's exact test. The frequencies in Table 1 have been perturbed in the control group by a positive number f. It is supposed that the results have been analysed and the P-value is greater than or equal to 0.05.
The question they then pose is "how large does f>0 have to be for Table 1 (significant) to turn into Table 2 (not significant)?". The value of f is the fragility index.
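In code, the definition amounts to the following sketch: keep the numbers of patients per arm fixed, add events to the control group one at a time, and stop as soon as significance (taken here, per my convention above, as a doubled one-tailed Fisher exact P-value below 0.05) is lost. The function name and loop structure are assumptions of mine.

```python
# A sketch of the fragility index under the convention adopted above
# (doubled one-tailed Fisher exact P, significance threshold 0.05).
from scipy.stats import fisher_exact

def fragility_index(events_int, n_int, events_ctl, n_ctl, alpha=0.05):
    """Smallest f >= 0 such that adding f events to the control group
    (arm sizes fixed) makes the doubled one-tailed P reach or exceed alpha."""
    f = 0
    while events_ctl + f <= n_ctl:
        table = [[events_int, n_int - events_int],
                 [events_ctl + f, n_ctl - (events_ctl + f)]]
        _, p = fisher_exact(table, alternative="greater")
        if 2 * p >= alpha:
            return f
        f += 1
    return None  # significance never lost within the possible range

print(fragility_index(6, 10, 0, 10))  # 1 for the worked example discussed below
```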
A first point to note is that this is rather an odd thing to do. Fisher's exact test conditions on both margins: not only the numbers of patients on each arm, which may have been fixed by design, but the numbers of events and non-events, which will usually not have been fixed. The only exception to their not having been fixed I can think of is if the binary values have been created using a dichotomy based on some quantile split (most obviously the median). Dichotomies are common, alas, but this form is rare in clinical trials.
Now, treating the second margin as fixed is controversial. Fisher argued it was the right thing to do and I think he was right. (Some information may be lost in the most extreme case.) Of course, not everybody agrees. However, if you don't agree, it is rather strange to use Fisher's exact test in the first place. On the other hand, if you do agree, it is strange to perturb the table without keeping the margins fixed. A more logical question would seem to be, "given that there were a+c events in total, what would the result be if they had split a+f versus c-f rather than a versus c?". This is illustrated in Table 3 below.
Table 3. Generic representation of a 2x2 table analysed using Fisher's exact test and holding the marginal totals constant. It is supposed that the results have been analysed and the P-value is greater than or equal to 0.05.
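A sketch of the corresponding calculation, in which the total number of events a+c is held fixed and f events are shifted from the intervention group to the control group, might look as follows. (I give the quantity a neutral name here; I shall christen it properly later on.)

```python
# A sketch (mine) of the margins-fixed alternative illustrated by Table 3:
# hold the total number of events constant and shift f events from the
# intervention group to the control group until significance is lost
# (doubled one-tailed Fisher exact P >= 0.05, as above).
from scipy.stats import fisher_exact

def margins_fixed_index(events_int, n_int, events_ctl, n_ctl, alpha=0.05):
    f = 0
    while events_int - f >= 0 and events_ctl + f <= n_ctl:
        table = [[events_int - f, n_int - (events_int - f)],
                 [events_ctl + f, n_ctl - (events_ctl + f)]]
        _, p = fisher_exact(table, alternative="greater")
        if 2 * p >= alpha:
            return f
        f += 1
    return None  # significance never lost within the admissible range

print(margins_fixed_index(6, 10, 0, 10))  # 1 in the worked example below
```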
What argument could there be for not keeping the margins fixed? I suppose one argument could be that a random error has caused one to be analysing the wrong table. So it is possible, for example, that a coding or measurement error or some other mistake has led one to use Table 1 but, if only the error could have been identified, we should have been faced with Table 2. If there were no error, it would be right to consider the set of tables with the current margins as the reference set but if error is entertained, this argument is no longer valid.
I am not sure what I feel about this. For the moment I keep an open mind.
Here's one I made earlier
Now, it just so happens that, for a very different purpose, I had done the necessary calculations to answer the revised question that corresponds to Table 3 rather than Table 2. I could discuss this other purpose, which was an investigation as to whether, in a context of multiple testing, dichotomies were, after all, a good idea, but that would just spoil for you the pleasure you would otherwise have in reading section 10.2.16 of your third edition of Statistical Issues in Drug Development[2] to find out, so I shan't. (If you are not lucky enough to own a copy yet of this attractively priced and engagingly written book, you will be pleased to know that it is available from all the best internet outlets, for example this one.)
Figure 1. All possible one tailed P-values for a clinical trial with 10 patients given the interventional treatment and 10 given a control treatment. Each curve is labelled by the number of good outcomes in total, the X axis gives the number of such outcomes under intervention and the Y axis gives the P-value. (Note the log scale.) The significance level is supposed to have been set to 2.5% (one-sided).
Figure 1 is a colour version of Figure 10.3 from the book. Take, for example, the red dashed curve with the red diamonds. This plots all possible P-values when the total number of good outcomes is 6. The case that most favours intervention is a 0 (control) to 6 (intervention) split. The one-sided P-value is 0.0054. There are only 7 possible P-values for this total number of good outcomes and they are:
P = 1.0000, 0.9946, 0.9296, 0.6858, 0.3142, 0.0704, 0.0054.
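These values are easily checked using the hypergeometric distribution that underlies Fisher's exact test; the following Python sketch (my own, not from the book) reproduces them.

```python
# A quick check of the seven P-values above, using the hypergeometric
# distribution underlying Fisher's exact test.
from scipy.stats import hypergeom

total_patients = 20   # 10 per arm
arm_size = 10         # patients on the intervention arm
total_good = 6        # good outcomes in total

for x in range(total_good + 1):   # good outcomes under intervention
    # one-tailed P-value: P(X >= x) under the hypergeometric distribution
    p = hypergeom.sf(x - 1, total_patients, total_good, arm_size)
    print(x, round(p, 4))
# 1.0, 0.9946, 0.9296, 0.6858, 0.3142, 0.0704, 0.0054
```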
These correspond to values of f in Table 3 running from six down to zero. The P-value for f=1 is 0.0704 and this means that the fragility index is 1, since if there is 1 more success in the control group the significance is gone.
Well, that's not quite right. I had redefined the fragility index, so let's give my redefined index a different name. Let's call it the fishing index (fi) in honour of the fact that it respects the logic of Fisher's exact test. So the fishing index is 1.
What about the fragility index itself? To calculate this I need to move across the curves. I increase the number of events in the control group from 0 to 1. This will increase the total number of events to 7, so I now need to move to the curve with the blue triangles, but stay with 6 events on the X axis. This moves me to just above the critical level of P=0.025; in fact the value is 0.029, so not significant. Hence the fragility index is 1.
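Again this is easy to check with a couple of calls to Fisher's exact test (a sketch of mine, not an official calculation):

```python
# Checking the move across curves: one extra control event takes the total
# number of good outcomes from 6 to 7, the intervention arm stays at 6,
# and the one-tailed P-value rises above 0.025.
from scipy.stats import fisher_exact

_, p_before = fisher_exact([[6, 4], [0, 10]], alternative="greater")  # table as observed
_, p_after = fisher_exact([[6, 4], [1, 9]], alternative="greater")    # one more control event
print(round(p_before, 4), round(p_after, 4))  # 0.0054 and 0.0286, i.e. about 0.029
```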
If you want to check this, there is a very handy online calculator, here.
To the ns of The Earth
Now consider a plot analogous to Figure 1 but with 30 patients per arm. This is given by Figure 2.
Figure 2. All possible one tailed P-values for a clinical trial with 30 patients given the interventional treatment and 30 given a control treatment. Each curve is labelled by the number of good outcomes in total, the X axis gives the number of such outcomes under intervention and the Y axis gives the P-value. (Note the log scale.) The significance level is supposed to have been set to 2.5% (one-sided) and is shown as a horizontal line. A lower line indicates the P-value of 0.0054 found in the previous example.
If the figure is studied carefully, it will be seen that for the curve corresponding to 18 good outcomes in total (red lozenges), the case where there are 14 such outcomes in the intervention group has a P-value just below that indicated by the lower dashed line. In fact the P-value is 0.0051 whereas for the case we had previously it was 0.0054. There is little to choose between these P-values. However, to judge the fragility index we need to move onto the next curve for 19 good outcomes (blue circles) and to the case where, again, 14 of these are in the intervention group. We note with satisfaction that the P-value is 0.0125 and hence still significant. Moving to the next highest case, 20 good outcomes (black circles), the P-value for 14 being in the intervention group is 0.0269 and so the result is no longer significant at the 2.5% one-sided level. Hence the fragility index is 2.
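The three P-values can be verified with the same hypergeometric calculation as before (again just a sketch of mine):

```python
# A sketch reproducing the three P-values just discussed for the 30-per-arm
# trial: 14 good outcomes on intervention with 18, 19 and 20 in total.
from scipy.stats import hypergeom

total_patients, arm_size, x = 60, 30, 14
for total_good in (18, 19, 20):
    p = hypergeom.sf(x - 1, total_patients, total_good, arm_size)
    print(total_good, round(p, 4))
# the values quoted above are 0.0051, 0.0125 and 0.0269 respectively,
# so the fragility index is 2: significance is lost at the second added event
```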
So if we compare the two cases: in the first case we have P=0.0054, f=1 and in the second case P=0.0051, f=2. It seems there is little to choose between the two cases in terms of probability but the second is less fragile.
Metric or bet trick?
But is f the right metric? After all, the P-value itself is a probability calculated on numbers but we don't make the inference on the numbers directly. Shouldn't we calculate some probability based on f?
How could we calculate such a value? This is rather difficult, since we need a model. Which of the following would you bet is more likely:
One of the problems with such a calculation would be that it would depend on the number of positive results there should have been. However, since a P-value is a probability calculated under the null, one could argue that one would expect an equal number in both groups, but clearly the observed numbers don't match this, so what should we condition on?
I don't see a clear answer. However, here is an argument I could use.
Bigger Ps have little Ps upon their backs to bite 'em
More than two centuries on, we are still arguing about what P-values mean. It now seems that P-values are attracting their own f indices for us to argue about.
Luckily, I have retired.
Acknowledgements
I thank Adan Becerra for drawing this to my attention and Andrew Althouse and others on Twitter for helpful discussions.
References
1. Walsh M, Srinathan SK, McAuley DF, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. J Clin Epidemiol. Jun 2014;67(6):622-8. doi:10.1016/j.jclinepi.2013.10.019
2. Senn SJ. Statistical Issues in Drug Development. 3rd ed. John Wiley & Sons; 2021.