P values are ubiquitous in the medical literature and are often misinterpreted. A common misunderstanding is that the p value is the probability that the null hypothesis is true. But the null hypothesis is not random: it is either true or not.
In fact, a p value is the probability, under a specified null hypothesis, that a statistical summary of the data would be equal to or more extreme than its observed value. For example, when we observe a difference in mean total bilirubin levels between two samples, we would like to know how likely it is to see a difference that large or larger when there is no actual difference between the underlying populations. This is what the p value tells us. If the observed data give a p value of, say, 0.002, we consider the observed difference quite unlikely under the null hypothesis, and we reject the null hypothesis.
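The definition above can be made concrete with a permutation test, which estimates a p value by counting how often a difference at least as extreme as the observed one appears when the group labels are shuffled at random. This is only a minimal sketch: the function name `permutation_p_value`, the bilirubin values, and the number of permutations are assumptions made for illustration and do not come from any real study.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided Monte Carlo permutation p value for a difference in means."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding in the sums.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical total bilirubin values (mg/dL) for two samples.
sample_1 = [0.6, 0.7, 0.8, 0.9, 1.0, 0.7, 0.8, 0.9]
sample_2 = [1.1, 1.3, 1.2, 1.4, 1.0, 1.5, 1.2, 1.3]
print(permutation_p_value(sample_1, sample_2))
```

The returned value is the share of shuffles that produce a difference as large as the observed one, which is the "equal to or more extreme" probability in the definition, computed under the assumption that there is no real difference between the groups.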
P values are used extensively in medical research. Medical research involves a great deal of effort and cost, and this tempts investigators to dig more and more p values out of every possible subset of the data in order to extract as much information as possible. But if we torture the data long enough in search of a significant p value, it may confess, as they say, at the cost of an inflated false positive rate. For example, with α = 0.05 and 10 independent hypothesis tests in which every null hypothesis is actually true, there is about a 40% risk of rejecting at least one of them (1 - 0.95^10 ≈ 0.40). For 20 such tests the risk becomes about 64%.
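The 40% and 64% figures follow from the complement rule for independent tests: the chance of no false rejection in any of k tests is (1 - α)^k. A short sketch, with `familywise_error_rate` as a name chosen here for illustration and assuming that the tests are independent and that every null hypothesis is true:

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false rejection across n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


for k in (1, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 2))
```

Real subgroup tests are rarely fully independent, so this formula is best read as a guide to how quickly the risk grows rather than as an exact figure for a given trial.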
One scenario in which this problem arises is when we perform hypothesis tests on multiple subpopulations of the data, i.e., subgroup analyses. Subgroup analyses, and the ever-growing number of p values they generate from a single data set, are very common in medical publications. However, both pre-specified and post-hoc subgroup analyses are subject to inflated false positive rates arising from multiple hypothesis testing.
So how should you proceed when you really want to look at different subgroups defined by baseline characteristics? A good statistical method for assessing heterogeneity of treatment effects among subgroups is to begin with a statistical test for interaction. If the interaction test is not significant, there is no statistical evidence of a subgroup effect. It can happen that the treatment effects look very different in two subgroups, and the two subgroup p values look very different, yet the test for interaction is not significant. In such cases, if we skip the test for interaction and report two separate p values for the two subgroups, the results could be misleading.
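One simple form of an interaction test for two independent subgroups compares the two treatment-effect estimates directly: the difference between them is divided by its standard error and referred to a standard normal distribution. The sketch below assumes approximately normal estimates. The function names and all effect sizes and standard errors are hypothetical, chosen only to reproduce the situation described above.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def interaction_test(effect_1, se_1, effect_2, se_2):
    """z test for the difference between two independent subgroup effects."""
    diff = effect_1 - effect_2
    se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2)
    z = diff / se_diff
    return z, two_sided_p(z)


# Hypothetical treatment effects (treated minus control) and standard errors.
effect_a, se_a = -8.0, 3.5   # subgroup A
effect_b, se_b = -2.0, 4.0   # subgroup B

p_a = two_sided_p(effect_a / se_a)
p_b = two_sided_p(effect_b / se_b)
z_int, p_int = interaction_test(effect_a, se_a, effect_b, se_b)

print(f"subgroup A p = {p_a:.3f}")
print(f"subgroup B p = {p_b:.3f}")
print(f"interaction p = {p_int:.3f}")
```

With these made-up numbers, subgroup A falls below 0.05 and subgroup B does not, yet the interaction test is not significant. Reporting only the two subgroup p values would suggest a difference between subgroups that the data do not support.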
However, a difficulty with statistical tests for interaction is that they are often not powerful enough to detect heterogeneity of treatment effect among subgroups. In such cases, the literature suggests that a logical approach is to evaluate the benefits versus the risks of the treatment for the major subgroups in the study.
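The low power of interaction tests can be illustrated with a rough normal-approximation calculation. The sketch below rests on several simplifying assumptions that are not taken from any specific trial: two equal-sized subgroups, equal outcome variance, a trial sized for 80% power on the overall effect, and an interaction of the same magnitude as that overall effect. Under these assumptions the standard error of the interaction estimate is twice that of the overall effect. The function name `two_sided_power` is hypothetical.

```python
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def two_sided_power(effect: float, se: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z test to detect `effect` given standard error `se`."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)


# Choose the overall effect so the main comparison has about 80% power (SE = 1).
se_overall = 1.0
effect = (_norm.inv_cdf(0.975) + _norm.inv_cdf(0.80)) * se_overall

# Each subgroup effect has SE sqrt(2) * se_overall, so the difference between
# two subgroup effects has SE sqrt(2) * sqrt(2) * se_overall = 2 * se_overall.
se_interaction = 2.0 * se_overall

power_overall = two_sided_power(effect, se_overall)
power_interaction = two_sided_power(effect, se_interaction)
print(round(power_overall, 2), round(power_interaction, 2))
```

Under these assumptions the interaction test has far less power than the main comparison, which is why a non-significant interaction test cannot by itself rule out a real difference between subgroups.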
Also, what if the primary outcome fails? Can we look into some subgroups in search of a significant p value? Not really. For a trial in which the overall result for the primary outcome is not significant, such unplanned subgroup analyses are often misleading for the reasons already stated (a possible lack of supporting interaction results, and hypothesis tests not adjusted for multiple comparisons).
P values are an almost inevitable part of medical research. However, if in our eternal quest for a significant p value we keep performing unplanned statistical tests on various subgroups of a data set, the results could be highly deceptive. By contrast, when properly planned, reported, and interpreted, subgroup analyses can provide valuable information for current as well as future research. This is only a glimpse of a big problem, in which a statistician is asked to conduct several unplanned subgroup analyses without the implications being understood. It is very important that the study team understands the overall problem and the implications of such analyses, and considers the possible solutions. Interested readers are advised to study the references listed below.
- Brookes ST, Whitley E, Egger M, et al. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of Clinical Epidemiology 2004; 57: 229-36.
- Lagakos SW. The challenge of subgroup analyses – reporting without distorting. NEJM 2006; 354:1667-1669.
- Assmann SF, Pocock SJ, Kasten LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. The Lancet 2000; 355: 1064-69.
- Rothwell PM. Subgroup analysis in randomized controlled trials: importance, indications, and interpretation. The Lancet 2005; 365: 176-86.
- Alosh M, Huque MF, Bretz F, D’Agostino RB Sr. Tutorial on statistical considerations on subgroup analysis in confirmatory clinical trials. Statistics in Medicine 2016; 36: 1334-1360.