Trial By Error: More on the Revised Cochrane Exercise Review

By David Tuller, DrPH

Cochrane’s republication last week of its seriously problematic exercise-for-CFS systematic review has triggered an outpouring of comment about the organization’s flawed decision-making and low-quality scientific reasoning. One very smart member of the Science For ME forum, Michiel Tack, posted an excellent overview of the changes between the prior version and the one published last week.

I am reposting it below with his permission. A big disappointment in the review is the last point Michiel makes–that the new version maintains that the evidence for fatigue reduction through exercise is of “moderate” quality. Yet from e-mails exchanged in May between Cochrane and the Norwegian institute representing the review authors, it seemed likely that the quality or certainty of the evidence would be downgraded. David Tovey, Cochrane’s recently retired editor-in-chief, stated clearly in that e-mail exchange that the evidence for fatigue reduction could only be described as of “low” quality or certainty, as I recently reported. Perhaps Karla Soares-Weiser, the new editor-in-chief, disagreed with that position. If so, she is clearly wrong.

Another disappointing aspect noted by Michiel is that PACE is rated as having a low risk of selective reporting bias. This goes to the question of whether changes made after recruitment of participants can be called “pre-specified.” In open label trials relying on subjective outcomes–like every single study in the Cochrane review–investigators are likely to know general outcome trends before looking at the actual data. So it is absurd to claim that choices made in such studies after data has been collected can be called “pre-designated.” It is a disgrace that Cochrane has allowed this bogus maneuver.

Thanks, Michiel, for your excellent work! Others can follow the online discussion on Science For ME here.

**********

I thought it might be useful to get an overview of the major changes compared to the 2017 version. I prefer focusing on the main comparison of exercise therapy versus a passive control condition…

1) Different description of CFS
Myalgic encephalomyelitis (ME) makes its way into the abstract. The description of CFS has been changed from a common, debilitating and serious health problem characterized by medically unexplained fatigue to “a serious disorder characterised by persistent postexertional fatigue and substantial symptoms related to cognitive, immune and autonomous dysfunction.”

2) Diagnostic criteria
The amended review makes clear that the results only apply to patients selected with the Fukuda or Oxford criteria. The conclusion in the abstract now reads: “All studies were conducted with outpatients diagnosed with 1994 criteria of the Centers for Disease Control and Prevention or the Oxford criteria, or both. Patients diagnosed using other criteria may experience different effects.”

3) Standardized mean differences (SMD)
The 2017 version focused on mean differences (MD), where all the results that use the same version of a questionnaire are pooled together. The problem with this approach is that you don’t get an overview of all the results for one outcome (say fatigue) if different questionnaires were used. And that’s of course what interests readers the most: the result for all fatigue outcomes taken together. That requires a pooling of results on different questionnaires for the same outcome into what is called a standardized mean difference (SMD). In the old version, SMDs were only reported in the sensitivity analysis. The late Robert Courtney pointed out that this was not according to the protocol (Edmonds et al. 2004) and that it allowed the authors to present their results more favorably. One example: the effect on fatigue at follow-up was not statistically significant when expressed in SMD, but by focusing on MDs for separate versions of the Chalder Fatigue Scale this was not easily visible in the review.

4) Recalculation to the 33-point Chalder Fatigue Scale
The downside of an SMD is that it is difficult to interpret because the results no longer relate to an actual questionnaire. Cochrane has therefore asked the authors to recalculate the size of SMD results for all fatigue outcomes into an MD for the 33-point version of the Chalder Fatigue Scale, which is now the most commonly used version. So first all fatigue results were pooled together and then it was calculated how large that effect would be on the 33-point version of the Chalder Fatigue Scale. The SMD for fatigue was -0.66, suggesting a moderate effect size. But when re-expressed on the 33-point scale, this corresponded to a 3.4-point reduction, which seems rather small.
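The arithmetic behind such a re-expression can be sketched in a few lines. This is only an illustration, assuming the common approach of multiplying the SMD by a representative standard deviation for the target scale (MD = SMD × SD). The overview does not state which SD the review authors used, so the value below is back-calculated from the two figures quoted above and is an assumption, not a number taken from the review. The function names are hypothetical.

```python
# Figures quoted in the overview above.
SMD_FATIGUE = -0.66    # pooled standardized mean difference for fatigue
MD_CHALDER_33 = -3.4   # the same effect re-expressed on the 33-point scale


def implied_sd(md: float, smd: float) -> float:
    """SD that links an MD to an SMD, assuming MD = SMD * SD."""
    return md / smd


def smd_to_md(smd: float, sd: float) -> float:
    """Re-express an SMD in the units of a scale with the given SD."""
    return smd * sd


# Assumption: back-calculated SD, not a value reported in the overview.
sd = implied_sd(MD_CHALDER_33, SMD_FATIGUE)
print(round(sd, 2))  # 5.15
```

Under this assumption, the reported figures are consistent with a standard deviation of roughly 5 points on the 33-point scale, which is why a "moderate" SMD translates into only a few points.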

5) Minimal Important Differences (MID)
To estimate whether a 3.4-point reduction on the Chalder Fatigue Scale is clinically significant, the authors searched for minimal important differences (MID). They found no study on CFS that did this, but a paper on lupus reported a threshold around 2.3 points on the Chalder Fatigue Scale. According to the authors, this indicates that the change caused by exercise therapy was clinically significant. They estimated MIDs for other outcome measures as well.

6) Standardised language reflecting the GRADE assessment system
In the old review, the authors did not use a consistent method to describe the strength of evidence. They made statements that reflect their own impression of the evidence such as “encouraging evidence suggests that exercise therapy can contribute to alleviation of some symptoms of CFS” or “Patients with CFS may generally benefit […] following exercise therapy” or “We think the evidence suggests that exercise therapy might be an effective and safe intervention” or “seven studies consistently showed a reduction in fatigue following exercise therapy at end of treatment”. The new wording is standardized and reflects quality scores of the GRADE assessment system. The word ‘probably’ reflects moderate-quality evidence, ‘may’ reflects low-quality evidence and ‘uncertain’ reflects very low quality evidence. In general, this means that the results are more carefully worded to reflect the underlying evidence. One example: In the 2017 version the word ‘uncertain’ was used once, in the amended version it is used 76 times.

7) Evidence on adverse events becomes ‘uncertain’
One of the most notable changes of consistently using the GRADE assessment system is how the evidence on adverse events is presented. The new version restricts itself to cautious statements such as “we are uncertain about the risk of serious adverse reactions because the certainty of the evidence is very low.” The previous version did recognize that sparse data made it difficult to draw conclusions, but it also made strong statements such as “no evidence suggests that exercise therapy may worsen outcomes” or “few serious adverse reactions were reported” or “exercise therapy did not worsen symptoms for people with CFS.” In their conclusion the authors wrote: “We think the evidence suggests that exercise therapy might be an […] safe intervention.” These statements have now been deleted or reworded.

8) Uncertain results at follow-up
Another notable change is the evidence on the long-term follow-up for outcomes such as fatigue and physical function. The analysis of the data shows that at this measurement point the improvements were no longer statistically significant. As the late Robert Courtney pointed out, this was not mentioned in the abstract or explained in the main text. The old abstract confusingly wrote that “study authors reported a positive effect of exercise therapy at end of treatment with respect to […] physical function […] and self-perceived changes in overall health.” It was not made clear that this ‘positive effect’ was not statistically significant when data were pooled together. The results for fatigue at follow-up were not mentioned in the abstract. The new abstract makes clear that for each outcome except for sleep the results at follow-up are uncertain because the certainty of the evidence is very low.

9) Elaboration of the summary of findings tables
The results for fatigue and physical function at follow-up are now presented in the summary of findings tables, which wasn’t the case in the previous version. Instead of mentioning whether a measurement was taken post-treatment or at follow-up, the summary tables now give the exact time point or interval of outcome assessments. Overall, these summary of findings tables have become more elaborate and also present the results for comparison 2 (exercise therapy versus psychological treatment), comparison 3 (exercise therapy versus adaptive pacing therapy) and comparison 4 (exercise therapy versus antidepressants).

10) Probably
The authors have rated the results for fatigue post-treatment as moderate quality, which is reflected in the wording “exercise therapy probably has a positive effect on fatigue.” The old version also rated the evidence for post-treatment fatigue as ‘moderate-quality’ but it used a different phrasing. The conclusion read: “Patients with CFS may […] feel less fatigued following exercise therapy.” The word ‘probably’ wasn’t used.

11) High risk of performance and detection bias highlighted
The amended abstract makes clear that the studies in the review have a high risk of bias for certain domains. It reads: “Most studies had a low risk of selection bias. All had a high risk of performance and detection bias.” The old version was more ambiguous and wrote: “Risk of bias varied across studies, but within each study, little variation was found in the risk of bias across our primary and secondary outcome measures.” In the Discussion section the old version even claimed that “risk of bias across studies was relatively low.”

12) The 11-point version of the Chalder Fatigue Scale for the FINE Trial
The authors have now used the 11-point version of the Chalder Fatigue Scale for the FINE trial (Wearden et al. 2010) instead of the 33-point version, which was not published in the peer-reviewed literature. This has caused a change in the SMD for the FINE Trial from -0.43 to -0.27. The overall SMD for fatigue, however, changed only slightly because of this: instead of -0.68 [-1.02, -0.35] it now reads -0.66 [-1.01, -0.31].

13) More sensitivity analyses
The amended review has more sensitivity analyses. These are extra analyses made to see if the results remain the same if something is interpreted differently or if some studies are left out of the analysis. The old version tested, for example, how excluding the study by Powell et al. 2001 influenced the results, because this study reported much larger improvements than other studies. The new version also tests how exclusion of the PACE and FINE trials influences the results for key outcomes such as fatigue and physical function. The amended review also has sensitivity analyses for outcomes of sleep and self-perceived changes in overall health, which were not reported in the old version.

14) Two additional studies mentioned: GETSET and Marques et al.
The authors noted that since they performed their systematic search of the literature in May 2014, two more randomized trials have been published that are relevant and could be included in future updates. These have also reported positive findings for GET:

Marques M, De Gucht V, Leal I, Maes S. Effects of a self-regulation based physical activity program (the “4-STEPS”) for unexplained chronic fatigue: a randomized controlled trial. International Journal of Behavioral Medicine 2015;2:187-96. [DOI: 10.1007/s12529-014-9432-4]

Clarke LV, Pesola F, Thomas JM, Vergara-Williamson M, Beynon M, White PG. Guided graded exercise self-help plus specialist medical care versus specialist medical care alone for chronic fatigue syndrome (GETSET): a pragmatic randomised controlled trial. Lancet 2017;390(10092):363-73. [DOI: 10.1016/S0140-6736(16)32589-2]

15) Extra feedback and comments
Extra feedback has been submitted. According to Richard Gardner, the statement that there is no evidence that exercise therapy may worsen outcomes may be misleading, as no conclusion could be made about the drop-out rates. Adrienne Wooding noted that the Cochrane review erroneously places ME/CFS in its mental health category. Mark Vink referred to his reanalysis and critique of the Cochrane review, which indicates that objective outcomes generally do not show improvements following exercise therapy.

16) Minor, non-important changes to the text
If one puts the old and amended texts next to each other, one will notice that some sections have been rewritten, shortened or reformatted. In my view, these are not important changes to the analysis. Instead, they seem more like clarifications, explanations of the changes made or shortening of the text because it had otherwise become too long. I have therefore chosen not to specify these minor changes in detail because the overview would then be much more complicated. If anyone does see important changes to the text that I have overlooked, please let me know, so that this overview can be updated.

The following changes were proposed but rejected:

1) Objective outcomes
Tom Kindlon and Robert Courtney noted that, with the exception of health resource use, Larun et al. have not reported on objective outcomes. The randomized trials included in the review had data on outcomes such as exercise testing, a fitness test, the six-minute walking test, employment status and disability payments. Objective outcomes tend to be less influenced by bias due to a lack of blinding. The analysis by Vink & Vink-Niese showed that, with some exceptions, objective outcomes generally have not significantly improved following exercise therapy. Back in 2015, the authors responded that “the protocol for this review did not include objective measurements.” But they did seem to agree that objective measures should be carefully considered in an update. No extra objective outcomes were reported in the amended review.

2) Compliance
Kindlon also asked about data on compliance: information on whether the trial participants really followed the exercise therapy as prescribed. He wrote: “it would be interesting if you could obtain some unpublished data from activity logs, records from heart-rate monitors, and other records to help build up a picture of what exercise was actually performed and the level of compliance.” Again, the authors seemed to agree that this is an important point that should be considered in an update of the review. No information is provided on compliance in the 2019 amendment.

3) Selective reporting in the PACE trial
Tom Kindlon and Robert Courtney both argued that the PACE trial should not be rated as low risk of bias for selective reporting. They referred to the Cochrane tool for assessing risk of bias (RoB 1), where the low risk of bias was explained as “The study protocol is available and all of the study’s pre-specified (primary and secondary) outcomes that are of interest in the review have been reported in the pre-specified way.” Kindlon and Courtney argued that this was not the case for the PACE trial and that therefore the trial should not be rated as low risk of bias. Their comments were supported by Cochrane editor Nuala Livingstone during an internal audit of Courtney’s complaint to Cochrane. In their 2015 response, Larun et al. acknowledged that changes were made to planned analysis specified in the protocol of the PACE trial but argued that “these changes were drawn up before the analysis commenced and before examining any outcome data.” In the 2019 amendment, all risk of bias judgements have remained the same, including the low risk of bias on selective reporting for the PACE trial. The authors justify this as follows: “The protocol and the statistical analysis plan were not formally published prior to recruitment of participants, and some readers, therefore, claim the study should be viewed as being a post hoc study. The study authors oppose this, and have published a minute from a Trial Steering Committee (TSC) meeting stating that any changes made to the analysis since the original protocol was agreed by TSC and signed off before the analysis commenced.”

4) Proposal to analyze the excluded data from Jason et al.
For the outcome of physical function at follow-up, the study by Jason et al. was excluded because of large baseline differences: the exercise group had much lower (39) physical function scores than the relaxation group (54). Kindlon noted that “It would be good if other methods could be investigated (e.g. using baseline levels as covariates) to analyse such data.” The authors responded that this would make the analysis very complicated and that this can be more easily addressed in a review based on individual patient data. The 2019 amendment does not use an alternative method to include the results of Jason et al. on physical function at follow-up.

5) Downgrading fatigue post-treatment to low-quality evidence
From a publicly released email exchange, we know that the previous Cochrane editor-in-chief, David Tovey, strongly objected to the results for fatigue post-treatment being rated as moderate quality. He wrote: “the conclusion that this is moderate certainty evidence seems indefensible to me.” Tovey argued that it could be further downgraded for inconsistency (because of considerable heterogeneity reflected by an I² of 80%) or imprecision (because the confidence interval of the effect crosses the line of no longer being clinically significant). The authors, represented by officials of the Norwegian Institute of Public Health (NIPH), argued that heterogeneity was mostly due to the study by Powell et al.: when it was removed, the heterogeneity became acceptable while the effect size remained moderate. Regarding imprecision, they argued that GRADE only advises downgrading when the confidence interval crosses the line of no effect, not the line of a clinically significant effect. In the email correspondence, the authors did seem to agree that these were both borderline cases and open to interpretation. They, therefore, proposed the following compromise, as explained by Atle Fretheim from the NIPH: “I proposed a compromise: We simply grade the evidence for this outcome as Low-moderate. The authors have accepted to use the term ‘may’ (usually indicating low certainty evidence) when describing the certainty of the evidence, rather than the term ‘probably’ (usually indicating moderate certainty). They have also accepted not to use any categorization of the effect size.” An alternative solution proposed was to use the term “low to moderate quality evidence”. The 2019 amendment, however, uses the words “probably” and “moderate-certainty evidence”.
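The two readings of imprecision can be made concrete with the figures quoted earlier in this overview. The sketch below is a rough approximation, not the review's own calculation: it converts the SMD confidence interval from point 12 to the 33-point scale using a standard deviation back-calculated from point 4 (3.4 / 0.66), which is an assumption, and then checks the interval against the line of no effect and against the 2.3-point lupus-derived threshold from point 5. The helper names are hypothetical.

```python
SMD_CI = (-1.01, -0.31)  # 95% CI of the pooled fatigue SMD (point 12)
IMPLIED_SD = 3.4 / 0.66  # assumption: SD back-calculated from point 4
MID = 2.3                # lupus-derived threshold (point 5)


def ci_on_scale(ci, sd):
    """Convert an SMD confidence interval to scale points (MD = SMD * SD)."""
    lower, upper = ci
    return (lower * sd, upper * sd)


def crosses(ci, threshold):
    """True if the threshold lies strictly inside the interval."""
    lower, upper = ci
    return lower < threshold < upper


md_ci = ci_on_scale(SMD_CI, IMPLIED_SD)

# Criterion the NIPH officials pointed to: the line of no effect.
crosses_no_effect = crosses(md_ci, 0.0)

# Criterion Tovey pointed to: the line of a clinically significant
# effect (a reduction is negative, hence -MID).
crosses_mid = crosses(md_ci, -MID)

print(tuple(round(x, 1) for x in md_ci))  # (-5.2, -1.6)
print(crosses_no_effect, crosses_mid)     # False True
```

Under these assumptions the interval excludes zero but includes reductions smaller than 2.3 points, so whether the evidence should be downgraded for imprecision depends on which line is used as the criterion.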
