Showing posts with label power. Show all posts

Saturday, 5 March 2016

There is a reproducibility crisis in psychology and we need to act on it


The Müller-Lyer illusion: a highly reproducible effect. The central lines are the same length but the presence of the fins induces a perception that the left-hand line is longer.

The debate about whether psychological research is reproducible is getting heated. In 2015, Brian Nosek and his colleagues in the Open Science Collaboration showed that they could not replicate effects for over 50 per cent of studies published in top journals. Now we have a paper by Dan Gilbert and colleagues saying that this is misleading because Nosek’s study was flawed, and actually psychology is doing fine. More specifically: “Our analysis completely invalidates the pessimistic conclusions that many have drawn from this landmark study.” This has stimulated a set of rapid responses, mostly in the blogosphere. As Jon Sutton memorably tweeted: “I guess it's possible the paper that says the paper that says psychology is a bit shit is a bit shit is a bit shit.”
So now the folks in the media are confused and don’t know what to think.
The bulk of debate has been focused on what exactly we mean by reproducibility in statistical terms. That makes sense because many of the arguments hinge on statistics, but I think that ignores the more basic issue, which is whether psychology has a problem. My view is that we do have a problem, though psychology is no worse than many other disciplines that use inferential statistics.
In my undergraduate degree I learned about stuff that was on the one hand non-trivial and on the other hand solidly reproducible. Take, for instance, various phenomena in short-term memory. Effects like the serial position effect, the phonological confusability effect, and the superiority of memory for words over nonwords are solid and robust. In perception, we have striking visual effects such as the Müller-Lyer illusion, which demonstrate how our eyes can deceive us. In animal learning, the partial reinforcement effect is solid. In psycholinguistics, the difficulty adults have discriminating sound contrasts that are not distinctive in their native language is solid. In neuropsychology, the dichotic right ear advantage for verbal material is solid. In developmental psychology, it has been shown over and over again that poor readers have deficits in phonological awareness. These are just some of the numerous phenomena studied by psychologists that are reproducible in the sense that most people understand it, i.e. if I were to run an undergraduate practical class to demonstrate the effect, I’d be pretty confident that we’d get it. They are also non-trivial, in that a lay person would not just conclude that the result could have been predicted in advance.
The Reproducibility Project showed that many effects described in contemporary literature are not like that. But was it ever thus? I’d love to see the reproducibility project rerun with psychology studies reported in the literature from the 1970s – have we really got worse, or am I aware of the reproducible work just because that stuff has stood the test of time, while other work is forgotten?
My bet is that things have got worse, and I suspect there are a number of reasons for this:
1. Most of the phenomena I describe above were in areas of psychology where it was usual to report a series of experiments that demonstrated the effect and attempted to gain a better understanding of it by exploring the conditions under which it was obtained. Replication was built in to the process. That is not common in many of the areas where reproducibility of effects is contested.
2. It’s possible that all the low-hanging fruit has been plucked, and we are now focused on much smaller effects – i.e., where the signal of the effect is low in relation to background noise. That’s where statistics assumes importance. Something like the phonological confusability effect in short-term memory or a Müller-Lyer illusion is so strong that it can be readily demonstrated in very small samples. Indeed, abnormal patterns of performance on short-term memory tests can be used diagnostically with individual patients. If you have a small effect, you need much bigger samples to be confident that what you are observing is signal rather than noise. Unfortunately, the field has been slow to appreciate the importance of sample size and many studies are just too underpowered to be convincing.

3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity, rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but to first do adequately powered studies to demonstrate the effect and then conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening, holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.
4. Psychology, unlike many other biomedical disciplines, involves training in statistics. In principle, this is a thoroughly good thing, but in practice it can be a disaster if the psychologist is simply fixated on finding p-values less than .05 – and assumes that any effect associated with such a p-value is true. I’ve blogged about this extensively, so won’t repeat myself here, other than to say that statistical training should involve exploring simulated datasets so that the student starts to appreciate the ease with which low p-values can occur by chance when one has a large number of variables and a flexible approach to data analysis. Virtually all psychologists misunderstand p-values associated with interaction terms in analysis of variance – as I myself did until working with simulated datasets. I think in the past this was not such an issue, simply because it was not so easy to conduct statistical analyses on large datasets – one of my early papers describes how to compare regression coefficients using a pocket calculator, which at the time was an advance on other methods available! If you have to put in hours of work calculating statistics by hand, then you think hard about the analysis you need to do. Currently, you can press a few buttons on a menu and generate a vast array of numbers – which can encourage the researcher to just scan the output and highlight those where p falls below the magic threshold of .05. Those who do this are generally unaware of how problematic this is, in terms of raising the likelihood of false positive findings.
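The kind of simulation exercise mentioned in point 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an analysis of any real study: the group size (10), the number of outcome measures (20), the assumption that the measures are independent, and all function names are hypothetical choices made for the example. Every simulated dataset is pure noise, so any 'significant' group difference is by construction a false positive.

```python
import random

N_PER_GROUP = 10   # assumed group size (illustrative)
N_MEASURES = 20    # assumed number of independent outcome measures (illustrative)
T_CRIT = 2.101     # two-tailed 5% critical value of t for 18 df (10 + 10 - 2)


def t_statistic(a, b):
    """Student's t for two independent groups of equal size."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    # With equal group sizes the pooled standard error simplifies to this.
    se = ((var_a + var_b) / n) ** 0.5
    return (mean_a - mean_b) / se


def any_false_positive(rng):
    """Simulate one study with no true effects; True if any measure has p < .05."""
    for _ in range(N_MEASURES):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        if abs(t_statistic(group_a, group_b)) > T_CRIT:
            return True
    return False


def false_positive_rate(n_studies=2000, seed=1):
    """Proportion of null studies that report at least one 'significant' result."""
    rng = random.Random(seed)
    hits = sum(any_false_positive(rng) for _ in range(n_studies))
    return hits / n_studies


if __name__ == "__main__":
    print("Analytic expectation:", round(1 - 0.95 ** N_MEASURES, 2))
    print("Simulated proportion:", false_positive_rate())
```

With 20 independent tests at the .05 level, the chance of at least one spurious 'hit' is 1 − .95^20, or about .64, and the simulated proportion should land close to that. Correlated measures or flexible analysis choices would change the exact figure, but not the general lesson.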
Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognise that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.


Sunday, 12 January 2014

Why does so much research go unpublished?



As described in my last blogpost, I attended an excellent symposium on waste in research this week. A recurring theme was research that never got published. Rosalind Smyth described her experience of sitting on the funding panel of a medium-sized charity. The panel went to great pains to select the most promising projects, and would end a meeting with a sense of excitement about the great work that they were able to fund. A few years down the line, though, they'd find that many of the funds had been squandered. The work had either not been done, or had been completed but not published.

In order to tackle this problem, we need to understand the underlying causes. Sometimes, as Robert Burns noted, the best-laid schemes go wrong. Until you've tried to run a few research projects, it's hard to imagine the myriad different ways in which life can conspire to mess up your plans. The eight laws of psychological research formulated by Hodgson and Rollnick are as true today as they were 25 years ago.

But much research remains unpublished despite being completed. Reasons are multiple, and the strategies needed to overcome them are varied, but here is my list of the top three problems and potential solutions.

Inconclusive results


Probably the commonest reason for inconclusive results is lack of statistical power. A study is undertaken in the fond hope that a difference will be found between condition X and condition Y, and if the difference is found, there is great rejoicing and a rush to publish. A negative result should also be of interest, provided the study was well-designed and adequately motivated. But if the sample is small, then we can't be sure whether our failure to observe the effect is because it is absent: a real but small effect could be swamped by noise. 

I think the solution to this problem lies in the hands of funding panels and researchers: quite simply, they need to take statistical power very seriously indeed and to consider carefully whether anything will be learned from a study if the anticipated effects are not obtained. If not, then the research needs to be rethought. In the fields of genetics and clinical trials, it is now recognised that multicentre collaborations are the way forward to ensure that studies are conducted with sufficient power to obtain a conclusive result.
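To give a rough sense of what taking power seriously means at the planning stage, here is a minimal Python sketch of a sample-size calculation for a two-group comparison. It uses the standard normal approximation, which gives slightly smaller numbers than a calculation based on the t distribution, and the function name, default alpha of .05 (two-tailed) and target power of .8 are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate participants needed per group for a two-tailed, two-group test.

    effect_size is the expected group difference in standard deviation units.
    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d) ** 2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):  # illustrative small, medium and large effects
        print(f"effect size {d}: about {n_per_group(d)} per group")
```

Under these assumptions, an effect of half a standard deviation needs around 63 participants per group, and an effect of .2 SD needs nearly 400 per group, which is why single-site studies so often end up inconclusive.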

Rejection of completed work by journals


Even well-conducted and adequately powered studies may be rejected by journals if the results are not deemed to be exciting. To solve this problem, we must look to journals. We need recognition that - provided a study is methodologically strong and well-motivated - negative results can be as informative as positive ones. Otherwise we are doomed to waste time and money pursuing false leads.  As Paul Glasziou has emphasised, failure is part of the research process. It is important to tell people about what doesn't work if we are not to repeat our mistakes.

We do now have some journals that will publish negative results, and there is a growing move toward pre-registration of studies, with guaranteed publication if the methods meet quality criteria. But there is still a lot to be done, and we need a radical change of mindset about what kinds of research results are valuable.

Lack of time


Here, I lay the blame squarely on the incentive structures that operate in universities. To get a job, or to get promoted, you need to demonstrate that you can pull in research income. In many UK institutions this is quite explicit, and promotions criteria may give a specific figure to aim for of X thousand pounds research income per annum. There are few UK universities whose strategic plan does not include a statement about increasing research funding. This has changed the culture dramatically;  as Fergus Millar put it: "in the modern British university, it is not that funding is sought in order to carry out research, but that research projects are formulated in order to get funding".

Of course, for research to thrive, our Universities need people who can compete for funding to support their work. But the acquisition of funding has become an end in itself, rather than a means to an end. This has the pernicious effect of driving people to apply for grant after grant, without adequately budgeting for the time it takes to analyse and write up research, or indeed to carefully think about what they are doing.  As I argued previously, even junior researchers these days have an 'academic backlog' of unwritten papers.

At the Lancet meeting there were some useful suggestions for how we might change incentive structures to avoid such waste. Malcolm Macleod argued researchers should be evaluated not by research income and high-impact publications, but by the quality of their methods, the extent to which their research was fully reported, and the reproducibility of findings. An-Wen Chan echoed this, arguing for performance metrics that recognise full dissemination of research and use of research datasets by other groups. However, we may ask whether such proposals have any chance of being adopted when University funding is directly linked to grant income, and Universities increasingly view themselves as businesses.

I suspect we would need revised incentives to be reflected at the level of those allocating central funding before vice-chancellors took them seriously.  It would, however, be feasible for behaviour to be shaped at the supply end, if funders adopted new guidelines. For a start, they could look more carefully at the time commitments of those to whom grants are given: in my experience this is never taken into consideration, and one can see successful 'fat cats' accumulating grant after grant, as success builds on success. Funders could also monitor more closely the outcomes of grants: Chan noted that NIHR withholds 10% of research funds until a paper based on the research has been submitted for publication. Moves like this could help us change the climate so that an award of a grant would confer responsibility on the recipient to carry through the work to completion, rather than acting solely to embellish the researcher's curriculum vitae.

References

Chan, A., Song, F., Vickers, A., Jefferson, T., Dickersin, K., Gotzsche, P., Krumholz, H. M., Ghersi, D., & van der Worp, H. B. (2014). Increasing value and reducing waste: addressing inaccessible research. Lancet (8 Jan). doi: 10.1016/S0140-6736(13)62296-5

Macleod, M. R., Michie, S., Roberts, I., Dirnagl, U., Chalmers, I., Ioannidis, J. P. A., . . . Glasziou, P. (2014). Biomedical research: increasing value, reducing waste. Lancet, 383(9912), 101-104.

Sunday, 10 March 2013

High-impact journals: where newsworthiness trumps methodology

Here’s a paradox: Most scientists would give their eye teeth to get a paper in a high impact journal, such as Nature, Science, or Proceedings of the National Academy of Sciences. Yet these journals have had a bad press lately, with claims that the papers they publish are more likely to be retracted than papers in journals with more moderate impact factors. It’s been suggested that this is because the high impact journals treat newsworthiness as an important criterion for accepting a paper. Newsworthiness is high when a finding is both of general interest and surprising, but surprising findings have a nasty habit of being wrong.

A new slant on this topic was provided recently by a paper by Tressoldi et al (2013), who compared the statistical standards of papers in high impact journals with those of three respectable but lower-impact journals. It’s often assumed that high impact journals have a very high rejection rate because they adopt particularly rigorous standards, but this appears not to be the case. Tressoldi et al focused specifically on whether papers reported effect sizes, confidence intervals, power analysis or model-fitting. Medical journals fared much better than the others, but Science and Nature did poorly on these criteria. Certainly my own experience squares with the conclusions of Tressoldi et al (2013), as I described in the course of discussion about an earlier blogpost.

Last week a paper appeared in Current Biology (impact factor = 9.65) with the confident title: “Action video games make dyslexic children read better.” It's a classic example of a paper that is on the one hand highly newsworthy, but on the other, methodologically weak. I’m not usually a betting person, but I’d be prepared to put money on the main effect failing to replicate if the study were repeated with improved methodology. In saying this, I’m not suggesting that the authors are in any way dishonest. I have no doubt that they got the results they reported and that they genuinely believe they have discovered an important intervention for dyslexia. Furthermore, I’d be absolutely delighted to be proved wrong: There could be no better news for children with dyslexia than to find that they can overcome their difficulties by playing enjoyable computer games rather than slogging away with books. But there are good reasons to believe this is unlikely to be the case.

An interesting way to evaluate any study is to read just the Introduction and Methods, without looking at Results and Discussion. This allows you to judge whether the authors have identified an interesting question and adopted an appropriate methodology to evaluate it, without being swayed by the sexiness of the results. For the Current Biology paper, it’s not so easy to do this, because the Methods section has to be downloaded separately as Supplementary Material. (This in itself speaks volumes about the attitude of Current Biology editors to the papers they publish: Methods are seen as much less important than Results). On the basis of just Introduction and Methods, we can ask whether the paper would be publishable in a reputable journal regardless of the outcome of the study.

On the basis of that criterion, I would argue that the Current Biology paper is problematic, purely on the basis of sample size. There were 10 Italian children aged 7 to 13 years in each of two groups: one group played ‘action’ computer games and the other was a control group playing non-action games (all games from Wii's Rayman Raving Rabbids). Children were trained for 9 sessions of 80 minutes per day over two weeks. Unfortunately, the study was seriously underpowered. In plain language, with a sample this small, even if there is a big effect of intervention, it would be hard to detect it. Most interventions for dyslexia have small-to-moderate effects, i.e. they improve performance in the treated group by .2 to .5 standard deviations. With 10 children per group, the power is less than .2, i.e. there’s a less than one in five chance of detecting a true effect of this magnitude. In clinical trials, it is generally recommended that the sample size be set to achieve power of around .8. This is only possible with a total sample of 20 children if the true effect of intervention is enormous – i.e. around 1.2 SD, meaning there would be little overlap between the two groups’ reading scores after intervention. Before doing this study there would have been no reason to anticipate such a massive effect of this intervention, and so use of only 10 participants per group was inadequate. Indeed, in the context of clinical trials, such a study would be rejected by many ethics committees (IRBs) because it would be deemed unethical to recruit participants for a study which had such a small chance of detecting a true effect.
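The power figures above can be checked approximately by simulation. The Python sketch below is an illustration under stated assumptions rather than a reconstruction of any published calculation: it assumes normally distributed scores with equal variance in both groups, a two-tailed independent-samples t-test at the .05 level, and hypothetical function names. A one-tailed test would give somewhat higher power for the same effect size.

```python
import random

N_PER_GROUP = 10  # children per group, as in the study discussed above
T_CRIT = 2.101    # two-tailed 5% critical value of t for 18 df; valid only for 10 per group


def t_statistic(a, b):
    """Student's t for two independent groups of equal size."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    se = ((var_a + var_b) / n) ** 0.5
    return (mean_a - mean_b) / se


def simulated_power(effect_size, n_sims=5000, seed=1):
    """Proportion of simulated trials in which a true effect reaches p < .05.

    effect_size is the true group difference in standard deviation units.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(N_PER_GROUP)]
        if abs(t_statistic(treated, control)) > T_CRIT:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    for d in (0.2, 0.5, 1.2):  # effect sizes mentioned in the text
        print(f"true effect {d} SD: estimated power {simulated_power(d):.2f}")
```

For true effects of .2 to .5 SD, the estimated power should come out close to or below .2, and only very large effects approach the conventional target of .8. The exact numbers depend on the number of simulated trials and on whether a one- or two-tailed test is assumed.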

But, I hear you saying, this study did find a significant effect of intervention, despite being underpowered. So isn’t that all the more convincing? Sadly, the answer is no. As Christley (2010) has demonstrated, positive findings in underpowered studies are particularly likely to be false positives when they are surprising – i.e., when we have no good reason to suppose that there will be a true effect of intervention. This seems particularly pertinent in the case of the Current Biology study – if playing action computer games really does massively enhance children’s reading, we might have expected to see a dramatic improvement in reading levels in the general population in the years since such games became widely available.
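The logic behind this point can be illustrated with a simple application of Bayes' rule. The Python sketch below is a standard textbook calculation, not necessarily the exact computation used by Christley (2010), and the prior probabilities and the function name are hypothetical values chosen purely for illustration.

```python
def positive_predictive_value(power, prior, alpha=0.05):
    """Probability that a significant result reflects a true effect.

    power: chance of detecting the effect if it is real
    prior: assumed probability, before the study, that the effect is real
    alpha: significance threshold (false positive rate when there is no effect)
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for prior in (0.5, 0.1):        # plausible vs surprising hypothesis (assumed values)
        for power in (0.8, 0.2):    # adequately powered vs underpowered study
            ppv = positive_predictive_value(power, prior)
            print(f"prior {prior}, power {power}: P(true effect | p < .05) = {ppv:.2f}")
```

With an assumed prior of .1 and power of .2, fewer than a third of significant results reflect a true effect (.02 / .065, or about .31); with power of .8 the figure rises to .64. The more surprising the hypothesis and the lower the power, the less a significant result should move us.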

The small sample size is not the only problem with the Current Biology study. There are other ways in which it departs from the usual methodological requirements of a clinical trial: it is not clear how the assignment of children to treatments was made or whether assessment was blind to treatment status, no data were provided on drop-outs, on some measures there were substantial differences in the variances of the two groups, no adjustment appears to have been made for the non-normality of some outcome measures, and a follow-up analysis was confined to six children in the intervention group. Finally, neither group showed significant improvement in reading accuracy, where scores remained 2 to 3 SD below the population mean (Tables S1 and S3): the group differences were seen only for measures of reading speed.

Will any damage be done? Probably not much – some false hopes may be raised, but the stakes are not nearly as high as they are for medical trials, where serious harm or even death can result from wrong results. Quite apart from the implications for families of children with reading problems, however, there is another issue here: the publication policies of high-impact journals. These journals wield immense power. It is not overstating the case to say that a person’s career may depend on having a publication in a journal like Current Biology (see Lawrence, 2007 – published, as it happens, in Current Biology!). But, as the dyslexia example illustrates, a home in a high-impact journal is no guarantee of methodological quality. Perhaps this should not surprise us: I looked at the published criteria for papers on the websites of Nature, Science, PNAS and Current Biology. None of them mentioned the need for strong methodology or replicability; all of them emphasised “importance” of the findings.

Methods are not a boring detail to be consigned to a supplement: they are crucial in evaluating research. My fear is that the primary goal of some journals is media coverage, and that science is being reduced to journalism and is suffering as a result.

References

Brembs, B., & Munafò, M. R. (2013). Deep impact: Unintended consequences of journal rank. arXiv:1301.3748.

Christley, R. M. (2010). Power and error: increased risk of false positive results in underpowered studies. The Open Epidemiology Journal, 3, 16-19.

Halpern, S. D., Karlawish, J. T., & Berlin, J. A. (2002). The continuing unethical conduct of underpowered clinical trials. Journal of the American Medical Association, 288(3), 358-362. doi: 10.1001/jama.288.3.358

Lawrence, P. A. (2007). The mismeasurement of science. Current Biology, 17(15), R583-R585. doi: 10.1016/j.cub.2007.06.014

Tressoldi, P., Giofré, D., Sella, F., & Cumming, G. (2013). High Impact = High Statistical Standards? Not Necessarily So. PLoS ONE, 8(2). doi: 10.1371/journal.pone.0056180