Showing posts with label p-hacking. Show all posts

Saturday, 5 March 2016

There is a reproducibility crisis in psychology and we need to act on it


The Müller-Lyer illusion: a highly reproducible effect. The central lines are the same length but the presence of the fins induces a perception that the left-hand line is longer.

The debate about whether psychological research is reproducible is getting heated. In 2015, Brian Nosek and his colleagues in the Open Science Collaboration showed that they could not replicate effects for over 50 per cent of studies published in top journals. Now we have a paper by Dan Gilbert and colleagues saying that this is misleading because Nosek’s study was flawed, and actually psychology is doing fine. More specifically: “Our analysis completely invalidates the pessimistic conclusions that many have drawn from this landmark study.” This has stimulated a set of rapid responses, mostly in the blogosphere. As Jon Sutton memorably tweeted: “I guess it's possible the paper that says the paper that says psychology is a bit shit is a bit shit is a bit shit.”
So now the folks in the media are confused and don’t know what to think.
The bulk of debate has been focused on what exactly we mean by reproducibility in statistical terms. That makes sense because many of the arguments hinge on statistics, but I think that ignores the more basic issue, which is whether psychology has a problem. My view is that we do have a problem, though psychology is no worse than many other disciplines that use inferential statistics.
In my undergraduate degree I learned about stuff that was on the one hand non-trivial and on the other hand solidly reproducible. Take, for instance, various phenomena in short-term memory. Effects such as the serial position effect, the phonological confusability effect, and the superiority of memory for words over nonwords are solid and robust. In perception, we have striking visual effects such as the Müller-Lyer illusion, which demonstrate how our eyes can deceive us. In animal learning, the partial reinforcement effect is solid. In psycholinguistics, the difficulty adults have discriminating sound contrasts that are not distinctive in their native language is solid. In neuropsychology, the dichotic right ear advantage for verbal material is solid. In developmental psychology, it has been shown over and over again that poor readers have deficits in phonological awareness. These are just some of the numerous phenomena studied by psychologists that are reproducible in the sense that most people understand it, i.e. if I were to run an undergraduate practical class to demonstrate the effect, I’d be pretty confident that we’d get it. They are also non-trivial, in that a lay person would not just conclude that the result could have been predicted in advance.
The Reproducibility Project showed that many effects described in contemporary literature are not like that. But was it ever thus? I’d love to see the reproducibility project rerun with psychology studies reported in the literature from the 1970s – have we really got worse, or am I aware of the reproducible work just because that stuff has stood the test of time, while other work is forgotten?
My bet is that things have got worse, and I suspect there are a number of reasons for this:
1. Most of the phenomena I describe above were in areas of psychology where it was usual to report a series of experiments that demonstrated the effect and attempted to gain a better understanding of it by exploring the conditions under which it was obtained. Replication was built in to the process. That is not common in many of the areas where reproducibility of effects is contested.
2. It’s possible that all the low-hanging fruit has been plucked, and we are now focused on much smaller effects – i.e., where the signal of the effect is low in relation to background noise. That’s where statistics assumes importance. Something like the phonological confusability effect in short-term memory or a Müller-Lyer illusion is so strong that it can be readily demonstrated in very small samples. Indeed, abnormal patterns of performance on short-term memory tests can be used diagnostically with individual patients. If you have a small effect, you need much bigger samples to be confident that what you are observing is signal rather than noise. Unfortunately, the field has been slow to appreciate the importance of sample size and many studies are just too underpowered to be convincing.

3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity, rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but to first do adequately powered studies to demonstrate the effect and then conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening, holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.
4. Psychology, unlike many other biomedical disciplines, involves training in statistics. In principle, this is a thoroughly good thing, but in practice it can be a disaster if the psychologist is simply fixated on finding p-values less than .05 – and assumes that any effect associated with such a p-value is true. I’ve blogged about this extensively, so won’t repeat myself here, other than to say that statistical training should involve exploring simulated datasets so that the student starts to appreciate the ease with which low p-values can occur by chance when one has a large number of variables and a flexible approach to data analysis. Virtually all psychologists misunderstand p-values associated with interaction terms in analysis of variance – as I myself did until working with simulated datasets. I think in the past this was not such an issue, simply because it was not so easy to conduct statistical analyses on large datasets – one of my early papers describes how to compare regression coefficients using a pocket calculator, which at the time was an advance on other methods available! If you have to put in hours of work calculating statistics by hand, then you think hard about the analysis you need to do. Currently, you can press a few buttons on a menu and generate a vast array of numbers – which can encourage the researcher to just scan the output and highlight those where p falls below the magic threshold of .05. Those who do this are generally unaware of how problematic this is, in terms of raising the likelihood of false positive findings.
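The kind of simulation exercise mentioned in point 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulations referred to in the post: both groups are drawn from the same normal distribution (so there is no true effect), the variables are independent, and a z-test with a known standard deviation of 1 stands in for the t-test a researcher would normally run. The function names, group size and number of simulated studies are all hypothetical choices.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b):
    """p-value for a difference in means, assuming a known SD of 1 in both groups."""
    n_a, n_b = len(group_a), len(group_b)
    z = (mean(group_a) - mean(group_b)) / (1 / n_a + 1 / n_b) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_of_studies_with_false_positive(
    n_variables, n_per_group=20, n_studies=2000, alpha=0.05, seed=1
):
    """Proportion of null studies in which at least one variable gives p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        for _ in range(n_variables):
            # Both groups come from the same population: any "effect" is noise.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if two_sided_p(a, b) < alpha:
                hits += 1
                break  # one "significant" variable is enough to count the study
    return hits / n_studies


for k in (1, 5, 10):
    share = share_of_studies_with_false_positive(k)
    print(f"{k:2d} variables: {share:.2f} of null studies have at least one p < .05")
```

With one variable the proportion should sit close to .05; as the number of variables scanned goes up, so does the share of studies that contain something to highlight, even though nothing real is there.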
Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognise that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.


Tuesday, 26 January 2016

The Amazing Significo: why researchers need to understand poker

©www.savagechickens.com
Suppose I tell you that I know of a magician, The Amazing Significo, with extraordinary powers. He can undertake to deal you a five-card poker hand which has three cards with the same number.

You open a fresh pack of cards, shuffle the pack and watch him carefully. The Amazing Significo deals you five cards and you find that you do indeed have three of a kind.

According to Wikipedia, the probability of this happening by chance when dealing from an unbiased deck of cards is around 2 per cent - so you are likely to be impressed. You may go public to endorse The Amazing Significo's claim to have supernatural abilities.

But then I tell you that The Amazing Significo has actually dealt five cards to 49 other people that morning, and you are the first one to get three of a kind. Your excitement immediately evaporates: in the context of all the hands he dealt, your result is unsurprising.
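To see why the excitement evaporates, it helps to put a number on it. The sketch below is an illustration, assuming each of the 50 hands is dealt independently from a freshly shuffled 52-card deck; the function names are hypothetical.

```python
from math import comb


def p_three_of_a_kind():
    """Probability of exactly three of a kind in a five-card hand."""
    # Choose the rank of the triple and its three suits, then two other
    # cards of different ranks (so the hand is not a full house or four of a kind).
    hands = 13 * comb(4, 3) * comb(12, 2) * 4 ** 2
    return hands / comb(52, 5)


def p_at_least_one(p_single, n_people):
    """Chance that at least one of n_people gets the hand, assuming independent deals."""
    return 1 - (1 - p_single) ** n_people


p = p_three_of_a_kind()
print(f"one person:        {p:.3f}")
print(f"at least 1 in 50:  {p_at_least_one(p, 50):.3f}")
```

Under these assumptions, a three-of-a-kind hand turning up somewhere among 50 people is more likely than not, so finding one is no evidence of special powers.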

Let's take it a step further and suppose that The Amazing Significo was less precise: he just promised to give you a good poker hand without specifying the kind of cards you would get. You regard your hand as evidence of his powers, but you would have been equally happy with two pairs, a flush, or a full house. The probability of getting any one of those good hands goes up to 7 per cent, so in his sample of 50 people, we'd expect three or four to be very happy with his performance.
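The 7 per cent figure can be checked by counting hands. This is a rough sketch, assuming "good hand" means exactly the four categories named above and that "flush" excludes straight flushes, as in the standard poker probability tables; the dictionary and variable names are just for illustration.

```python
from math import comb

TOTAL_HANDS = comb(52, 5)

# Number of distinct five-card hands in each category.
good_hands = {
    "three of a kind": 13 * comb(4, 3) * comb(12, 2) * 4 ** 2,
    "two pairs": comb(13, 2) * comb(4, 2) ** 2 * 11 * 4,
    "flush": 4 * comb(13, 5) - 40,  # remove the 40 straight flushes
    "full house": 13 * comb(4, 3) * 12 * comb(4, 2),
}

for name, count in good_hands.items():
    print(f"{name:16s} {count / TOTAL_HANDS:.4f}")

p_good = sum(good_hands.values()) / TOTAL_HANDS
print(f"any good hand    {p_good:.4f}")
print(f"expected happy customers out of 50: {50 * p_good:.1f}")
```

The categories do not overlap, so their probabilities can simply be added; the total comes out a little over 7 per cent, or between three and four people in a sample of 50.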

So context is everything. If The Amazing Significo had dealt a hand to just one person and got a three-of-a-kind hand, that would indeed be amazing. If he had dealt hands to 50 people, and predicted in advance which of them would get a good hand, that would also be amazing. But if he dealt hands to 50 people and just claimed that one or two of them would get a good hand without prespecifying which ones it would be - well, he'd be rightly booed off the stage.

When researchers work with probabilities, they tend to see p-values as measures of the size and importance of a finding. However, as The Amazing Significo demonstrates, p-values can only be interpreted in the context of a whole experiment: unless you know about all the comparisons that have been made (corresponding to all the people who were dealt a hand) they are highly misleading.

In recent years, there has been growing interest in the phenomenon of p-hacking - selecting experimental data after doing the statistics to ensure a p-value below the conventional cutoff of .05. It is recognised as one reason for poor reproducibility of scientific findings, and it can take many forms.

I've become interested in one kind of p-hacking, use of what we term 'ghost variables' - variables that are included in a study but not reported unless they give a significant result. In a recent paper (preprint available here), Paul Thompson and I simulated the situation when a researcher has a set of dependent variables, but reports only those with p-values below .05. This would be like The Amazing Significo making a film of his performances in which he cut out all the cases where he dealt a poor hand**. It is easy to get impressive results if you are selective about what you tell people. If you have two groups of people who are equivalent to one another, and you compare them on just one variable, then the chance that you will get a spurious 'significant' difference (p < .05) is 1 in 20. But with eight independent variables, the chance of a false positive 'significant' difference on at least one variable is 1 - .95^8 = .34, i.e. about 1 in 3. (If variables are correlated these figures change: see our paper for more details.)
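The 1-in-20 and 1-in-3 figures come from the same formula, which is easy to tabulate. A minimal sketch, assuming the variables are independent and there is no true group difference on any of them; the function name is hypothetical.

```python
def familywise_error_rate(n_variables, alpha=0.05):
    """Chance of at least one p < alpha among n independent null comparisons."""
    return 1 - (1 - alpha) ** n_variables


for k in (1, 2, 4, 8, 16):
    print(f"{k:2d} variables: {familywise_error_rate(k):.3f}")
# 8 variables gives about 0.337, i.e. roughly 1 in 3
```

With correlated variables the rate rises more slowly than this, so the numbers here are best read as an upper guide rather than an exact figure for any real dataset.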

Quite simply p-values are only interpretable if you have the full context: if you pull out the 'significant' variables and pretend you did not test the others, you will be fooling yourself - and other people - by mistaking chance fluctuations for genuine effects. As we showed with our simulations, it can be extremely difficult to detect this kind of p-hacking, even using statistical methods such as p-curve analysis, which were designed for this purpose. This is why it is so important to either specify statistical tests in advance (akin to predicting which people will get three of a kind), or else adjust p-values for the number of comparisons in exploratory studies*.
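One simple way of adjusting for the number of comparisons is the Bonferroni correction, sketched below. The post does not say which adjustment to use, so this is just one common option, and the p-values in the example are made up for illustration.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical p-values from an exploratory study with eight variables.
raw = [0.004, 0.03, 0.04, 0.20, 0.35, 0.50, 0.71, 0.90]

for p_raw, p_adj in zip(raw, bonferroni_adjust(raw)):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"raw p = {p_raw:.3f}  adjusted p = {p_adj:.3f}  {verdict}")
```

The key point is that the adjustment needs the full list of tests that were run: if the 'ghost variables' are left out, the number of tests is understated and the correction is too lenient.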

Unfortunately, there are many trained scientists who just don't understand this. They see a 'significant' p-value in a set of data and think it has to be meaningful. Anyone who suggests that they need to correct p-values to take into account the number of statistical tests - be they correlations in a correlation matrix, coefficients in a regression equation, or factors and interactions in Analysis of Variance - is seen as a pedantic killjoy (see also Cramer et al, 2015). The p-value is seen as a property of the variable it is attached to, and the idea that it might change completely if the experiment were repeated is hard for them to grasp.

This mass delusion can even extend to journal editors, as was illustrated recently by the COMPare project, the brainchild of Ben Goldacre and colleagues. This involves checking whether the variables reported in medical studies correspond to the ones that the researchers had specified before the study was done and informing journal editors when this was not the case. There's a great account of the project by Tom Chivers in this Buzzfeed article, which I'll let you read for yourself. The bottom line is that the editors of the Annals of Internal Medicine appear to be people who would be unduly impressed by The Amazing Significo because they don't understand what Geoff Cumming has called 'the dance of the p-values'.



*I am ignoring Bayesian approaches here, which no doubt will annoy the Bayesians


**PS. 27th Jan 2016. Marcus Munafo has drawn my attention to a film by Derren Brown called 'the System' which pretty much did exactly this! http://www.secrets-explained.com/derren-brown/the-system