Saturday, 5 March 2016

There is a reproducibility crisis in psychology and we need to act on it


The Müller-Lyer illusion: a highly reproducible effect. The central lines are the same length but the presence of the fins induces a perception that the left-hand line is longer.

The debate about whether psychological research is reproducible is getting heated. In 2015, Brian Nosek and his colleagues in the Open Science Collaboration showed that they could not replicate effects for over 50 per cent of studies published in top journals. Now we have a paper by Dan Gilbert and colleagues saying that this is misleading because Nosek’s study was flawed, and actually psychology is doing fine. More specifically: “Our analysis completely invalidates the pessimistic conclusions that many have drawn from this landmark study.” This has stimulated a set of rapid responses, mostly in the blogosphere. As Jon Sutton memorably tweeted: “I guess it's possible the paper that says the paper that says psychology is a bit shit is a bit shit is a bit shit.”
So now the folks in the media are confused and don’t know what to think.
The bulk of debate has been focused on what exactly we mean by reproducibility in statistical terms. That makes sense because many of the arguments hinge on statistics, but I think that ignores the more basic issue, which is whether psychology has a problem. My view is that we do have a problem, though psychology is no worse than many other disciplines that use inferential statistics.
In my undergraduate degree I learned about stuff that was on the one hand non-trivial and on the other hand solidly reproducible. Take, for instance, various phenomena in short-term memory. Effects like the serial position effect, the phonological confusability effect, and the superiority of memory for words over nonwords are solid and robust. In perception, we have striking visual effects such as the Müller-Lyer illusion, which demonstrate how our eyes can deceive us. In animal learning, the partial reinforcement effect is solid. In psycholinguistics, the difficulty adults have discriminating sound contrasts that are not distinctive in their native language is solid. In neuropsychology, the dichotic right ear advantage for verbal material is solid. In developmental psychology, it has been shown over and over again that poor readers have deficits in phonological awareness. These are just some of the numerous phenomena studied by psychologists that are reproducible in the sense that most people understand it, i.e. if I were to run an undergraduate practical class to demonstrate the effect, I’d be pretty confident that we’d get it. They are also non-trivial, in that a lay person would not simply dismiss the result as something that could have been predicted in advance.
The Reproducibility Project showed that many effects described in contemporary literature are not like that. But was it ever thus? I’d love to see the reproducibility project rerun with psychology studies reported in the literature from the 1970s – have we really got worse, or am I aware of the reproducible work just because that stuff has stood the test of time, while other work is forgotten?
My bet is that things have got worse, and I suspect there are a number of reasons for this:
1. Most of the phenomena I describe above were in areas of psychology where it was usual to report a series of experiments that demonstrated the effect and attempted to gain a better understanding of it by exploring the conditions under which it was obtained. Replication was built into the process. That is not common in many of the areas where reproducibility of effects is contested.
2. It’s possible that all the low-hanging fruit has been plucked, and we are now focused on much smaller effects – i.e., where the signal of the effect is low in relation to background noise. That’s where statistics assumes importance. Something like the phonological confusability effect in short-term memory or the Müller-Lyer illusion is so strong that it can be readily demonstrated in very small samples. Indeed, abnormal patterns of performance on short-term memory tests can be used diagnostically with individual patients. If you have a small effect, you need much bigger samples to be confident that what you are observing is signal rather than noise. Unfortunately, the field has been slow to appreciate the importance of sample size and many studies are just too underpowered to be convincing.

3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity, rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but to first do adequately powered studies to demonstrate the effect and then conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening, holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.
4. Psychology, unlike many other biomedical disciplines, involves training in statistics. In principle, this is a thoroughly good thing, but in practice it can be a disaster if the psychologist is simply fixated on finding p-values less than .05 – and assumes that any effect associated with such a p-value is true. I’ve blogged about this extensively, so won’t repeat myself here, other than to say that statistical training should involve exploring simulated datasets so that the student starts to appreciate the ease with which low p-values can occur by chance when one has a large number of variables and a flexible approach to data analysis. Virtually all psychologists misunderstand p-values associated with interaction terms in analysis of variance – as I myself did until working with simulated datasets. I think in the past this was not such an issue, simply because it was not so easy to conduct statistical analyses on large datasets – one of my early papers describes how to compare regression coefficients using a pocket calculator, which at the time was an advance on other methods available! If you have to put in hours of work calculating statistics by hand, then you think hard about the analysis you need to do. Currently, you can press a few buttons on a menu and generate a vast array of numbers – which can encourage the researcher to just scan the output and highlight those where p falls below the magic threshold of .05. Those who do this are generally unaware of how problematic this is, in terms of raising the likelihood of false positive findings.
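The sample-size argument in point 2 can be made concrete with a quick simulation – my own sketch, not from the post, with an illustrative effect size of d = 0.3 and arbitrary sample sizes. With a small true effect, small samples detect it only a fraction of the time, while large samples detect it reliably.

```python
# Illustrative power simulation (assumed parameters, not from the post):
# a "small" true effect of d = 0.3 is tested in many simulated two-group
# experiments at various sample sizes, and we count how often the result
# reaches the conventional .05 significance threshold.
import numpy as np

rng = np.random.default_rng(42)
EFFECT = 0.3  # standardised mean difference between the two groups

def power(n_per_group, n_sims=2000):
    """Proportion of simulated experiments where the effect is 'significant'."""
    a = rng.normal(0.0, 1.0, (n_sims, n_per_group))
    b = rng.normal(EFFECT, 1.0, (n_sims, n_per_group))
    diff = b.mean(axis=1) - a.mean(axis=1)
    # pooled-variance two-sample t statistic, one per simulated experiment
    sp = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
    t = diff / (sp * np.sqrt(2 / n_per_group))
    # normal approximation to the two-tailed .05 criterion
    return np.mean(np.abs(t) > 1.96)

for n in (20, 50, 200):
    print(f"n = {n:3d} per group: power ~ {power(n):.2f}")
```

With these assumed numbers, only around one in six of the small-sample studies detects the genuine effect, so a literature built on such studies is dominated by noise.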
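The simulated-dataset exercise recommended in point 4 can be sketched like this (my illustration; the group sizes and number of variables are arbitrary): even when there are no real effects at all, a dataset with many outcome variables will usually contain at least one "significant" group difference if every variable is tested at p < .05.

```python
# Pure-noise simulation (assumed, illustrative parameters): two groups,
# 20 outcome variables, no true differences anywhere. We count how often
# at least one variable crosses the nominal .05 significance threshold.
import numpy as np

rng = np.random.default_rng(1)
N_SIMS, N_PER_GROUP, N_VARS = 2000, 30, 20

datasets_with_hit = 0
for _ in range(N_SIMS):
    a = rng.normal(size=(N_PER_GROUP, N_VARS))  # group 1: pure noise
    b = rng.normal(size=(N_PER_GROUP, N_VARS))  # group 2: pure noise
    diff = b.mean(axis=0) - a.mean(axis=0)
    sp = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    t = diff / (sp * np.sqrt(2 / N_PER_GROUP))
    if np.any(np.abs(t) > 2.0):  # approx. two-tailed .05 criterion, df = 58
        datasets_with_hit += 1

rate = datasets_with_hit / N_SIMS
print(f"'Significant' result in {rate:.0%} of pure-noise datasets "
      "(nominal rate for any single test: 5%)")
```

The nominal 5 per cent error rate applies per test; scan twenty tests and a spurious "finding" becomes the norm rather than the exception, which is exactly what an output-scanning researcher will report.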
Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected that they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognise that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.


Wednesday, 2 March 2016

On the need for clarity of purpose in the REF and TEF

©CartoonStock.com


The UK’s Research Excellence Framework (REF) has come in for a lot of criticism. It is now under review by a panel chaired by Nicholas Stern, with a call for evidence that closes later this month. At the same time, we have a Green Paper setting out plans for a Teaching Excellence Framework (TEF). This is motivated in part by the view that the attention given to research and teaching has got out of balance. REF has provided universities with strong incentives to put resources into research, and teaching has consequently been neglected, goes the argument (though see here). So what do we need to even things up? A TEF.

The problem for both REF and TEF is that, at the end of the day, they aim for a single scale on which universities can be rank ordered so we can compare quality. But everyone agrees that the things we are measuring, research and teaching excellence, are complex and multifactorial.
There are basically two ways forward. Option A is to use some kind of proxy measure, recognising its limitations but taking the view that it is good enough for purpose. Option B involves trying to measure the complex multifactorial construct in all its richness.
There are a number of factors that influence choice of approach. Because everyone recognises that things are complex, Option A is unlikely to be acceptable to the academic community. Simple measures are often easy to game. On the other hand, the complex multifactorial measures of Option B can be debated endlessly, often involve elements of subjective judgement, are not immune to gaming, can be extremely expensive to administer, and can be hard to integrate into a single ranking.
As James Wilsdon has noted with regard to the REF, before deciding which system of measurement to use, we have to have a clear idea of what we are trying to achieve. As far as the REF goes, its purpose has changed and mutated over the years. It started out with a pretty simple goal: to find a formula to determine allocation of quality-related (QR) funding from central government to universities. However, as Wilsdon notes, it has subsequently been used for four additional purposes: to demonstrate accountability, to provide a measure of reputation, to influence research culture, and as a tool within universities for managing academics. He notes that: “If all we want from the REF is a QR allocation tool, then we can certainly do that in an algorithmic, metric-based way” (i.e. Option A). But he argues the REF needs to fulfil the other functions too, and, as was amply demonstrated in his report, The Metric Tide, for those other purposes a simple metrics-based system is inadequate.
I agree with much of what Wilsdon says, but I think we could save ourselves a lot of trouble by reverting to the original purpose of the REF, i.e. treat it purely as a mechanism for allocating funding. As I have argued previously, if that is all you want to do, then you don’t even need to bother with metrics of the kind discussed in his report. A simple measure of the number of active researchers present in a department gives a remarkably high correlation with the amount of QR funding received – and this works well for most subjects in arts and humanities as well as sciences.
But what about gaming? When I proposed this idea a couple of years ago, people said, wouldn’t universities just designate the departmental cleaner as an active researcher, or take on more research staff? I don’t see these problems as insuperable. It would be important to specify stringent criteria for research staff to meet: these would include terms of employment (casual staff would be excluded), as well as evidence of research activity. If one counted only those staff who had been employed at the institution for some minimum period, such as 3-4 years, this should prevent institutions catapulting in overseas researchers on Mickey Mouse contracts, or taking on short-term staff to give a temporary blip in researcher numbers.
A more serious objection to my proposal is that there is no explicit measure of research quality – an institution could take on a large number of weak researchers and look as good as a competitor with an equal number of excellent researchers. But would this happen? Remember, researchers would need to be on the institutional payroll for a period of 3-4 years prior to the evaluation, so the institution would need to commit to the expense of employing them. This would not be worthwhile if staff then failed to meet the criteria set for research-active staff. Academics who did not count as active researchers would end up being a net cost to the institution.
I’m not saying that it would be easy to fine-tune such a system to avoid gaming or unintended consequences, just that it could be done, and I suspect would be much less difficult than devising an entirely separate system for evaluating research quality.
My case falls apart if, like Wilsdon (and many other people who have been involved in REF) you think REF should fulfil additional purposes. Then, because no one measure is suitable for all purposes, you need something much more complicated. But I do agree with Wilsdon that, if that’s what you want, you need to be clear about it – and about the need for a diverse set of measures appropriate to different goals.
What about TEF? Well, when you dig beneath the surface, you find that the parallels between REF and TEF are purely superficial. The purpose of TEF is not to allocate funding – there is no funding to allocate. The stated purposes are as complex and multifactorial as the notion of teaching excellence itself: to help students select courses, to increase access of under-represented groups to higher education, to provide a basis for allowing universities to raise fees, and to provide criteria for ‘new entrants’ (i.e. private institutions) that wish to enter the higher education market. According to a recent BIS Select Committee report, it’s also intended to provide incentives: “to ensure that higher education institutions meet student expectations and improve on their leading international position.” Quite what it means to improve on a leading international position is not specified.
In attempting to develop a measure that will cover all these functions, those promoting the TEF have tied themselves in knots, as illustrated by this wonderfully circular statement from the same Select Committee report:
In the absence of any agreed definition or recognised measures of teaching quality, the Government is proposing to use measures, or metrics, as proxies for teaching quality. Therefore the challenge is to identify those metrics which most reliably and accurately measure teaching quality, as opposed to other factors that contribute to the results achieved by students.
This is worrying. The only positive thing one can say is that there are signs that government may be starting to recognise some of the problems. The Select Committee report cautions against rushing into a TEF, and notes reservations both about the measures proposed and the proposed link between TEF and fee-raising powers. The report concludes by encouraging academics to work with BIS to develop appropriate metrics for TEF – the impression is that government is aware that if they get it wrong then universities may just decide not to play ball. One of the members of the Select Committee, Amanda Milling, wrote in the Times Higher that “the higher education sector has a responsibility to engage with TEF to make it work.”
But do we? I would argue that the responsibility lies with the Minister, to make a proper case for the TEF.
As the Select Committee report points out: “It is important to note the high quality of teaching generally available in our higher education system at present… The debate around teaching excellence should therefore be viewed within the context of enhancing an already excellent system or, as the Minister for Universities and Science put it, ‘to continue to make a great sector greater still’”. These weasel words mean that if universities resist TEF, they can be accused of complacency. But where’s the evidence that TEF will ‘make a great sector greater still’? A considerable amount of time and money will be sucked up by this exercise, which has multiple confused aims and has potential to tie up a great sector in pointless bureaucracy and waffle. The whole idea is seriously misconceived and has been rushed through without adequate justification or cost-benefit analysis.
We are now being told that TEF will be introduced by degrees, with measures being developed over time, but I am not reassured. If the government wants academics on side, it needs to demonstrate more coherent arguments, with clear specification of the goals of the TEF, and evidence of validity of the measures it proposes to achieve those goals. And most of all, it needs to show us that more good than harm will result from this exercise.