
Saturday, 5 March 2016

There is a reproducibility crisis in psychology and we need to act on it


The Müller-Lyer illusion: a highly reproducible effect. The central lines are the same length but the presence of the fins induces a perception that the left-hand line is longer.

The debate about whether psychological research is reproducible is getting heated. In 2015, Brian Nosek and his colleagues in the Open Science Collaboration showed that they could not replicate effects for over 50 per cent of studies published in top journals. Now we have a paper by Dan Gilbert and colleagues saying that this is misleading because Nosek’s study was flawed, and actually psychology is doing fine. More specifically: “Our analysis completely invalidates the pessimistic conclusions that many have drawn from this landmark study.” This has stimulated a set of rapid responses, mostly in the blogosphere. As Jon Sutton memorably tweeted: “I guess it's possible the paper that says the paper that says psychology is a bit shit is a bit shit is a bit shit.”
So now the folks in the media are confused and don’t know what to think.
The bulk of debate has been focused on what exactly we mean by reproducibility in statistical terms. That makes sense because many of the arguments hinge on statistics, but I think that ignores the more basic issue, which is whether psychology has a problem. My view is that we do have a problem, though psychology is no worse than many other disciplines that use inferential statistics.
In my undergraduate degree I learned about stuff that was on the one hand non-trivial and on the other hand solidly reproducible. Take, for instance, various phenomena in short-term memory. Effects like the serial position effect, the phonological confusability effect, and the superiority of memory for words over nonwords are solid and robust. In perception, we have striking visual effects such as the Müller-Lyer illusion, which demonstrate how our eyes can deceive us. In animal learning, the partial reinforcement effect is solid. In psycholinguistics, the difficulty adults have discriminating sound contrasts that are not distinctive in their native language is solid. In neuropsychology, the dichotic right ear advantage for verbal material is solid. In developmental psychology, it has been shown over and over again that poor readers have deficits in phonological awareness. These are just some of the numerous phenomena studied by psychologists that are reproducible in the sense that most people understand it, i.e. if I were to run an undergraduate practical class to demonstrate the effect, I’d be pretty confident that we’d get it. They are also non-trivial, in that a lay person could not have predicted the results in advance.
The Reproducibility Project showed that many effects described in contemporary literature are not like that. But was it ever thus? I’d love to see the reproducibility project rerun with psychology studies reported in the literature from the 1970s – have we really got worse, or am I aware of the reproducible work just because that stuff has stood the test of time, while other work is forgotten?
My bet is that things have got worse, and I suspect there are a number of reasons for this:
1. Most of the phenomena I describe above were in areas of psychology where it was usual to report a series of experiments that demonstrated the effect and attempted to gain a better understanding of it by exploring the conditions under which it was obtained. Replication was built into the process. That is not common in many of the areas where reproducibility of effects is contested.
2. It’s possible that all the low-hanging fruit has been plucked, and we are now focused on much smaller effects – i.e., where the signal of the effect is low in relation to background noise. That’s where statistics assumes importance. Something like the phonological confusability effect in short-term memory or a Müller-Lyer illusion is so strong that it can be readily demonstrated in very small samples. Indeed, abnormal patterns of performance on short-term memory tests can be used diagnostically with individual patients. If you have a small effect, you need much bigger samples to be confident that what you are observing is signal rather than noise. Unfortunately, the field has been slow to appreciate the importance of sample size and many studies are just too underpowered to be convincing.
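To make the sample-size point concrete, here is a minimal simulation (my own sketch, not from the post itself) comparing a large, illusion-sized effect with a small one. It uses a normal-approximation z-test rather than a full t-test, which is rough for small samples but fine for illustration:

```python
import random
import statistics

def power_estimate(effect, n, sims=2000, seed=1):
    """Estimated power to detect a true mean difference of `effect`
    (in SD units) between two groups of size n, using a
    normal-approximation z-test at the .05 level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(effect, 1) for _ in range(n)]
        se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
        if abs((statistics.mean(b) - statistics.mean(a)) / se) > 1.96:
            hits += 1
    return hits / sims

# A big, illusion-sized effect is detectable in tiny samples;
# a small effect needs an order of magnitude more participants
for d in (1.2, 0.2):
    for n in (10, 50, 200):
        print(f"d = {d}, n per group = {n}: power ~ {power_estimate(d, n):.2f}")
```

With d = 1.2 the effect is detected almost every time even with 10 per group; with d = 0.2, even 200 per group gives only around a coin-flip chance of a significant result.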

3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity, rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but to first do adequately powered studies to demonstrate the effect and then conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.
4. Psychology, unlike many other biomedical disciplines, involves training in statistics. In principle, this is a thoroughly good thing, but in practice it can be a disaster if the psychologist is simply fixated on finding p-values less than .05 – and assumes that any effect associated with such a p-value is true. I’ve blogged about this extensively, so won’t repeat myself here, other than to say that statistical training should involve exploring simulated datasets so that the student starts to appreciate the ease with which low p-values can occur by chance when one has a large number of variables and a flexible approach to data analysis. Virtually all psychologists misunderstand p-values associated with interaction terms in analysis of variance – as I myself did until working with simulated datasets. I think in the past this was not such an issue, simply because it was not so easy to conduct statistical analyses on large datasets – one of my early papers describes how to compare regression coefficients using a pocket calculator, which at the time was an advance on other methods available! If you have to put in hours of work calculating statistics by hand, then you think hard about the analysis you need to do. Currently, you can press a few buttons on a menu and generate a vast array of numbers – which can encourage the researcher to just scan the output and highlight those where p falls below the magic threshold of .05. Those who do this are generally unaware of how problematic this is, in terms of raising the likelihood of false positive findings.
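The kind of simulated-dataset exercise described above can be sketched in a few lines: correlate twenty pure-noise variables with a pure-noise outcome and count the 'significant' hits. (The Fisher-z p-value here is an approximation, and all names are mine.)

```python
import math
import random

def corr_p(x, y):
    """Pearson r with a two-tailed p-value from the Fisher-z
    normal approximation (fine for illustration, not for papers)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return r, math.erfc(abs(z) / math.sqrt(2))

def count_spurious(seed, n_vars=20, n=50):
    """Correlate `n_vars` pure-noise variables with a pure-noise
    outcome and count how many come out 'significant' at p < .05."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n)]
    hits = 0
    for _ in range(n_vars):
        var = [rng.gauss(0, 1) for _ in range(n)]
        r, p = corr_p(var, outcome)
        if p < 0.05:
            hits += 1
    return hits

# With 20 noise variables we expect about one spurious 'finding' per dataset
rates = [count_spurious(seed) for seed in range(50)]
print(f"Spurious hits per 20-variable dataset: mean = {sum(rates) / 50:.2f}")
```

On average about one of the twenty noise variables comes out 'significant' per dataset – exactly what a .05 threshold promises, and exactly what a scan-the-output approach will mistake for a finding.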
Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognise that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.


Friday, 26 July 2013

Why we need pre-registration


There has been a chorus of disapproval this week at the suggestion that researchers should 'pre-register' their studies with journals and spell out in advance the methods and analyses that they plan to do. Those who wish to follow the debate should look at this critique by Sophie Scott, with associated comments, and the responses to it collated here by Pete Etchells. They should also read the explanation of the pre-registration proposals and FAQ by Chris Chambers - something that many participants in the debate appear not to have done.

Quite simply, pre-registration is designed to tackle two problems in scientific publishing:
  • Bias against publication of null results
  • A failure to distinguish hypothesis-generating (exploratory) from hypothesis-testing analyses
Either of these alone is bad for science: the combined effect of both of them is catastrophic, and has led to a situation where research is failing to do its job in terms of providing credible answers to scientific questions.

Null results

Let's start with the bias against null results. Much has been written about this, including by me. But the heavy guns in the argument have been wielded by Ben Goldacre, who has pointed out that, in the clinical trials field, if we only see the positive findings, then we get a completely distorted view of what works, and as a result, people may die. In my field of psychology, the stakes are not normally as high, but the fact remains that there can be massive distortion in our perception of evidence.

Pre-registration would fix this by guaranteeing publication of a paper regardless of how the results turn out. In fact, there is another, less bureaucratic, way the null result problem could be fixed, and that would be by having reviewers decide on a paper's publishability solely on the basis of the introduction and methods. But that would not fix the second problem.

Blurring the boundaries between exploratory and hypothesis-testing analyses

A big problem is that nearly all data analysis is presented as if it is hypothesis-testing when in fact much of it is exploratory.

In an exploratory analysis, you take a dataset and look at it flexibly to see what's there. Like many scientists, I love exploratory analyses, because you don't know what you will find, and it can be important and exciting. I suspect it is also something that you get better at as you get more experienced, and more able to see the possibilities in the numbers. But my love of exploratory analyses is coupled with a nervousness. With an exploratory analysis, whatever you find, you can never be sure it wasn't just a chance result. Perhaps I was lucky in having this brought home to me early in my career, when I had an alphabetically ordered list of stroke patients I was planning to study, and I happened to notice that those with names in the first half of the alphabet had left hemisphere lesions and those with names in the second half had right hemisphere lesions. I even did a chi-square test and found it was highly significant. Clearly this was nonsense, and just one of those spurious things that can turn up by chance.

These days it is easy to see how often meaningless 'significant' results occur by running analyses on simulated data - see this blogpost for instance. In my view, all statistics classes should include such exercises.
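As a flavour of what such a class exercise might look like, here is a sketch mirroring the alphabet example above: repeatedly assign simulated 'patients' a random alphabet half and a random lesion side, and count how often a chi-square test calls the association significant. (The erfc identity gives the 1-df chi-square tail; the test itself is the usual large-sample approximation, with no continuity correction, and the numbers are illustrative.)

```python
import math
import random

def chi2_p_2x2(a, b, c, d):
    """Two-tailed p for a 2x2 chi-square (1 df), using the identity
    P(chi2_1 > x) = erfc(sqrt(x / 2)); no continuity correction."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        return 1.0
    x = n * (a * d - b * c) ** 2 / den
    return math.erfc(math.sqrt(x / 2))

def spurious_rate(sims=5000, n=40, seed=0):
    """Randomly assign n 'patients' an alphabet half and a lesion side,
    then see how often the association is 'significant' at p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        cells = [0, 0, 0, 0]
        for _ in range(n):
            cells[2 * rng.randint(0, 1) + rng.randint(0, 1)] += 1
        if chi2_p_2x2(*cells) < 0.05:
            hits += 1
    return hits / sims

# Roughly one in twenty such purely random splits looks 'significant'
print(f"False positive rate: {spurious_rate():.3f}")
```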

So you've done your exploratory analysis, got an exciting finding, but are nervous as to whether it is real. What do you do? The answer is you need a confirmatory study. In the field of genetics, failure to realise this led to several years of stasis, cogently described by Flint et al (2010). Genetics really highlights the problem, because of the huge numbers of possible analyses that can be conducted. What was quickly learned was that most exciting effects don't replicate. The bar has accordingly been set much higher, and most genetics journals won't consider publishing a genetic association unless replication has been demonstrated (Munafo & Flint, 2011). This is tough, but it has meant that we can now place confidence in genetics results. (It also has had a positive side-effect of encouraging more collaboration between research groups). Unfortunately, those outside the field of genetics are unaware of these developments, and we are seeing increasing numbers of genetic association studies being published in the neuroscience literature, with tiny samples and no replication.

The important point to grasp is that the meaning of a p-value is completely different if it emerges when testing an a priori prediction, compared with when it is found in the course of conducting numerous analyses of a dataset. Here, for instance, are outputs from 15 runs of a 4-way ANOVA on random data, as described here:
[Figure: each row shows the p-values (main effects, then interactions) for one run of a 4-way ANOVA on a new set of random data.]

If I approached a dataset specifically testing the hypothesis that there would be an interaction between group and task, then the chance of a p-value of .05 or less would be 1 in 20 (as can be confirmed by repeating the simulation thousands of times - with only a small number of runs it is less easy to see). But if I just looked for significant findings, it's not hard to find something on most of these runs. An exploratory analysis is not without value, but its value is in generating hypotheses that can then be tested in an a priori design.
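The arithmetic behind 'something on most of these runs' can be checked directly. A 4-way ANOVA produces 15 effect terms, and treating them as 15 independent tests (a simplification - in a real ANOVA the terms share an error term) gives:

```python
import random

# A 4-way ANOVA yields 15 effect terms: 4 main effects, 6 two-way,
# 4 three-way and 1 four-way interaction. Under the null hypothesis
# each p-value is roughly uniform on [0, 1], so (treating the tests
# as independent) the chance that at least one term is 'significant':
analytic = 1 - 0.95 ** 15
print(f"Analytic family-wise rate: {analytic:.3f}")  # about 0.54

# The same by simulation: draw 15 uniform p-values per 'experiment'
rng = random.Random(0)
sims = 20000
hits = sum(any(rng.random() < 0.05 for _ in range(15)) for _ in range(sims))
print(f"Simulated family-wise rate: {hits / sims:.3f}")
```

So even with random data, more than half of all such runs will contain at least one 'significant' effect.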

So replication is needed to deal with the uncertainties around exploratory analysis. How does pre-registration fit into the picture? Quite simply, it makes explicit the distinction between hypothesis-generating (exploratory) and hypothesis-testing research, which is currently completely blurred. As in the example above, if you tell me in advance what hypothesis you are testing, then I can place confidence in the uncorrected statistical probabilities associated with the predicted effects. If you haven't predicted anything in advance, then I can't.

This doesn't mean that the results from exploratory analyses are necessarily uninteresting, untrue, or unpublishable, but it does mean we should interpret them as what they are: hypothesis-generating rather than hypothesis-testing.

I'm not surprised at the outcry against pre-registration. This is mega. It would require most of us to change our behaviour radically. It would turn on its head the criteria used to evaluate findings: well-conducted replication studies, currently often unpublishable, would be seen as important, regardless of their results. On the other hand, it would no longer be possible to report exploratory analyses as if they are hypothesis-testing. In my view, unless we do this we will continue to waste time and precious research funding chasing illusory truths.

References

Flint, J., Greenspan, R. J., & Kendler, K. S. (2010). How Genes Influence Behavior. Oxford: Oxford University Press.

Munafo, M., & Flint, J. (2011). Dissecting the genetic architecture of human personality. Trends in Cognitive Sciences, 15(9), 395-400. doi: 10.1016/j.tics.2011.07.007

Friday, 11 January 2013

Genetic variation and neuroimaging: some ground rules for reporting research



Those who follow me on Twitter may have noticed signs of tetchiness in my tweets over the past few weeks. In the course of writing a review article, I’ve been reading papers linking genetic variants to language-related brain structure and function. This has gone more slowly than I expected for two reasons. First, the literature gets ever more complicated and technical: both genetics and brain imaging involve huge amounts of data, and new methods for crunching the numbers are developed all the time. If you really want to understand a paper, rather than just assuming the Abstract is accurate, it can be a long, hard slog, especially if, like me, you are neither a geneticist nor a neuroimager. That’s understandable and perhaps unavoidable.

The other reason, though, is less acceptable. For all their complicated methods, many of the papers in this area fail to tell the reader some important and quite basic information. This is where the tetchiness comes in. Having burned my brains out trying to understand what was done, I then realise that I have no idea about something quite basic like the sample size. The initial assumption is that I’ve missed it, and so I wade through the paper again, and the Supplementary Material, looking for the key information. Only when I’m absolutely certain that it’s not there am I reduced to writing to the authors for the information.

So this is a plea – to authors, editors and reviewers. If a paper is concerned with an association between a genetic variant and a phenotype (in my case the interest is in neural phenotypes, but I suspect this applies more widely), then could we please ensure that the following information is clearly reported in the Methods or Results section:

1. What genetic variant are we talking about? You might think this is very simple, but it’s not: for instance, one of the genes I’m interested in is CNTNAP2, which has been associated with a range of neurodevelopmental disorders, especially those affecting language. The evidence for a link between CNTNAP2 and developmental disorders comes from studies that have examined variation in single-nucleotide polymorphisms or SNPs. These are sites in the DNA where a single base varies from person to person; they are useful in revealing differences between people precisely because they are highly variable. DNA is composed of four bases, C, T, G, and A, in paired strands. So for instance, we might have a locus where some people have two copies of C, some have two copies of T, and others have a C and a T. SNPs are not necessarily a functional part of the gene itself – they may be in a non-coding region, or so close to a gene that variation in the SNP co-occurs with variation in the gene. Many different SNPs can index the same gene. So for CNTNAP2, Vernes et al (2008) tested 38 SNPs, ten of which were linked to language problems. So we have to decide which SNP to study – or whether to study all of them. And we have to decide how to do the analysis. For instance, SNP rs2710102 can take the form CC, CT or TT. We could look for a dose-response effect (CC < CT < TT), or we could compare CC/CT with TT, or we could compare CC with CT/TT. Which of these we do may depend on whether prior research suggests the genetic effect is additive or dominant, but for brain imaging studies grouping can also be dictated by practical considerations: it’s usual to compare just two groups and to combine genotypes to give a reasonable sample size. If you’ve followed me so far, and you have some background in statistics, you will already be starting to see why this is potentially problematic. If the researcher can select from ten possible SNPs, and two possible analyses, the opportunities for finding spuriously ‘significant’ results are increased.
If there are no directional predictions – i.e. we are just looking for a difference between two groups, but don’t have a clear idea of what type of difference will be associated with ‘risk’ – then the number of potentially ‘interesting’ results is doubled.
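Under the simplifying assumption that these candidate analyses behave like independent tests at the .05 level, the inflation is easy to quantify:

```python
# Each analytic choice multiplies the chance of a spurious hit.
# Assuming (for illustration) independent tests at alpha = .05:
alpha = 0.05
for k, label in [(1, "1 pre-specified test"),
                 (10, "10 SNPs"),
                 (20, "10 SNPs x 2 genotype groupings"),
                 (40, "... x 2 directions of 'risk'")]:
    p_any = 1 - (1 - alpha) ** k
    print(f"{label:35s} P(at least one p < .05) = {p_any:.2f}")
```

Ten SNPs, two grouping schemes and two directions of effect already push the chance of at least one spurious 'hit' towards 90 per cent.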
For CNTNAP2, I found two papers that had looked at brain correlates of SNP rs2710102. Whalley et al (2011) found that adults with the CC genotype had different patterns of brain activation from CT/TT individuals. However, the other study, by Scott-van Zeeland et al (2010), treated CC/CT as a risk genotype that was compared with TT. (This was not clear in the paper, but the authors confirmed it was what they did).
Four studies looked at another SNP - rs7794745, on the basis that an increased risk of autism had been reported for the T allele in males. Two of them (Tan et al, 2010; Whalley et al, 2011) compared TT vs TA/AA and two (Folia et al, 2011; Kos et al, 2012) compared TT/TA with AA. In any case, the ground is rather cut from under the feet of these researchers by a recent failure to replicate an association of this SNP with autism (Anney et al, 2012).

2. Who are the participants? It’s not very informative to just say you studied “healthy volunteers”. There are some types of study where it doesn’t much matter how you recruited people. A study looking at genetic correlates of cognitive ability isn’t one of them. Samples of university students, for instance, are not representative of the general population, and aren’t likely to include many people with significant language problems.

3. How many people in the study had each type of genetic variant? And if subgroup analyses are reported, how many people in each subgroup had each type of genetic variant? I've found that papers in top-notch journals often fail to provide this basic information.
Why is this important? For a start, the likelihood of showing significant activation of a brain region will be affected by sample size. Suppose you have 24 people with genotype A and 8 with genotype B. You find significant activation of brain region X in those with genotype A, but not for those with genotype B. If you don’t do an explicit statistical comparison of the groups (you should - but many people don’t), you may be misled into concluding that brain activation is defective in genotype B – when in fact you just have low power to detect effects in that group because it is so small.
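A quick simulation makes the asymmetry vivid. Assume the same true effect in both genotype groups, test each group separately, and see how often each reaches significance. (A rough normal-approximation test stands in for a proper t-test here, and the effect size and group sizes are illustrative.)

```python
import random
import statistics

def detection_rate(n, effect=0.6, sims=3000, seed=7):
    """How often a one-sample test detects a true activation of
    `effect` (in SD units) in a group of size n. Uses a normal
    approximation (|z| > 1.96), which is rough for small n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        sample = [rng.gauss(effect, 1) for _ in range(n)]
        se = statistics.stdev(sample) / n ** 0.5
        if abs(statistics.mean(sample) / se) > 1.96:
            hits += 1
    return hits / sims

# Same true effect in both genotype groups - only the sample sizes differ
print(f"Genotype A (n=24): detected in {detection_rate(24):.0%} of studies")
print(f"Genotype B (n=8):  detected in {detection_rate(8):.0%} of studies")
```

The group of 24 detects the effect most of the time; the group of 8 misses it roughly half the time – with no difference at all in the underlying effect.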
In addition, if you don’t report the N, then it’s difficult to get an idea of the effect size and confidence interval for any effect that is reported. The reasons why effect sizes and confidence intervals are preferable to bare p-values are well-articulated here. This issue has been much discussed in psychology, but seems not to have permeated the field of genetics, where reliance on p-values seems the norm. In neuroimaging it gets particularly complicated, because some form of correction for ‘false discovery’ will be applied when multiple comparisons are conducted. It’s often hard to work out quite how this was done, and you can end up staring at a table that shows brain regions and p-values, with only a vague idea of how big a difference there actually is between groups.
Most of the SNPs that are being used in brain studies are ones that were found to be associated with a behavioural phenotype in large-scale genomic studies where the sample size would include hundreds if not thousands of individuals, so small effects could be detected. Brain-based studies often use sample sizes that are relatively small, but some of them find large, sometimes very large, effects. So what does that mean? The optimistic interpretation is that a brain-based phenotype is much closer to the gene effect, and so gives clearer findings. This is essentially the argument used by those who talk of ‘endophenotypes’ or ‘biomarkers’. There is, however, an alternative, and much more pessimistic view, which is that studies linking genotypes with brain measures are prone to generate false positive findings, because there are too many places in the analysis pipeline where the researchers have opportunities to pick and choose the analysis that brings out the effect of interest most clearly. Neuroskeptic has a nice blogpost illustrating this well-known problem in the neuroimaging area; matters are only made worse by uncertainty re SNP classification (point 1).
A source of concern here is the unpublishability of null findings. Suppose you did a study where you looked at, say, 40 SNPs and a range of measures of brain structure, covering the whole brain. After doing appropriate corrections for multiple comparisons, nothing is significant. The sad fact is that your study is unlikely to find a home in a journal. But is this right? After all, we don’t want to clutter up the literature with a load of negative results. The answer depends on your sample size, among other things. In a small sample, a null result might well reflect lack of statistical power to detect a small effect. This is precisely why people should avoid doing small studies: if you find nothing, it’s uninterpretable. What we need are studies that allow us to say with confidence whether or not there is a significant gene effect.

4. How do the genetic/neuroimaging results relate to cognitive measures in your sample?  Your notion that ‘underactivation of brain area X’ is an endophenotype that leads to poor language, for instance, doesn’t look very plausible if people who have such underactivation have excellent language skills. Out of five papers on CNTNAP2 that I reviewed, three made no mention of cognitive measures, one gathered cognitive data but did not report how it related to genotype or brain measures, and only one provided some relevant, though sketchy, data.

5. Report negative findings. The other kind of email I’ve been writing to people is one that says – could you please clarify whether your failure to report on the relationship between X and Y was because you didn’t do that analysis, or whether you did the analysis but failed to find anything. This is going to be an uphill battle, because editors and reviewers often advise authors to remove analyses with nonsignificant findings. This is a very bad idea as it distorts the literature.


And last of all....
A final plea is not so much to journal editors as to press officers. Please be aware that studies of common SNPs aren't the same as studies of rare genetic mutations. The genetic variants in the studies I looked at were all relatively common in the general population, and so aren't going to be associated with major brain abnormalities. Sensationalised press releases can only cause confusion:
This release on the Scott-Van Zeeland (2010) study described neuroimaging findings from CNTNAP2 variants that are found in over 70% of the population. It claims that:
  • “A gene variant tied to autism rewires the brain"
  • "Now we can begin to unravel the mystery of how genes rearrange the brain's circuitry, not only in autism but in many related neurological disorders."
  • “Regardless of their diagnosis, the children carrying the risk variant showed a disjointed brain. The frontal lobe was over-connected to itself and poorly connected to the rest of the brain”
  • "If we determine that the CNTNAP2 variant is a consistent predictor of language difficulties, we could begin to design targeted therapies to help rebalance the brain and move it toward a path of more normal development."
Only at the end of the press release, are we told that "One third of the population [sic: should be two thirds] carries this variant in its DNA. It's important to remember that the gene variant alone doesn't cause autism, it just increases risk." 

References
Anney, R., Klei, L., Pinto, D., Almeida, J., Bacchelli, E., Baird, G., . . . Devlin, B. (2012). Individual common variants exert weak effects on the risk for autism spectrum disorders. Human Molecular Genetics, 21(21), 4781-4792. doi: 10.1093/hmg/dds301

Folia, V., Forkstam, C., Ingvar, M., Hagoort, P., & Petersson, K. M. (2011). Implicit artificial syntax processing: Genes, preference, and bounded recursion. Biolinguistics, 5.

Kos, M., et al. (2012). CNTNAP2 and language processing in healthy individuals as measured with ERPs. PLOS One, 7.

Scott-Van Zeeland, A., Abrahams, B., Alvarez-Retuerto, A., Sonnenblick, L., Rudie, J., Ghahremani, D., Mumford, J., Poldrack, R., Dapretto, M., Geschwind, D., & Bookheimer, S. (2010). Altered functional connectivity in frontal lobe circuits is associated with variation in the autism risk gene CNTNAP2. Science Translational Medicine, 2(56). doi: 10.1126/scitranslmed.3001344

Tan, G. C., Doke, T. F., Ashburner, J., Wood, N. W., & Frackowiak, R. S. (2010). Normal variation in fronto-occipital circuitry and cerebellar structure with an autism-associated polymorphism of CNTNAP2. NeuroImage, 53, 1030.

Vernes, S. C., Newbury, D. F., Abrahams, B., Winchester, L., Nicod, J., Groszer, M., . . . Fisher, S. (2008). A functional genetic link between distinct developmental language disorders. New England Journal of Medicine, 359, 2337-2345.

Whalley, H. C., et al. (2011). Genetic variation in CNTNAP2 alters brain function during linguistic processing in healthy individuals. American Journal of Medical Genetics Part B, 156B, 941.