Showing posts with label statistics. Show all posts
Showing posts with label statistics. Show all posts

Friday, 3 November 2017

Prisons, developmental language disorder, and base rates

There's been some interesting discussion on Twitter about the high rate of developmental language disorder (DLD) in the prison population. Some studies give an estimate as high as 50 percent (Anderson et al, 2016), and this has prompted calls for speech-language therapy services to be involved in the working with offenders. Work by Pam Snow and others has documented the difficulties of navigating the justice system if your understanding and ability to express yourself are limited.

This is important work, but I have worried from time to time about the potential for misunderstanding. In particular, if you are a parent of a child with DLD, should you be alarmed at the prospect that your offspring will be incarcerated? So I wanted to give a brief explainer that offers some reassurance.

The simplest way to explain it is to think about gender. I've been delving into the latest national statistics for this post, and found that the UK prison population this year contained 82,314 men, but a mere 4,013 women. That's a staggering difference, but we don't conclude that because most criminals are men, therefore most men are criminals. This is because we have to take into account base rates: the proportion of the general population who are in prison. Another set of government statistics estimates the UK population as around 64.6 million, about half of whom are male, and 81% are adults. So a relatively small proportion of the adult population is in prison, and the numbers of non-criminal men vastly outnumber the number of criminal men.

I did similar sums for DLD, using data from Norbury et al (2016) to estimate a population prevalence of 7% in adult males, and plugging in that relatively high figure of 50% of prisoners with DLD. The figures look like this.


Numbers (in thousands) assuming 7% prevalence of DLD and 50% DLD in prisoners*
As you can see, according to this scenario, the probability of going to prison is much greater for those with DLD than for those without DLD (2.24% DLD vs 0.17% without DLD), but the absolute probability is still very low – 98% of those with DLD will not be incarcerated.

The so-called base rate fallacy is a common error in logical reasoning. It seems natural to conclude that if A is associated with B, then B must be associated with A. Statistically, that is true, but if A is extremely rare, then the likelihood of B given A can be considerably less than the likelihood of A given B.

So I don't think therefore that we need to seek explanations for the apparent inconsistency that's being flagged up on Twitter between rates of incarceration in studies of those with DLD, vs rates of DLD in those who are incarcerated. It could just be the consequence of the low base rate of incarceration.

References
Anderson et al (2016) Language impairments among youth offenders: A systematic review. Children and Youth Services Review, 65, 195-203.

Norbury, C. F.,  et al. (2016). The impact of nonverbal ability on prevalence and clinical presentation of language disorder: evidence from a population study. Journal of Child Psychology and Psychiatry, 57, 1247-1257.

*An R script for generating this figure can be found here.


Postscript - 4th November 2017 
The Twitter discussion has continued and drawn attention to further sources of information on rates of language and related problems in prison populations. Happy to add these here if people can send sources:

Talbot, J. (2008). No One Knows: Report and Final Recommendations. Report by Prison Reform Trust.  

House of Commons Justice Committee (2016) The Treatment of Young Adults in the Criminal Justice System.  Report HC 169.

Sunday, 10 September 2017

Bishopblog catalogue (updated 10 Sept 2017)

Source: http://www.weblogcartoons.com/2008/11/23/ideas/

Those of you who follow this blog may have noticed a lack of thematic coherence. I write about whatever is exercising my mind at the time, which can range from technical aspects of statistics to the design of bathroom taps. I decided it might be helpful to introduce a bit of order into this chaotic melange, so here is a catalogue of posts by topic.

Language impairment, dyslexia and related disorders
The common childhood disorders that have been left out in the cold (1 Dec 2010) What's in a name? (18 Dec 2010) Neuroprognosis in dyslexia (22 Dec 2010) Where commercial and clinical interests collide: Auditory processing disorder (6 Mar 2011) Auditory processing disorder (30 Mar 2011) Special educational needs: will they be met by the Green paper proposals? (9 Apr 2011) Is poor parenting really to blame for children's school problems? (3 Jun 2011) Early intervention: what's not to like? (1 Sep 2011) Lies, damned lies and spin (15 Oct 2011) A message to the world (31 Oct 2011) Vitamins, genes and language (13 Nov 2011) Neuroscientific interventions for dyslexia: red flags (24 Feb 2012) Phonics screening: sense and sensibility (3 Apr 2012) What Chomsky doesn't get about child language (3 Sept 2012) Data from the phonics screen (1 Oct 2012) Auditory processing disorder: schisms and skirmishes (27 Oct 2012) High-impact journals (Action video games and dyslexia: critique) (10 Mar 2013) Overhyped genetic findings: the case of dyslexia (16 Jun 2013) The arcuate fasciculus and word learning (11 Aug 2013) Changing children's brains (17 Aug 2013) Raising awareness of language learning impairments (26 Sep 2013) Good and bad news on the phonics screen (5 Oct 2013) What is educational neuroscience? (25 Jan 2014) Parent talk and child language (17 Feb 2014) My thoughts on the dyslexia debate (20 Mar 2014) Labels for unexplained language difficulties in children (23 Aug 2014) International reading comparisons: Is England really do so poorly? (14 Sep 2014) Our early assessments of schoolchildren are misleading and damaging (4 May 2015) Opportunity cost: a new red flag for evaluating interventions (30 Aug 2015) The STEP Physical Literacy programme: have we been here before? (2 Jul 2017)

Autism
Autism diagnosis in cultural context (16 May 2011) Are our ‘gold standard’ autism diagnostic instruments fit for purpose? (30 May 2011) How common is autism? (7 Jun 2011) Autism and hypersystematising parents (21 Jun 2011) An open letter to Baroness Susan Greenfield (4 Aug 2011) Susan Greenfield and autistic spectrum disorder: was she misrepresented? (12 Aug 2011) Psychoanalytic treatment for autism: Interviews with French analysts (23 Jan 2012) The ‘autism epidemic’ and diagnostic substitution (4 Jun 2012) How wishful thinking is damaging Peta's cause (9 June 2014)

Developmental disorders/paediatrics
The hidden cost of neglected tropical diseases (25 Nov 2010) The National Children's Study: a view from across the pond (25 Jun 2011) The kids are all right in daycare (14 Sep 2011) Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012) Changing the landscape of psychiatric research (11 May 2014)

Genetics
Where does the myth of a gene for things like intelligence come from? (9 Sep 2010) Genes for optimism, dyslexia and obesity and other mythical beasts (10 Sep 2010) The X and Y of sex differences (11 May 2011) Review of How Genes Influence Behaviour (5 Jun 2011) Getting genetic effect sizes in perspective (20 Apr 2012) Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012) Genes, brains and lateralisation (22 Dec 2012) Genetic variation and neuroimaging (11 Jan 2013) Have we become slower and dumber? (15 May 2013) Overhyped genetic findings: the case of dyslexia (16 Jun 2013) Incomprehensibility of much neurogenetics research ( 1 Oct 2016) A common misunderstanding of natural selection (8 Jan 2017) Sample selection in genetic studies: impact of restricted range (23 Apr 2017)

Neuroscience
Neuroprognosis in dyslexia (22 Dec 2010) Brain scans show that… (11 Jun 2011)  Time for neuroimaging (and PNAS) to clean up its act (5 Mar 2012) Neuronal migration in language learning impairments (2 May 2012) Sharing of MRI datasets (6 May 2012) Genetic variation and neuroimaging (1 Jan 2013) The arcuate fasciculus and word learning (11 Aug 2013) Changing children's brains (17 Aug 2013) What is educational neuroscience? ( 25 Jan 2014) Changing the landscape of psychiatric research (11 May 2014) Incomprehensibility of much neurogenetics research ( 1 Oct 2016)

Reproducibility
Accentuate the negative (26 Oct 2011) Novelty, interest and replicability (19 Jan 2012) High-impact journals: where newsworthiness trumps methodology (10 Mar 2013) Who's afraid of open data? (15 Nov 2015) Blogging as post-publication peer review (21 Mar 2013) Research fraud: More scrutiny by administrators is not the answer (17 Jun 2013) Pressures against cumulative research (9 Jan 2014) Why does so much research go unpublished? (12 Jan 2014) Replication and reputation: Whose career matters? (29 Aug 2014) Open code: note just data and publications (6 Dec 2015) Why researchers need to understand poker ( 26 Jan 2016) Reproducibility crisis in psychology ( 5 Mar 2016) Further benefit of registered reports ( 22 Mar 2016) Would paying by results improve reproducibility? ( 7 May 2016) Serendipitous findings in psychology ( 29 May 2016) Thoughts on the Statcheck project ( 3 Sep 2016) When is a replication not a replication? (16 Dec 2016) Reproducible practices are the future for early career researchers (1 May 2017) Which neuroimaging measures are useful for individual differences research? (28 May 2017) Prospecting for kryptonite: the value of null results (17 Jun 2017)  

Statistics
Book review: biography of Richard Doll (5 Jun 2010) Book review: the Invisible Gorilla (30 Jun 2010) The difference between p < .05 and a screening test (23 Jul 2010) Three ways to improve cognitive test scores without intervention (14 Aug 2010) A short nerdy post about the use of percentiles (13 Apr 2011) The joys of inventing data (5 Oct 2011) Getting genetic effect sizes in perspective (20 Apr 2012) Causal models of developmental disorders: the perils of correlational data (24 Jun 2012) Data from the phonics screen (1 Oct 2012)Moderate drinking in pregnancy: toxic or benign? (1 Nov 2012) Flaky chocolate and the New England Journal of Medicine (13 Nov 2012) Interpreting unexpected significant results (7 June 2013) Data analysis: Ten tips I wish I'd known earlier (18 Apr 2014) Data sharing: exciting but scary (26 May 2014) Percentages, quasi-statistics and bad arguments (21 July 2014) Why I still use Excel ( 1 Sep 2016) Sample selection in genetic studies: impact of restricted range (23 Apr 2017) Prospecting for kryptonite: the value of null results (17 Jun 2017)

Journalism/science communication
Orwellian prize for scientific misrepresentation (1 Jun 2010) Journalists and the 'scientific breakthrough' (13 Jun 2010) Science journal editors: a taxonomy (28 Sep 2010) Orwellian prize for journalistic misrepresentation: an update (29 Jan 2011) Academic publishing: why isn't psychology like physics? (26 Feb 2011) Scientific communication: the Comment option (25 May 2011)  Publishers, psychological tests and greed (30 Dec 2011) Time for academics to withdraw free labour (7 Jan 2012) 2011 Orwellian Prize for Journalistic Misrepresentation (29 Jan 2012) Time for neuroimaging (and PNAS) to clean up its act (5 Mar 2012) Communicating science in the age of the internet (13 Jul 2012) How to bury your academic writing (26 Aug 2012) High-impact journals: where newsworthiness trumps methodology (10 Mar 2013)  A short rant about numbered journal references (5 Apr 2013) Schizophrenia and child abuse in the media (26 May 2013) Why we need pre-registration (6 Jul 2013) On the need for responsible reporting of research (10 Oct 2013) A New Year's letter to academic publishers (4 Jan 2014) Journals without editors: What is going on? (1 Feb 2015) Editors behaving badly? (24 Feb 2015) Will Elsevier say sorry? (21 Mar 2015) How long does a scientific paper need to be? (20 Apr 2015) Will traditional science journals disappear? (17 May 2015) My collapse of confidence in Frontiers journals (7 Jun 2015) Publishing replication failures (11 Jul 2015) Psychology research: hopeless case or pioneering field? (28 Aug 2015) Desperate marketing from J. Neuroscience ( 18 Feb 2016) Editorial integrity: publishers on the front line ( 11 Jun 2016) When scientific communication is a one-way street (13 Dec 2016) Breaking the ice with buxom grapefruits: Pratiques de publication and predatory publishing (25 Jul 2017)

Social Media
A gentle introduction to Twitter for the apprehensive academic (14 Jun 2011) Your Twitter Profile: The Importance of Not Being Earnest (19 Nov 2011) Will I still be tweeting in 2013? (2 Jan 2012) Blogging in the service of science (10 Mar 2012) Blogging as post-publication peer review (21 Mar 2013) The impact of blogging on reputation ( 27 Dec 2013) WeSpeechies: A meeting point on Twitter (12 Apr 2014) Email overload ( 12 Apr 2016)

Academic life
An exciting day in the life of a scientist (24 Jun 2010) How our current reward structures have distorted and damaged science (6 Aug 2010) The challenge for science: speech by Colin Blakemore (14 Oct 2010) When ethics regulations have unethical consequences (14 Dec 2010) A day working from home (23 Dec 2010) Should we ration research grant applications? (8 Jan 2011) The one hour lecture (11 Mar 2011) The expansion of research regulators (20 Mar 2011) Should we ever fight lies with lies? (19 Jun 2011) How to survive in psychological research (13 Jul 2011) So you want to be a research assistant? (25 Aug 2011) NHS research ethics procedures: a modern-day Circumlocution Office (18 Dec 2011) The REF: a monster that sucks time and money from academic institutions (20 Mar 2012) The ultimate email auto-response (12 Apr 2012) Well, this should be easy…. (21 May 2012) Journal impact factors and REF2014 (19 Jan 2013)  An alternative to REF2014 (26 Jan 2013) Postgraduate education: time for a rethink (9 Feb 2013)  Ten things that can sink a grant proposal (19 Mar 2013)Blogging as post-publication peer review (21 Mar 2013) The academic backlog (9 May 2013)  Discussion meeting vs conference: in praise of slower science (21 Jun 2013) Why we need pre-registration (6 Jul 2013) Evaluate, evaluate, evaluate (12 Sep 2013) High time to revise the PhD thesis format (9 Oct 2013) The Matthew effect and REF2014 (15 Oct 2013) The University as big business: the case of King's College London (18 June 2014) Should vice-chancellors earn more than the prime minister? (12 July 2014)  Some thoughts on use of metrics in university research assessment (12 Oct 2014) Tuition fees must be high on the agenda before the next election (22 Oct 2014) Blaming universities for our nation's woes (24 Oct 2014) Staff satisfaction is as important as student satisfaction (13 Nov 2014) Metricophobia among academics (28 Nov 2014) Why evaluating scientists by grant income is stupid (8 Dec 2014) Dividing up the pie in relation to REF2014 (18 Dec 2014)  Shaky foundations of the TEF (7 Dec 2015) A lamentable performance by Jo Johnson (12 Dec 2015) More misrepresentation in the Green Paper (17 Dec 2015) The Green Paper’s level playing field risks becoming a morass (24 Dec 2015) NSS and teaching excellence: wrong measure, wrongly analysed (4 Jan 2016) Lack of clarity of purpose in REF and TEF ( 2 Mar 2016) Who wants the TEF? ( 24 May 2016) Cost benefit analysis of the TEF ( 17 Jul 2016)  Alternative providers and alternative medicine ( 6 Aug 2016) We know what's best for you: politicians vs. experts (17 Feb 2017) Advice for early career researchers re job applications: Work 'in preparation' (5 Mar 2017)  

Celebrity scientists/quackery
Three ways to improve cognitive test scores without intervention (14 Aug 2010) What does it take to become a Fellow of the RSM? (24 Jul 2011) An open letter to Baroness Susan Greenfield (4 Aug 2011) Susan Greenfield and autistic spectrum disorder: was she misrepresented? (12 Aug 2011) How to become a celebrity scientific expert (12 Sep 2011) The kids are all right in daycare (14 Sep 2011)  The weird world of US ethics regulation (25 Nov 2011) Pioneering treatment or quackery? How to decide (4 Dec 2011) Psychoanalytic treatment for autism: Interviews with French analysts (23 Jan 2012) Neuroscientific interventions for dyslexia: red flags (24 Feb 2012) Why most scientists don't take Susan Greenfield seriously (26 Sept 2014)

Women
Academic mobbing in cyberspace (30 May 2010) What works for women: some useful links (12 Jan 2011) The burqua ban: what's a liberal response (21 Apr 2011) C'mon sisters! Speak out! (28 Mar 2012) Psychology: where are all the men? (5 Nov 2012) Should Rennard be reinstated? (1 June 2014) How the media spun the Tim Hunt story (24 Jun 2015)

Politics and Religion
Lies, damned lies and spin (15 Oct 2011) A letter to Nick Clegg from an ex liberal democrat (11 Mar 2012) BBC's 'extensive coverage' of the NHS bill (9 Apr 2012) Schoolgirls' health put at risk by Catholic view on vaccination (30 Jun 2012) A letter to Boris Johnson (30 Nov 2013) How the government spins a crisis (floods) (1 Jan 2014) The alt-right guide to fielding conference questions (18 Feb 2017) We know what's best for you: politicians vs. experts (17 Feb 2017) Barely a good word for Donald Trump in Houses of Parliament (23 Feb 2017)

Humour and miscellaneous Orwellian prize for scientific misrepresentation (1 Jun 2010) An exciting day in the life of a scientist (24 Jun 2010) Science journal editors: a taxonomy (28 Sep 2010) Parasites, pangolins and peer review (26 Nov 2010) A day working from home (23 Dec 2010) The one hour lecture (11 Mar 2011) The expansion of research regulators (20 Mar 2011) Scientific communication: the Comment option (25 May 2011) How to survive in psychological research (13 Jul 2011) Your Twitter Profile: The Importance of Not Being Earnest (19 Nov 2011) 2011 Orwellian Prize for Journalistic Misrepresentation (29 Jan 2012) The ultimate email auto-response (12 Apr 2012) Well, this should be easy…. (21 May 2012) The bewildering bathroom challenge (19 Jul 2012) Are Starbucks hiding their profits on the planet Vulcan? (15 Nov 2012) Forget the Tower of Hanoi (11 Apr 2013) How do you communicate with a communications company? ( 30 Mar 2014) Noah: A film review from 32,000 ft (28 July 2014) The rationalist spa (11 Sep 2015) Talking about tax: weasel words ( 19 Apr 2016) Controversial statues: remove or revise? (22 Dec 2016) The alt-right guide to fielding conference questions (18 Feb 2017) My most popular posts of 2016 (2 Jan 2017)

Sunday, 28 May 2017

Which neuroimaging measures are useful for individual differences research?

The tl;dr version

A neuroimaging measure is potentially useful for individual differences research if variation between people is substantially greater than variation within the same person tested on different occasions. This means that we need to know about the reliability of our measures, before launching into studies of individual differences.
High reliability is not sufficient to ensure a good measure, but it is necessary.

Individual differences research

Psychologists have used behavioural measures to study individual differences - in cognition and personality - for many years. The goal is complementary to psychological research that looks for universal principles that guide human behaviour: e.g. factors affecting learning or emotional reactions. Individual differences research also often focuses on underlying causes, looking for associations with genetic, experiential and/or neurobiological differences that could lead to individual differences.

Some basic psychometrics

Suppose I set up a study to assess individual differences in children’s vocabulary. I decide to look at three measures.
  • Measure A involves asking children to define a predetermined set of words, ordered in difficulty, and scoring their responses by standard criteria.
  • Measure B involves showing the child pictured objects that have to be named.
  • Measure C involves recording the child talking with another child and measuring how many different words they use.
For each of these measures, we’d expect to see a distribution of scores, so we could potentially rank order children on their vocabulary ability. But are the three measures equally good indicators of individual differences?













We can see immediately one problem with Test B: the distribution of scores is bunched tightly, so it doesn’t capture individual variation very well. Test C, which has the greatest spread of scores, might seem the most suitable for detecting individual variation. But spread of scores, while important, is not the only test attribute to consider. We also need to consider whether the measure assesses a stable individual difference, or whether it is influenced by random or systematic factors that are not part of what we want to measure.

There is a huge literature addressing this issue, starting with Francis Galton in the 19th century, with major statistical advances in the 1950s and 1960s (see review by Wasserman & Bracken, 2003). The classical view treats test scores as a compound, with a ‘true score’ part, plus an ‘error’ part. We want a measure that minimises the impact of random or systematic error.

If there is a big influence of random error, then the test score is likely to change from one occasion to the next. Suppose we measure the same children on two occasions a month apart on three new three tests, and then plot scores on time 1 vs time 2. (To simplify this example, we assume that all three tests have the same normal distribution of scores - the same as for test A in Figure 1, and there is an average gain of 10 points from time 1 to time 2).

Figure 2














We can see that Test F is not very reliable: although there is a significant association between the scores on two test occasions, individual children can show remarkable changes from time to time. If our goal is to measure a reasonably stable attribute of the person, then Test F is clearly not suitable. aov
Just because a test is reliable, it does not mean it is valid. But if it is not reliable, then it won’t be valid. This is illustrated by this nice figure from https://explorable.com/research-methodology:
.

What about change scores?

Sometimes we explicitly want to measure change: for instance, we may be more interested in how quickly a child learns vocabulary, rather than how much they know at some specific point in time. Surely, then, we don’t want a stable measure, as it would not identify the change? Wouldn’t test F be better than D or E for this purpose?

Unfortunately, the logic here is flawed. It’s certainly possible that people may vary in how much they change from time to time, but if our interest is in change, then what we want is a reliable measure of change. There has been considerable debate in the psychological literature as to how best to establish the reliability of a change measure, but the key point is that you can find substantial change in test scores that is meaningless, and that the likelihood of it being meaningless is substantial if the underlying measure is unreliable. The data in Figure 2 were simulated by assuming that all children changed by the same amount from Time 1 to Time 2, but that tests varied in how much random error was incorporated in the test score. If you want to interpret a change score as meaningful, then the onus is on you to convince others that you are not just measuring random error.

What does this have to do with neuroimaging?

My concern with the neuroimaging literature, is that measures from functional or structural imaging are often used to measure individual differences, but it is rare to find any mention of reliability of those measures. In most cases, we simply don’t have any data on repeated testing using the same measures - or if we do, the sample size is too small, or too selected, to give a meaningful estimate of reliability. Such data as we have don’t inspire confidence that brain measurements achieve high level of reliability that is aimed for in psychometric tests. This does not mean that these measures are not useful, but it does make them unsuited for the study of individual differences.

I hesitated about blogging on this topic, because nothing I am saying here is new: the importance of reliability has been established in the literature on measurement theory since 1950. Yet, when different subject areas evolve independently, it seems that methodological practices that are seen as crucial in one discipline can be overlooked in another that is rediscovering the same issues but with different metrics.

There are signs that things are changing, and we are seeing a welcome trend for neuroscientists to start taking reliability seriously. I started thinking about blogging on this topic just a couple of weeks ago after seeing some high-profile papers that exemplified the problems in this area, but in that period, there have also been some nice studies that are starting to provide information on reliability of neuroscience measures. This might seem like relatively dull science to many, but to my mind it is a key step towards incorporating neuroscience in the study of individual differences. As I commented on Twitter recently, my view is that anyone who wants to using a neuroimaging measure as an endophenotype should first be required to establish that it has adequate reliability for that purpose.

Further reading

This review by Dubois and Adolphs (2016) covers the issue of reliability and much more, and is highly recommended.
Other recent papers of relevance:
Geerligs, L., Tsvetanov, K. A., Cam-CAN, Henson, R. N. 2017 Challenges in measuring individual differences in functional connectivity using fMRI: The case of healthy aging. Human Brain Mapping
Nord, C. L., Gray, A., Charpentier, C. J., Robinson, O. J., Roiser, J. P. 2017 Unreliability of putative fMRI biomarkers during emotional face processing.Neuroimage.

Note: Post updated on 17th June 2017 because figures from R Markdown html were not displaying correctly on all platforms.

Sunday, 23 April 2017

Sample selection in genetic studies: impact of restricted range


I'll shortly be posting a preprint about methodological quality of studies in the field of neurogenetics. It's something I've been working on with a group of colleagues for a while, and we are aiming to make recommendations to improve the field.

I won't go into details here, as you will be able to read the preprint fairly soon. Instead, what I want to do here is to expand on a small point that cropped up as I looked at this literature, and which I think is underappreciated.

It's to do with sampling. There's a particular problem that I started to think about a while back when I heard someone give a talk about a candidate gene study. I can't remember who it was or even what the candidate gene was, but basically they took a bunch of students, genotyped them, and then looked for associations between their genotypes and measures of memory. They were excited because they found some significant results. But I was, as usual, sitting there thinking convoluted thoughts about all of this, and wondering whether it really made sense. In particular, if you have a common genetic variant that has such a big effect on memory, would this really show up in a bunch of students – who are presumably people who have pretty good memories? Wouldn't it rather be the case that what you'd expect would be an alteration in the frequencies of genotypes in the student population?

Whenever I have an intuition like that, I find the best thing to do is to try a simulation. Sometimes the intuition is confirmed, and sometimes things turn out different and, very often, more complicated.

But this time, I'm pleased to say my intuition seems to have something going for it.

So here's the nuts and bolts.

I simulated genotypes and associated phenotypes by just using R's nice mvrnorm function. For the examples below, I specified that a and A are equally common (i.e. minor allele frequency is .5), so we have 25% as aa, 50% as aA, and 25% AA. The script lets you specify how closely these are related to the phenotype, but from what we know about genetics, it's very unlikely that a common variant would have a value more than about .25.

We can then test for two things:
1)  How far does the distribution of genotypes in the sample (i.e. people who are aa, aA or AA) resemble that in the general population? If we know that MAF is .5, we expect this distribution to be 1:2:1.
2) We can assign each person a score corresponding to number of A alleles (coding aa as zero, aA as 1, and AA as 2) and look at the regression of the phenotype on the genotype. That's the standard approach to looking for genotype-phenotype association.

If we work with the whole population of simulated data, these values will correspond to those that we specified in setting up the simulation, provided we have a reasonably large sample size.

But what if we take a selective sample of cases who fall above some cutoff on the phenotype? This is equivalent to taking, for instance, a sample from a student population from a selective institution, when the phenotype is a measure of cognitive function. You're not likely to get into the institution unless you have a good cognitive ability. Then, working with this selected subgroup, we recompute our two measures, i.e. the proportions of each genotype, and the correlation between the genotype and the phenotype.

Now, the really interesting thing here is that, as the selection cutoff gets more extreme, two things happen:
a) The proportions of people with different genotypes starts to depart from the values expected for the population in general. We can test to see when the departure becomes statistically significant with a chi square test.
b) The regression of the phenotype on the genotype weakens. We can quantify this effect by just computing the p-value associated with the correlation between genotype and phenotype.

Figure 1: Genotype-phenotype associations for samples selected on phenotype

Figure 1 shows the mean phenotype scores for each genotype for three samples: an unselected sample, a sample selected with z-score cutoff zero (corresponding to the top 50% of the population on the phenotype) and a sample selected with z-score cutoff of .5 (roughly selecting the top third of the population).

It's immediately apparent from the figure that the selection dramatically weakens the association between genotype and phenotype. In effect, we are distorting the relationship between genotype and phenotype by focusing just on a restricted range. 

Comparison of p-values from conventional regression approach and chi square test on genotype frequencies in relation to sample selection

Figure 2 shows the data from another perspective, by considering the statistical results from a conventional regression analysis, when different z-score cutoffs are used, selecting an increasingly extreme subset of the population. If we take a cutoff of zero – in effect selecting just the top half of the population, the regression effect (predicting phenotype from genotype), shown in the blue line, which was strong in the full population, is already much reduced. If you select only people with z-scores of .5 or above (equivalent to an IQ score of around 108), then the regression is no longer significant. But notice what happens to the black line. This shows the p-value from a chi square test which compares the distribution of genotypes in relation to expected population values in each subsample. If there is a true association between genotype and phenotype, then greater the selection on the phenotpe, the more the genotype distribution departs from expected values. The specific patterns observed will depend on the true association in the population and on the sample size, but this kind of cross-over is a typical result.

So what's the moral of this exercise? Well, if you are interested in a phenotype that has a particular distribution in the general population, you need to be careful when selecting a sample for a genetic association study. If you pick a sample that has a restricted range of phenotypes relative to the general population, then you make it less likely that you will detect a true genetic association in a conventional regression analysis. In fact, if you take a selected sample, there comes a point when the optimal way to demonstrate an association is by looking for a change in the frequency of different genotypes in the selected population vs the general population.

No doubt this effect is already well-known to geneticists, and it's all pretty obvious to anyone who is statistically savvy, but I was pleased to be able to quantify the effect via simulations. It is clear that it has implications for those who work predominantly with selected samples such as university students. For some phenotypes, use of a student sample may not be a problem, provided they are similar to the general population in the range of phenotype scores. But for cognitive phenotypes that's very unlikely, and attempting to show genetic effects in such samples seems a doomed enterprise.

The script for this simulation, simulating genopheno cutoffs.R should be available here: 
https://github.com/oscci/SQING_repo

(This link updated on 29/4/17).






Saturday, 1 October 2016

On the incomprehensibility of much neurogenetics research


Together with some colleagues, I am carrying out an analysis of methodological issues such as statistical power in papers in top neuroscience journals. Our focus is on papers that compare brain and/or behaviour measures in people who vary on common genetic variants.

I'm learning a lot by being forced to read research outside my area, but I'm struck by how difficult many of these papers are to follow. I'm neither a statistician nor a geneticist, but I have nodding acquaintance with both disciplines, as well as with neuroscience, yet in many cases I find myself struggling to make sense of what researchers did and what they found. Some papers that have taken hours of reading and re-reading to just get at the key information that we are seeking for our analysis, i.e. what was the largest association that was reported.

This is worrying for the field, because the number of people competent to review such papers will be extremely small. Good editors will, of course, try to cover all bases by finding reviewers with complementary skill sets, but this can be hard, and people will be understandably reluctant to review a highly complex paper that contains a lot of material beyond their expertise.  I remember a top geneticist on Twitter a while ago lamenting that when reviewing papers they often had to just take the statistics on trust, because they had gone beyond the comprehension of all but a small set of people. The same is true, I suspect, for neuroscience. Put the two disciplines together and you have a big problem.

I'm not sure what the solution is. Making raw data available may help, in that it allows people to check analyses using more familiar methods, but that is very time-consuming and only for the most dedicated reviewer.

Do others agree we have a problem, or is it inevitable that as things get more complex the number of people who can understand scientific papers will contract to a very small set?

Saturday, 3 September 2016

Some thoughts on the Statcheck project



Yesterday, a piece in Retractionwatch covered a new study, in which results of automated statistics checks on 50,000 psychology papers are to be made public on the PubPeer website.
I had advance warning, because a study of mine had been included in what was presumably a dry run, and this led to me receiving an email on 26th August as follows:
Assuming someone had a critical comment on this paper, I duly clicked on the link, and had a moment of double-take when I read the comment.
Now, this seemed like overkill to me, and I posted a rather grumpy tweet about it. There was a bit of to and fro on Twitter with Chris Hartgerink, one of the researchers on the Statcheck project, and with the folks at Pubpeer, where I explained why I was grumpy and they defended their approach; as far as I was concerned it was not a big deal, and if nobody else found this odd, I was prepared to let it go.
But then a couple of journalists got interested, and I sent them a more detailed thoughts.
I was quoted in the Retraction Watch piece, but I thought it worth reporting my response in full here, because the quotes could be interpreted as indicating I disapprove of the Statcheck project and am defensive about errors in my work. Neither of those is true. I think the project is an interesting piece of work; my concern is solely with the way in which feedback to authors is being implemented. So here is the email I sent to journalists in full:
I am in general a strong supporter of the reproducibility movement and I agree it could be useful to document the extent to which the existing psychology literature contains statistical errors.
However, I think there are 2 problems with how this is being done in the PubPeer study.
1. The tone of the PubPeer comments will, I suspect alienate many people. As I argued on Twitter, I found it irritating to get an email saying a paper of mine had been discussed on PubPeer, only to find that this referred to a comment stating that zero errors had been found in the statistics of that paper.
I don't think we need to be told that - by all means report somewhere a list of the papers that were checked and found to be error-free, but you don't need to personally contact all the authors and clog up PubPeer with comments of this kind.
My main concern was that during an exceptionally busy period, this was just another distraction from other things. Chris Hartgerink replied that I was free to ignore the email, but that would be extremely rash because a comment on PubPeer usually means that someone has a criticism of your paper.
As someone who works on language, I also found the pragmatics of the communication non-optimal. If you write and tell someone that you've found zero errors in their paper, the implication is that this is surprising, because you don't go around stating the obvious*. And indeed, the final part of the comment basically said that your work may well have errors in it and even though they hadn't found them, we couldn't trust it.
Now at the same time as having that reaction, I appreciate this was a computer-generated message, written by non-native English speakers, that I should not take it personally, and no slur on my work was intended. And I would like to know if errors were found in my stats, and it is entirely possible that there are some, since none of us is perfect. So I don't want to over-react, but I think that if I, as someone basically sympathetic to this agenda, was irritated by the style of the communication, then the odds are this will stoke real hostility for those who are already dubious about what has been termed 'bullying' and so on by people interested in reproducibility.
2. I'll be interested to see how this pans out for people where errors are found.
My personal view is that the focus should be on errors that do change the conclusions of the paper.
I think at least a sample of these should be hand-checked so we have some idea of the error rate - I'm not sure if this has been done, but the PubPeer comment certainly gave no indication of that - it just basically said there's probably an error in your stats but we can't guarantee that there is, putting the onus on the author to then check it out.
If it's known that on 99% of occasions the automated check is accurate, then fine. If the accuracy is only 90% I'd be really unhappy about the current process as it would be leading to lots of people putting time into checking their papers on the basis of an insufficiently sensitive diagnostic. It would make the authors of the comments look frankly lazy in stirring up doubts about someone's work and then leaving them to check it out.
In epidemiology the terms sensitivity and specificity are used to refer to the accuracy of a diagnostic test. Minimally if the sensitivity and specificity of the automated stats check is known, then those figures should be provided with the automated message.

The above was written before Dalmeet drew my attention to the second paper, in which errors had been found. Here’s how I responded to that:

I hadn't seen the 2nd paper - presumably because I was not the corresponding author on that one. It's immediately apparent that the problem is that F ratios have been reported with one degree of freedom, when there should be two. In fact, it's not clear how the automated program could assign any p-value in this situation.

I'll communicate with the first author, Thalia Eley, about this, as it does need fixing for the scientific record, but, given the sample size (on which the second, missing, degree of freedom is based), the reported p-values would appear to be accurate.
  I have added a comment to this effect on the PubPeer site.


* I was thinking here of Gricean maxims, especially maxim of relation.