
Friday, 28 November 2014

Metricophobia among academics

Most academics loathe metrics. I’ve seldom attracted so much criticism as for my suggestion that a citation-based metric might be used to allocate funding to university departments. This suggestion was recycled this week in the Times Higher Education, after a group of researchers published predictions of REF2014 results based on departmental H-indices for four subjects.

Twitter was appalled. Philip Moriarty, in a much-retweeted plea, said: “Ugh. *Please* stop giving credence to simplistic metrics like the h-index. V. damaging”. David Colquhoun, with whom I agree on many things, responded like an exorcist confronted with the spawn of the devil, arguing that any use of metrics would just encourage universities to pressurise staff to increase their H-indices.

Now, as I’ve explained before, I don’t particularly like metrics. In fact, my latest proposal is to drop both REF and metrics and simply award funding on the basis of the number of research-active people in a department. But I’ve become intrigued by the loathing of metrics that is revealed whenever a metrics-based system is suggested, particularly since some of the arguments put forward do seem rather illogical.

Odd idea #1 is that doing a study relating metrics to funding outcomes is ‘giving credence’ to metrics. It’s not. What would give credence would be if the prediction of REF outcomes from H-index turned out to be very good. We already know that whereas it seems to give reasonable predictions for sciences, it’s much less accurate for humanities. It will be interesting to see how things turn out for the REF, but it’s an empirical question.

Odd idea #2 is that use of metrics will lead to gaming. Of course it will! Gaming will be a problem for any method of allocating money. The answer to gaming, though, is to be aware of how this might be achieved and to block obvious strategies, not to dismiss any system that could potentially be gamed. I suspect the H-index is less easy to game than many other metrics - though I’m aware of one remarkable case where a journal editor has garnered an impressive H-index from papers published in his own journals, with numerous citations to his own work. In general, though, those of us without editorial control are more likely to get a high H-index from publishing smaller amounts of high-quality science than churning out pot-boilers.
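For readers unfamiliar with the mechanics: a researcher (or department) has an H-index of h if h of their papers have been cited at least h times each. A minimal sketch in Python, with invented citation counts, shows why a handful of substantial papers can beat a stream of pot-boilers:

```python
def h_index(citations):
    """Return the H-index: the largest h such that h papers
    have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Two hypothetical publication records, each with 75 citations in total:
steady = [20, 18, 15, 12, 10]       # a few substantial papers
potboiler = [60, 5, 3, 3, 2, 1, 1]  # one hit plus many trivial pieces

print(h_index(steady))     # 5
print(h_index(potboiler))  # 3
```

Note that the lightly-cited papers in the second record contribute nothing at all to the index, which is precisely why it is relatively resistant to padding.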

Odd idea #3 is the assumption that the REF’s system of peer review is preferable to a metric. At the HEFCE metrics meeting I attended last month, almost everyone was in favour of complex, qualitative methods of assessing research. David Colquhoun argued passionately that to evaluate research you need to read the publications. To disagree with that would be like slamming motherhood and apple pie. But, as Derek Sayer has pointed out, it is inevitable that the ‘peer review’ component of the REF will be flawed, given that panel members are required to evaluate several hundred submissions in a matter of weeks. The workload is immense and cannot involve the careful consideration of the content of books or journal articles, many of which will be outside the reader’s area of expertise.

My argument is a pragmatic one: we are currently engaged in a complex evaluation exercise that is enormously expensive in time and money, that has distorted incentives in academia, and that cannot be regarded as a ‘gold standard’. So, as an empirical scientist, my view is that we should be looking hard at other options, to see whether we might be able to achieve similar results in a more cost-effective way.

Different methods can be compared in terms of the final result, and also in terms of unintended consequences. For instance, in its current manifestation, the REF encourages universities to take on research staff shortly before the deadline – as satirised by Laurie Taylor (see Appointments section of this article). In contrast, if departments were rewarded for a high H-index, there would be no incentive for such behaviour. Also, staff members who were not principal investigators but who made valuable contributions to research would be appreciated, rather than threatened with redundancy.  Use of an H-index would also avoid the invidious process of selecting staff for inclusion in the REF.

I suspect, anyhow, we will find predictions from the H-index are less good for REF than for RAE. One difficulty for Mryglod et al is that it is not clear whether the Units of Assessment they base their predictions on will correspond to those used in REF. Furthermore, in REF, a substantial proportion of the overall score comes from impact, evaluated on the basis of case studies. To quote from the REF2014 website: “Case studies may include any social, economic or cultural impact or benefit beyond academia that has taken place during the assessment period, and was underpinned by excellent research produced by the submitting institution within a given timeframe.” My impression is that impact was included precisely to capture an aspect of academic quality that was orthogonal to traditional citation-based metrics, and so this should weaken any correlation of outcomes with H-index.

Be this as it may, I’m intrigued by people’s reactions to the H-index suggestion, and wondering whether this relates to the subject one works in. For those in arts and humanities, it is particularly self-evident that we cannot capture all the nuances of departmental quality from an H-index – and indeed, it is already clear that correlations between H-index and RAE outcomes are relatively low in these disciplines. These academics work in fields where complex, qualitative analysis is essential. Interestingly, RAE outcomes in arts and humanities (as with other subjects) are pretty well predicted by departmental size, and it could be argued that this would be the most effective way of allocating funds.

Those who work in the hard sciences, on the other hand, take precision of measurement very seriously. Physicists, chemists and biologists are often working with phenomena that can be measured precisely and unambiguously. Their dislike for an H-index might, therefore, stem from awareness of its inherent flaws: it varies with subject area and can be influenced by odd things, such as high citations arising from notoriety.

Psychologists, though, sit between these extremes. The phenomena we work with are complex. Many of us strive to treat them quantitatively, but we are used to dealing with measurements that are imperfect but ‘good enough’. To take an example from my own research: years ago I wanted to measure the severity of children’s language problems, and I was using an elicitation task, where the child was shown pictures and asked to say what was happening. The test had a straightforward scoring system that gave indices of the maturity of the content and grammar of the responses. Various people, however, criticised this as too simple. I should take a spontaneous language sample, I was told, and do a full grammatical analysis. So, being young and impressionable, I did. I ended up spending hours transcribing tape-recordings from largely silent children, and hours more mapping their utterances onto a complex grammatical chart. The outcome: I got virtually the same result from the two processes – one which took ten minutes and the other which took two days.

Psychologists evaluate their measures in terms of how reliable (repeatable) they are and how validly they do what they are supposed to do. My approach to the REF is the same as my approach to the rest of my work: try to work with measures that are detailed and complex enough to be valid for their intended purpose, but no more so. To work out whether a measure fits that bill, we need to do empirical studies comparing different approaches – not just rely on our gut reaction.
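The comparison behind that anecdote can be made concrete. A minimal sketch in Python, with invented scores for ten children, shows the kind of check involved: if a quick measure correlates very highly with a laborious one, the quick measure is ‘good enough’ for most purposes:

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two paired sets of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented scores for ten children: a ten-minute elicitation test
# versus a two-day spontaneous-language analysis.
quick = [12, 15, 9, 20, 7, 14, 18, 11, 16, 10]
detailed = [14, 16, 10, 21, 8, 13, 19, 12, 17, 9]

print(round(pearson_r(quick, detailed), 2))  # 0.98
```

With agreement that high, the two measures rank the children almost identically, and the extra two days of analysis buys essentially nothing.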

Friday, 6 August 2010

How our current reward structures have distorted and damaged science

[Cartoon omitted; copyright www.CartoonStock.com]
Two things almost everyone would agree with:

1. Good scientists do research and publish their results, which then have impact on other scientists and the broader community.

2. Science is a competitive business: there is not enough research funding for everyone, not enough academic jobs in science, and not enough space in scientific journals. We therefore need ways of ensuring that the limited resources go to the best people.

When I started in research in the 1970s, research evaluation focused on individuals. If you wrote a grant proposal, applied for a job, or submitted a paper to a journal, evaluation depended on peer review, a process that is known to be flawed and subject to whim and bias, but is nevertheless regarded by many as the best option we have.

What has changed in my lifetime is the increasing emphasis on evaluating institutions rather than individuals. The 1980s saw the introduction of the Research Assessment Exercise, used to evaluate universities in terms of their research excellence, in order to provide a more objective and rational basis for allocating central funds (quality-weighted research funding, or QR) by the national funding council (HEFCE in England). The methods for evaluating institutions evolved over the next 20 years, and are still a matter of debate, but they have subtly influenced the whole process of evaluation of individual academics, because of the need to use standard metrics.

This is inevitable, because the panel evaluating a subject area can't be expected to read all the research produced by staff at an institution, but they would be criticised for operating an 'old boy network', or favouring their own speciality, if they relied just on personal knowledge of who is doing good work – which was what tended to happen before the RAE. Therefore they are forced into using metrics. The two obvious things that can be counted are research income and number of publications. But number of publications was early on recognised as problematic, as it would mean that someone with three parochial reports in the journal of a national society would look better than someone with a major breakthrough published in a top journal. There has therefore been an attempt to move from quantity to quality, by taking into account the impact factor of the journals that papers are published in.

Evaluation systems always change the behaviour of those being evaluated, as people attempt to maximise rewards. Recognising that institutional income depends on getting a good RAE score, vice-chancellors and department heads in many institutions now set overt quotas for their staff in terms of expected grant income and number of publications in high impact journals. The jobs market is also affected, as it becomes clear that employability depends on how good one looks on the RAE metrics.

The problem with all of this is that it means that the tail starts to wag the dog. Consider first how the process of grant funding has changed. The motivation to get a grant ought to be that one has an interesting idea and needs money to investigate it. Instead, it has turned into a way of funding the home institution and enhancing employability. Furthermore, the bigger the grant, the more the kudos, and so the pressure is on to do large-scale expensive studies. If individuals were assessed, not in terms of grant income, but in terms of research output relative to grant income, many would change status radically, as cheap, efficient research projects would rise up the scale. In psychology, there has been a trend to bolt on expensive but often unmotivated brain imaging to psychological studies, ensuring that the cost of each experiment is multiplied at least 10-fold. Junior staff are under pressure to obtain a minimum level of research funding, and consequently spend a great deal of time writing grant proposals, and the funding agencies are overwhelmed with applications. In my experience, an application written because someone tells you to write one is typically of poor quality, and just wastes the time of both applicants and reviewers. The scientist who is successful in meeting their quota is likely to be managing several grants. This may be a good thing if they are really talented, or have superb staff, but in my experience research is done best if the principal investigator puts serious thought and time into the day-to-day running of the project, and that becomes impossible with multiple grants.
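The 'output relative to grant income' idea is easy to illustrate. A toy sketch in Python, with entirely invented figures, shows how a ranking can flip once cost appears in the denominator:

```python
# Hypothetical projects: grant income (pounds) and citations to the
# papers each project produced. All figures invented for illustration.
projects = {
    "large imaging study": {"income": 2_000_000, "citations": 150},
    "small behavioural study": {"income": 150_000, "citations": 120},
}

def efficiency(p):
    """Citations generated per 100k of grant income."""
    return p["citations"] / (p["income"] / 100_000)

for name, p in projects.items():
    print(f"{name}: {efficiency(p):.1f} citations per 100k")

# Ranked by income, the imaging study dominates; ranked by efficiency,
# the cheap behavioural study is over ten times more productive.
```

The point is not that these numbers mean anything in themselves, but that the choice of metric (income versus income-adjusted output) determines who looks good.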

Regarding publications, I am old enough to have been publishing before the RAE, and I'm in the fortunate but unusual position of having had full-time research funding for my whole career. In the current system I am relatively safe, and I look good on an RAE return. But most people aren't so fortunate: they are trying to juggle doing research with teaching and administration, raising children and other distractions, yet feel under intense pressure to publish. The worry about the current system is that it will encourage people to cut corners, to favour research that is quick and easy. Sometimes, one is lucky, and a simple study leads to an interesting result that can be published quickly. But the best work typically requires a large investment of time and thought. The studies I am proudest of are ones which have taken years rather than months to complete: in some cases, the time is just on data collection, but in others, the time has involved reading, thinking, and working out ways of analysing and interpreting data. But this kind of paper is getting increasingly rare. As a reviewer, I frequently see piecemeal publication, so if you suggest that a further analysis would strengthen the paper, you are told that it has been done, but is the subject of another paper. Scholarship and contact with prior literature have become extremely rare: prior research is cited without reading it – or not cited at all – and the notion of research building on prior work has been eroded to the point that I sometimes think we are all so busy writing papers that we have no time to read them. There are growing complaints about an 'avalanche' of low-quality publications.

As noted above, in response to this, there has been a move to focus on quality rather than quantity of publications, with researchers being told that their work will only count if it is published in a high-impact journal. Some departments will produce lists of acceptable journals and will discourage staff from publishing elsewhere. In effect, impact factor is being used as a proxy for likelihood that a paper will be cited in future, and I'm sure that is generally true. But just because a paper in a high impact journal is likely to be highly cited, it does not mean that all highly-cited papers appear in high impact journals. In general, my own most highly-cited papers appeared in middle-ranking journals in my field. Moreover, the highest impact journals have several limitations:

1. They only take very short papers. Yes, it is usually possible to put extra information in 'supplementary material', but what you can't do is to take up space putting the work in context or discussing alternative interpretations. When I started in the field, it was not uncommon to publish a short paper in Nature, followed up with a more detailed account in another, lowlier, journal. But that no longer happens. We only get the brief account.

2. Demand for page space outstrips supply. To handle a flood of submissions, these journals operate a triage system, where the editor determines whether the paper should go out for review. This can have the benefit that rejection is rapid, but it puts a lot of power in the hands of editors, who are unlikely to be specialists in the subject area of the paper, and in some cases are explicit in their preference for papers with a 'wow' factor. It also means that one gets no useful feedback from reviewers: witness my recent experience with the New England Journal of Medicine, where I submitted a paper that I thought had all the features they'd find attractive – originality, clinical relevance and a link between genes and behaviour. It was bounced without review, and I emailed, not to appeal, but just to ask if I could have a bit more information about the criteria on which they based their rejection. I was told that they could not give me any feedback as they had not sent it out for review.


3. If the paper does go out for review, the subsequent review process can be very slow. There's an account of the trials and tribulations of dealing with Nature and Science which makes for depressing reading. Slow reviewing is clearly not a problem restricted to high impact journals. My experience is that lower-impact journals can be even worse. But the impression from the comments on FemaleScienceProfessor's blog is that reviewers can be unduly picky when the stakes are high.

So what can be done? I'd like to see us return to a time when the purpose of publishing was to communicate, and the purpose of research funding was to enable a scientist to pursue interesting ideas. The current methods of evaluation have encouraged an unstoppable tide of publications and grant proposals, many of which are poor quality. Many scientists are spending time on writing doomed proposals and papers when they would be better off engaging in research and scholarship in a less frenetic and more considered manner. But they won't do that so long as the pressures are on them to bring in grants and generate publications. I'll conclude with a few thoughts on how the system might be improved.

1. My first suggestions, regarding publications, are already adopted widely in the UK, but my impression is they may be less common elsewhere. Requiring applicants for jobs or fellowships to specify their five best publications rather than providing a full list rewards those who publish significant pieces of work, and punishes piecemeal publication. Use of the H-index as an evaluation metric rather than either number of publications or journal impact factor is another way to encourage people to focus on producing substantial papers rather than a flood of trivial pieces, as papers with low citations have no impact whatever on the H-index. There are downsides: we have the lag problem, which makes the H-index pretty useless for evaluating junior people, and in its current form the index does not take into account the contribution of authors, thereby encouraging multiple authorship, since anyone who can get their name on a highly-cited paper will boost their H-index, regardless of whether they are a main investigator or a freeloader.

2. Younger researchers should be made aware that a sole focus on publishing in very high impact journals may be counter-productive. Rapid publication in Open Access journals (many of which have perfectly respectable impact factors) may be more beneficial to one's career (http://openaccess.eprints.org/) because the work is widely accessible and so more likely to be cited. A further benefit of the PLOS journals, for instance, is that they don't impose strict length limits, so research can be properly described and put in context, rather than being restricted to the soundbite format required by very high impact journals.

3. Instead of using metrics based on grant income, those doing evaluations should use those based on efficiency, i.e. an input/output function. Two problems here: the lag in output is considerable, and the best metric for measuring output is unclear. The lag means it would be necessary to rely on track record, which can be problematic for those starting out in the field. Nevertheless, a move in this direction would at least encourage applicants and funders to think more about value for money, rather than maximising the size of a grant – a trend that has been exacerbated by Full Economic Costing (don't get me started on that). And it might make grant-holders and their bosses see the value of putting less time and energy into writing multiple proposals and more into getting a project done well, so that it will generate good outputs on a reasonable time scale.

4. The most radical suggestion is that we abandon formal institutional rankings (i.e. the successor to the RAE, the REF). I've been asking colleagues who were around before the RAE what they think it achieved. The general view was that the first ever RAE was a useful exercise that exposed weaknesses in institutions and individuals and got everyone to sharpen up their act. But the costs of subsequent RAEs (especially in terms of time) have not been justified by any benefit. I remember a speech given by Prof Colin Blakemore at the British Association for the Advancement of Science some years ago where he made this point, arguing that rankings changed rather little after the first exercise, and certainly not enough to justify the mammoth bureaucratic task involved. When I talk to people who have not known life without an RAE, they find it hard to imagine such a thing, but nobody has put forward a good argument that has convinced me it should be retained. I'd be interested to see what others think.