Showing posts with label REF.

Wednesday, 2 March 2016

On the need for clarity of purpose in the REF and TEF

The UK’s Research Excellence Framework (REF) has come in for a lot of criticism. It is now under review by a panel chaired by Nicholas Stern, with a call for evidence that closes later this month. At the same time, we have a Green Paper setting out plans for a Teaching Excellence Framework (TEF). This is motivated in part by the view that the attention given to research and teaching has got out of balance. The argument goes that REF has provided universities with strong incentives to put resources into research, and teaching has consequently been neglected (though see here). So what do we need to even things up? A TEF.

The problem for both REF and TEF is that, at the end of the day, they aim for a single scale on which universities can be rank ordered so we can compare quality. But everyone agrees that the things we are measuring, research and teaching excellence, are complex and multifactorial.
There are basically two ways forward. Option A is to use some kind of proxy measure, recognising its limitations but taking the view that it is good enough for purpose. Option B involves trying to measure the complex multifactorial construct in all its richness.
There are a number of factors that influence choice of approach. Because everyone recognises that things are complex, Option A is unlikely to be acceptable to the academic community. Simple measures are often easy to game. On the other hand, the complex multifactorial measures of Option B can be debated endlessly, often involve elements of subjective judgement, are not immune to gaming, can be extremely expensive to administer, and can be hard to integrate into a single ranking.
As James Wilsdon has noted with regard to the REF, before deciding which system of measurement to use, we have to have a clear idea of what we are trying to achieve. As far as the REF goes, its purpose has changed and mutated over the years. It started out with a pretty simple goal: to find a formula to determine allocation of quality-related (QR) funding from central government to universities. However, as Wilsdon notes, it has subsequently been used for four additional purposes: to demonstrate accountability, to provide a measure of reputation, to influence research culture, and as a tool within universities for managing academics. He notes that: “If all we want from the REF is a QR allocation tool, then we can certainly do that in an algorithmic, metric-based way” (i.e. Option A). But he argues the REF needs to fulfil the other functions too, and, as was amply demonstrated in his report, The Metric Tide, a simple metrics-based system is inadequate for those other purposes.
I agree with much of what Wilsdon says, but I think we could save ourselves a lot of trouble by reverting to the original purpose of the REF, i.e. treat it purely as a mechanism for allocating funding. As I have argued previously, if that is all you want to do, then you don’t even need to bother with metrics of the kind discussed in his report. A simple measure of the number of active researchers present in a department gives a remarkably high correlation with the amount of QR funding received – and this works well for most subjects in arts and humanities as well as sciences.
But what about gaming? When I proposed this idea a couple of years ago, people said, wouldn’t universities just designate the departmental cleaner as an active researcher, or take on more research staff? I don’t see these problems as insuperable. It would be important to specify stringent criteria for research staff to meet: these would include terms of employment (casual staff would be excluded), as well as evidence of research activity. If one counted only those staff who had been employed at the institution for some minimum period, such as 3-4 years, this should prevent institutions catapulting in overseas researchers on Mickey Mouse contracts, or taking on short-term staff to give a temporary blip in researcher numbers.
A more serious objection to my proposal is that there is no explicit measure of research quality – an institution could take on a large number of weak researchers and look as good as a competitor with an equal number of excellent researchers. But would this happen? Remember, researchers would need to be on the institutional payroll for a period of 3-4 years prior to the evaluation, so the institution would need to commit to the expense of employing them. This would not be worthwhile if staff then failed to meet the criteria set for research-active staff. Academics who did not count as active researchers would end up being a net cost to the institution.
I’m not saying that it would be easy to fine-tune such a system to avoid gaming or unintended consequences, just that it could be done, and I suspect would be much less difficult than devising an entirely separate system for evaluating research quality.
My case falls apart if, like Wilsdon (and many other people who have been involved in REF) you think REF should fulfil additional purposes. Then, because no one measure is suitable for all purposes, you need something much more complicated. But I do agree with Wilsdon that, if that’s what you want, you need to be clear about it – and about the need for a diverse set of measures appropriate to different goals.
What about TEF? Well, when you dig beneath the surface, you find that the parallels between REF and TEF are purely superficial. The purpose of TEF is not to allocate funding – there is no funding to allocate. The stated purposes are as complex and multifactorial as the notion of teaching excellence itself: to help students select courses, to increase access of under-represented groups to higher education, to provide a basis for allowing universities to raise fees, and to provide criteria for ‘new entrants’ (i.e. private institutions) that wish to enter the higher education market. According to a recent BIS Select Committee report, it’s also intended to provide incentives: “to ensure that higher education institutions meet student expectations and improve on their leading international position.” Quite what it means to improve on a leading international position is not specified.
In attempting to develop a measure that will cover all these functions, those promoting the TEF have tied themselves in knots, as illustrated by this wonderfully circular statement from the same Select Committee report:
In the absence of any agreed definition or recognised measures of teaching quality, the Government is proposing to use measures, or metrics, as proxies for teaching quality. Therefore the challenge is to identify those metrics which most reliably and accurately measure teaching quality, as opposed to other factors that contribute to the results achieved by students.
This is worrying. The only positive thing one can say is that there are signs that government may be starting to recognise some of the problems. The Select Committee report cautions against rushing into a TEF, and notes reservations both about the measures proposed and the proposed link between TEF and fee-raising powers. The report concludes by encouraging academics to work with BIS to develop appropriate metrics for TEF. The impression is that government is aware that, if they get it wrong, universities may just decide not to play ball. One of the members of the Select Committee, Amanda Milling, wrote in the Times Higher that “the higher education sector has a responsibility to engage with TEF to make it work.”
But do we? I would argue that the responsibility lies with the Minister, to make a proper case for the TEF.
As the Select Committee report points out: “It is important to note the high quality of teaching generally available in our higher education system at present … The debate around teaching excellence should therefore be viewed within the context of enhancing an already excellent system or, as the Minister for Universities and Science put it, ‘to continue to make a great sector greater still’”. These weasel words mean that if universities resist TEF, they can be accused of complacency. But where’s the evidence that TEF will ‘make a great sector greater still’? A considerable amount of time and money will be sucked up by this exercise, which has multiple confused aims and has the potential to tie up a great sector in pointless bureaucracy and waffle. The whole idea is seriously misconceived and has been rushed through without adequate justification or cost-benefit analysis.
We are now being told that TEF will be introduced by degrees, with measures being developed over time, but I am not reassured. If the government wants academics on side, it needs to demonstrate more coherent arguments, with clear specification of the goals of the TEF, and evidence of validity of the measures it proposes to achieve those goals. And most of all, it needs to show us that more good than harm will result from this exercise.

Sunday, 12 October 2014

Some thoughts on use of metrics in university research assessment

The UK’s Research Excellence Framework (REF) is like a walrus: it is huge, cumbersome and has a very long gestation period. Most universities started preparing in earnest for the REF early in 2011, with submissions being made late in 2013. Results will be announced in late December, just in time to cheer up our seasonal festivities.
 
Like many others, I have moaned about the costs of the REF: not just in money, but also the time spent by university staff, who could be more cheerfully and productively engaged in academic activities. The walrus needs feeding copious amounts of data: research outputs must be carefully selected and then graded in terms of research quality. Over the summer, those dedicated souls who sit on REF panels were required to read and evaluate several hundred papers. Come December, the walrus digestive system will have condensed the concerted ponderings of some of the best academic minds in the UK into a handful of rankings.

But is there a viable alternative? Last week I attended a fascinating workshop on the use of metrics in research. I had earlier submitted comments to an independent review of the role of metrics in research assessment from the Higher Education Funding Council for England (HEFCE), arguing that we need to consider cost-effectiveness when developing assessment methods. The current systems of evaluation have grown ever more complex and expensive, without anyone considering whether the associated improvements justified the increasing costs. My view is that an evaluation system need not be perfect – it just needs to be ‘good enough’ to provide a basis for disbursement of funds that can be seen to be both transparent and fair, and which does not lend itself readily to gaming.

Is there an alternative?
When I started preparing my presentation, I had intended to talk just about the use of measures of citations to rank departments, using analysis done for an earlier blogpost, as well as results from this paper by Mryglod et al. Both sources indicated that, at least in sciences, the ultimate quality-related research (QR) funding allocation for a department was highly correlated with a department-based measure of citations. So I planned to make the case that if we used a citation-based metric (which can be computed by a single person in a few hours) we could achieve much the same result as the full REF process for evaluating outputs, which takes many months and involves hundreds of people.
However, in pondering the data, I then realised that there was an even better predictor of QR funding per department: simply the number of staff entered into the REF process.

Before presenting the analysis, I need to backtrack to explain the measures I am using, as this can get quite confusing. HEFCE deserves an accolade for its website, where all the relevant data can be found. My analyses were based on the 2008 Research Assessment Exercise (RAE). In what follows I used a file called QR funding and research volume broken down by institution and subject, which is downloadable here. This contains details of funding for each institution and subject for 2009-2010. I am sure the calculations I present here have been done much better by others and I hope they will not be shy to inform me if there are mistakes in my working.

The variables of interest are:
  • The percentages of research falling in each star band in the RAE. From this, one can compute an average quality rating, by multiplying 4* by 7, 3* by 3, and 2* by 1 and adding these, and dividing the total by 100. Note that this figure is independent of department size and can be treated as an estimate of the average quality of a researcher in that department and subject.
  • The number of full-time equivalent research-active staff entered for the RAE. This is labelled as the ‘model volume number’, but I will call it Nstaff. (In fact, the numbers given in the 2009-2010 spreadsheet are slightly different from those used in the computation, for reasons I am not clear about, but I have used the correct numbers, i.e. those in HEFCE tables from RAE2008).
  • The departmental quality rating: this is average quality rating × Nstaff. (Labelled as “model quality-weighted volume” in the file). This is summed across all departments in a discipline to give a total subject quality rating (labelled as “total quality-weighted volume for whole unit of assessment”).
  • The overall funds available for the subject are listed as “Model total QR quanta for whole unit of assessment (£)”. I have not been able to establish how this number is derived, but I assume it has to do with the size and cost of the subject, and the amount of funding available from government.
  • QR (quality-related) funding is then derived by dividing the departmental quality rating by the total subject quality rating and multiplying by overall funds. This gives the sum of QR money allocated by HEFCE to that department for that year, which in 2009 ranged from just over £2K (Coventry University, Psychology) to over £12 million (UCL, Hospital-based clinical subjects). The total QR allocation in 2009-2010 for all disciplines was just over £1 billion.
  • The departmental H-index is taken from my previous blogpost. It is derived by doing a Web of Knowledge search for articles from the departmental address, and then computing the H-index in the usual way. Note that this does not involve identifying individual scientists.
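The funding formula described in the list above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as I have described it, not HEFCE's own code: the function names, the two departments and the £1 million pot are all made up for the example.

```python
# Weights applied to the percentage of research in each star band, as described
# in the text: 4* x 7, 3* x 3, 2* x 1. Any other band carries no weight.
WEIGHTS = {"4*": 7, "3*": 3, "2*": 1}


def average_quality(profile):
    """Average quality rating from a star-band profile given in percentages."""
    return sum(WEIGHTS.get(band, 0) * pct for band, pct in profile.items()) / 100


def qr_funding(departments, overall_funds):
    """Split overall_funds in proportion to each department's quality rating.

    departments maps a name to (star-band profile, Nstaff).
    """
    # Departmental quality rating = average quality rating x Nstaff.
    volume = {
        name: average_quality(profile) * nstaff
        for name, (profile, nstaff) in departments.items()
    }
    total = sum(volume.values())  # total subject quality rating
    return {name: overall_funds * v / total for name, v in volume.items()}


# Hypothetical two-department subject, purely for illustration.
departments = {
    "Dept X": ({"4*": 20, "3*": 40, "2*": 30, "1*": 10}, 30.0),
    "Dept Y": ({"4*": 5, "3*": 25, "2*": 50, "1*": 20}, 10.0),
}
allocation = qr_funding(departments, overall_funds=1_000_000)
for name, pounds in allocation.items():
    print(f"{name}: £{pounds:,.0f}")
```

Because Nstaff multiplies the average quality rating, a department's share depends on both its size and its star profile.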
Readers who are still with me may have noticed that we'd expect QR funding for a subject to be correlated with Nstaff, because Nstaff features in the formula for computing QR funding. And this makes sense, because departments with more research staff require greater levels of funding. A key question is just how much difference does it make to the QR allocation if one includes the quality ratings from the RAE in the formula.

Size-related funding
To check this out, I computed an alternative metric, size-related funding, which multiplies the overall funds by the proportion of Nstaff in the department relative to total staff in that subject across all departments. So if across all departments in the subject there are 100 staff, a department with 10 staff would get .1 of the overall funds for the subject.
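As a minimal sketch, the size-related calculation looks like this in Python. The function name and the £1 million pot are assumptions for illustration; the 10-out-of-100 staff case is the example just given.

```python
def size_related_funding(nstaff_by_dept, overall_funds):
    """Share out overall_funds in proportion to each department's Nstaff."""
    total_staff = sum(nstaff_by_dept.values())
    return {
        dept: overall_funds * nstaff / total_staff
        for dept, nstaff in nstaff_by_dept.items()
    }


# 100 staff across the subject, of whom 10 are in one department.
staff = {"Dept X": 10, "Rest of subject": 90}
shares = size_related_funding(staff, overall_funds=1_000_000)
print(shares["Dept X"])  # one tenth of the pot
```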

Table 1 shows: the correlation between Nstaff and QR funding (r QR/Nstaff) and how much a department would typically gain or lose if size-related funding were adopted, expressing the absolute difference as a percentage of QR funding (± % diff).
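For anyone wanting to reproduce the two summary columns, here is a rough Python sketch. It assumes that the ± % diff figure is the mean, across departments in a subject, of the absolute difference between size-related and QR funding expressed as a percentage of QR funding. The staff numbers and funding figures below are invented for illustration and are not taken from the HEFCE file.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def mean_abs_pct_diff(qr, size_related):
    """Mean absolute difference, as a percentage of each department's QR funding."""
    diffs = [abs(s - q) / q * 100 for q, s in zip(qr, size_related)]
    return sum(diffs) / len(diffs)


# Hypothetical subject with four departments (Nstaff in FTE, QR funding in £K).
nstaff = [10.0, 20.0, 30.0, 40.0]
qr = [150.0, 380.0, 640.0, 830.0]

# Size-related funding shares out the same total pot by head count alone.
total_funds = sum(qr)
size_related = [total_funds * n / sum(nstaff) for n in nstaff]

print(round(pearson_r(nstaff, qr), 3))
print(round(mean_abs_pct_diff(qr, size_related), 1))
```

Even in this toy case, a correlation very close to 1 coexists with an average gain or loss of several percent per department, which is the pattern seen in the table.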

Table 1: Mean number of staff and QR funding by subject, with correlation between QR and N staff, and mean difference between QR funding and size-related funding





Subject  Mean Nstaff  Mean QR (£K)  r QR/Nstaff  ± % diff
Cardiovascular Medicine 26.3 794 0.906 23
Cancer Studies 38.1 1,330 0.939 13
Infection and Immunology 43.7 1,506 0.971 22
Other Hospital Based Clinical Subjects 58.2 1,945 0.986 23
Other Laboratory Based Clinical Subjects 21.8 685 0.952 41
Epidemiology and Public Health 26.6 949 0.986 25
Health Services Research 21.9 659 0.900 24
Primary Care & Community Based Clinical  10.4 370 0.790 29
Psychiatry, Neuroscience & Clinical Psychology 46.7 1,402 0.987 15
Dentistry 31.1 1,146 0.977 13
Nursing and Midwifery 18.0 487 0.930 32
Allied Health Professions and Studies 20.4 424 0.884 36
Pharmacy 27.5 899 0.936 24
Biological Sciences 45.1 1,649 0.978 19
Pre-clinical and Human Biological Sciences 49.4 1,944 0.887 18
Agriculture, Veterinary and Food Science 33.2 999 0.976 21
Earth Systems and Environmental Sciences 28.6 1,128 0.971 14
Chemistry 37.9 1,461 0.969 18
Physics 44.0 1,596 0.994 8
Pure Mathematics 18.4 489 0.957 24
Applied Mathematics 20.0 614 0.988 19
Statistics and Operational Research 12.6 406 0.953 19
Computer Science and Informatics 22.9 769 0.954 26
Electrical and Electronic Engineering 23.8 892 0.982 17
General Engineering; Mineral/Mining Engineering 28.9 1,073 0.958 30
Chemical Engineering 26.6 1,162 0.968 15
Civil Engineering 23.2 1,005 0.960 19
Mech., Aeronautical, Manufacturing Engineering 35.7 1,370 0.987 14
Metallurgy and Materials 21.1 807 0.948 24
Architecture and the Built Environment 18.7 436 0.961 23
Town and Country Planning 15.1 306 0.911 27
Geography and Environmental Studies 22.8 505 0.969 21
Archaeology 20.7 518 0.990 12
Economics and Econometrics 25.7 581 0.968 20
Accounting and Finance 11.7 156 0.982 19
Business and Management Studies 38.7 630 0.964 27
Library and Information Management 16.3 244 0.935 26
Law 26.6 426 0.960 30
Politics and International Studies 22.4 333 0.955 31
Social Work and Social Policy & Administration 19.1 324 0.944 26
Sociology 24.1 404 0.933 24
Anthropology 18.6 363 0.946 12
Development Studies 21.7 368 0.936 25
Psychology 21.1 424 0.919 35
Education 21.0 346 0.983 34
Sports-Related Studies 13.5 231 0.952 37
American Studies and Anglophone Area Studies 10.9 191 0.988 11
Middle Eastern and African Studies 17.7 393 0.978 17
Asian Studies 15.9 258 0.938 26
European Studies 20.1 253 0.787 30
Russian, Slavonic and East European Languages 8.7 138 0.973 22
French 12.6 195 0.979 16
German, Dutch and Scandinavian Languages 8.4 129 0.966 17
Italian 6.3 111 0.865 20
Iberian and Latin American Languages 9.1 156 0.937 17
Celtic Studies 0.0 328
English Language and Literature 20.9 374 0.982 26
Linguistics 11.7 168 0.956 18
Classics, Ancient History, Byzantine and Modern Greek Studies 19.4 364 0.992 22
Philosophy 14.4 258 0.987 23
Theology, Divinity and Religious Studies 11.4 174 0.958 32
History 20.8 366 0.988 21
Art and Design 22.7 419 0.955 37
History of Art, Architecture and Design 10.7 213 0.960 18
Drama, Dance and Performing Arts 9.8 221 0.864 36
Communication, Cultural and Media Studies 11.9 195 0.860 29
Music 10.6 259 0.863 33

Correlations between Nstaff and QR funding are very high: above .9 for most subjects, and no lower than .787 (European Studies). Nevertheless, as Table 1 shows, if we substituted size-related funding for QR funding, the amounts gained or lost by individual departments could be substantial. In some subjects, though, mainly in the Humanities, where overall QR allocations are anyhow quite modest, the difference between size-related and QR funding is not large in absolute terms. In such cases, it might be rational to allocate funds solely by Nstaff and ignore quality ratings. The advantage would be an enormous saving in time, because one could bypass the RAE or REF entirely. This might be a reasonable option if the amount of expenditure on the RAE/REF by the department exceeds any potential gain from inclusion of quality ratings.

Is the departmental H-index useful?
If we assume that the goal is to have a system that approximates the outcomes of the RAE (and I’ll come back to that later) then for most subjects you need something more than Nstaff. The issue then is whether an easily computed department-based metric such as the H-index or total citations could add further predictive power. I looked at the figures for two subjects where I had computed the departmental H-index: Psychology and Physics.  As it happens, Physics is an extreme case: the correlation between Nstaff and QR funding was .994. Adding an H-index does not improve prediction because there is virtually no variance left to explain. As can be seen from Table 1, Physics is a case where use of size-related funding might be justified, given that the difference between size-related and QR funding averages out at only 8%.

For Psychology, adding the H-index to the regression explains a small but significant 6.2% of additional variance, with the correlation increasing to .95.
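The figures hang together: the Psychology correlation of .919 in Table 1 corresponds to about 84.5% of variance explained by Nstaff alone, and adding 6.2% gives a multiple correlation of about .95. The Python sketch below checks that arithmetic and shows one way to compute the gain from a second predictor, using the standard formula for R-squared with two predictors. The departmental data are invented for illustration, and I am not claiming this is the exact procedure behind the reported figure.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r2_one(y, x1):
    """Proportion of variance in y explained by a single predictor."""
    return pearson_r(x1, y) ** 2


def r2_two(y, x1, x2):
    """Proportion of variance in y explained by two predictors together."""
    r1, r2, r12 = pearson_r(x1, y), pearson_r(x2, y), pearson_r(x1, x2)
    return (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)


# Consistency check on the reported figures: r = .919 plus 6.2% extra variance.
implied_r = sqrt(0.919 ** 2 + 0.062)
print(round(implied_r, 2))  # 0.95

# Hypothetical departments: Nstaff, departmental H-index, QR funding in £K.
nstaff = [8, 12, 15, 21, 30, 42]
hindex = [14, 25, 22, 40, 45, 70]
qr = [60, 190, 210, 480, 700, 1400]
gain = r2_two(qr, nstaff, hindex) - r2_one(qr, nstaff)
print(f"Additional variance explained by the H-index: {gain:.1%}")
```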

But how much difference would it make in practice if we were to use these readily available measures to award funding instead of the RAE formula? The answer is more than you might think, and this is because the range in award size is so very large that even a small departure from perfect prediction can translate into a lot of money.

Table 2 shows the different levels of funding that departments would accrue depending on how the funding formula is computed. The full table is too large and complex to show here, so I'll just show every 8th institution. As well as comparing alternative size-related and H-index-based (QRH) metrics with the RAE funding formula (QR0137), I have looked at how things change if the funding formula is tweaked: either to give more linear weighting to the different star categories (QR1234), or to give more extreme reward for the highest 4* category (QR0039), something which is rumoured to be a preferred method for REF2014. In addition, I have devised a metric that has some parallels with the RAE metric, based on the residual of the H-index after removing the effect of departmental size. This could be used as an index of quality that is independent of size; it correlates at r = .87 with the RAE average quality rating. To get an alternative QR estimate, it was substituted for the average quality rating in the funding formula to give the Size.Hres measure.
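To make the three weighting schemes concrete, here is a small Python sketch. It assumes that the digits in each label are the weights given to 1*, 2*, 3* and 4* research in that order, which fits the RAE formula (2* × 1, 3* × 3, 4* × 7) and the description of QR0039 as giving nothing for 1* or 2*. The two departments and the pot of 1000 are hypothetical.

```python
# Assumed weights for 1*, 2*, 3*, 4* research, read off the labels.
WEIGHTINGS = {
    "QR0137": (0, 1, 3, 7),  # the RAE funding formula
    "QR1234": (1, 2, 3, 4),  # star ratings as equal points on a scale
    "QR0039": (0, 0, 3, 9),  # extreme reward for 4* research
}


def quality_weighted_volume(profile, nstaff, weights):
    """profile holds the percentages of research at 1*, 2*, 3*, 4*, in order."""
    average_quality = sum(w * p for w, p in zip(weights, profile)) / 100
    return average_quality * nstaff


def allocate(departments, overall_funds, weights):
    """Split overall_funds in proportion to quality-weighted volume."""
    volumes = {
        name: quality_weighted_volume(profile, nstaff, weights)
        for name, (profile, nstaff) in departments.items()
    }
    total = sum(volumes.values())
    return {name: overall_funds * v / total for name, v in volumes.items()}


# Two hypothetical departments of equal size but different star profiles.
departments = {
    "Strong": ((5, 25, 40, 30), 20.0),
    "Weak": ((30, 50, 15, 5), 20.0),
}
shares = {
    label: allocate(departments, 1000.0, weights)
    for label, weights in WEIGHTINGS.items()
}
for label, result in shares.items():
    print(label, {name: round(value) for name, value in result.items()})
```

With size held equal, the stronger department's share rises steadily as one moves from QR1234 to QR0137 to QR0039.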

Table 2: Funding results in £K from different metrics for seven Psychology departments representing different levels of QR funding


institution QR0137 Size-related QR1234 QR0039 QRH Size.Hres
A 1891 1138 1424 2247 1416 1470
B 812 585 683 899 698 655
C 655 702 688 620 578 576
D 405 363 401 400 499 422
E 191 323 276 121 279 304
F 78 192 140 44 299 218
G 26 161 81 13 60 142

To avoid invidious comparisons, I have not labelled the departments, though anyone who is curious about their identity could discover them quite readily. The two columns that use the H-index tend to give similar results, and are closer to a QR funding formula that treats the four star ratings as equal points on a scale (QR1234). It is also apparent that a move to QR0039 (where most reward is given for 4* research and none for 1* or 2*) will increase the share of funds to those institutions that are already doing well, and decrease it for those that already have poorer income under the current system. One can also see that some of the universities at the lower end of the table, all of them post-1992 universities, seem disadvantaged by the RAE metric, in that the funding they received seems low relative to both their size and the H-index.

The quest for a fair solution
So what is a fair solution? Here, of course, lies the problem. There is no gold standard. There has been a lot of discussion about whether we should use metrics, but much less discussion of what we are hoping to achieve with a funding allocation.

How about the idea that we could allocate funds simply on the basis of the number of research-active staff? In a straw poll I’ve taken, two concerns are paramount.

First, there is a widely held view that we should give maximum rewards to those with highest quality research, because this will help them maintain their high standing, and incentivise others to do well. This is coupled with a view that we should not be rewarding those who don’t perform. But how extreme do we want this concentration of funding to be? I’ve expressed concerns before that too much concentration in a few elite institutions is not good for UK academia, and that we should be thinking about helping middle-ranking institutions become elite, rather than focusing all our attention on those who have already achieved that status. The calculations from RAE in Table 2 show how a tweaking of the funding formula to give higher weighting to 4* research will take money from the poorer institutions and give it to the richer ones: it would be good to see some discussion of the rationale for this approach.

The second source of worry is the potential for gaming. What is to stop a department from entering all their staff, or boosting numbers by taking on extra staff? The first point could be dealt with by having objective criteria for inclusion, such as some minimal number of first- or last-authored publications in the reporting period.  The second strategy would be a risky one, since the institution would have to provide salaries and facilities for the additional staff, and this would only be cost-effective if the QR allocation would cover it. Of course, a really cynical gaming strategy would be to hire people briefly for the REF and then fire them once it is over. However, if funding were simply a function of number of research-active staff, it would be easy to do an assessment annually, to deter such short-term strategies.

How about the departmental H-index? I have shown that it not only is a fairly good predictor of RAE QR funding outcomes on its own, incorporating as it does both aspects of departmental size and research quality, but it also correlates with the RAE measure of quality, once the effect of departmental size is adjusted for. This is all the more impressive when one notes that the departmental H-index is based on any articles listed as coming from the departmental address, whereas the quality rating is based just on those articles submitted to the RAE.
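For readers unfamiliar with the calculation, here is a short Python sketch of the H-index, together with one way of adjusting it for departmental size. The size adjustment here takes residuals from a straight-line fit of H-index on Nstaff, which is an assumption made for illustration; the citation counts and department figures are made up.

```python
def h_index(citations):
    """Largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Citation counts for all articles from one hypothetical departmental address.
print(h_index([10, 8, 5, 4, 3]))  # 4

# Hypothetical departments: Nstaff and departmental H-index.
nstaff = [8, 12, 15, 21, 30, 42]
hindex = [14, 25, 22, 40, 45, 70]
size_adjusted = residuals(hindex, nstaff)
print([round(r, 1) for r in size_adjusted])
```

A positive residual marks a department whose H-index is higher than its size alone would predict.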

There are well-rehearsed objections to the use of citation metrics such as the H-index. First, any citation-based measure is useless for very recent articles. Second, citations vary from discipline to discipline, and, within my own subject, Psychology, from sub-discipline to sub-discipline. Furthermore, the H-index can be gamed to some extent by self-citation, or scientific cliques, and one way of boosting it is to insist on having your name on any publication you are remotely connected with, though the latter strategy is more likely to work for the H-index of the individual than for the H-index of the department. It is easy to find anecdotal instances of poor articles that are highly cited and good articles that are neglected. Nevertheless, it may be a ‘good enough’ measure when used in aggregate: not to judge individuals but to gauge the scientific influence of work coming from a given department over a period of a few years.

The quest for a perfect measure of quality
I doubt that either of these ‘quick and dirty’ indices will be adopted for future funding allocations, because it’s clear that most academics hate the idea of anything so simple. One message frequently voiced at the Sussex meeting was that quality is far too complex to be reduced to a single number.  While I agree with that sentiment, I am concerned that in our attempts to get a perfect assessment method, we are developing systems that are ever more complex and time-consuming. The initial rationale for the RAE was that we needed a fair and transparent means of allocating funding after the 1992 shake-up of the system created many new universities. Over the years, there has been mission creep, and the purpose of the RAE has been taken over by the idea that we can and should measure quality, feeding an obsession with league tables and competition. My quest for something simpler is not because I think quality is simple, but rather because I think we should use the REF just as a means to allocate funds. If that is our goal, we should not reject simple metrics just because we find them oversimplistic: we should base our decisions on evidence and go for whatever achieves an acceptable outcome at reasonable cost. If a citation-based metric can do that job, then we should consider using it unless we can demonstrate that something else works better.

I'd be very grateful for comments and corrections.

Reference  
Mryglod, O., Kenna, R., Holovatch, Y., & Berche, B. (2013). Comparison of a citation-based indicator and peer review for absolute and specific measures of research-group excellence. Scientometrics, 97(3), 767-777. DOI: 10.1007/s11192-013-1058-9