from Inside Higher Ed

October 25, 2016

“Since the first decade of the new millennium, the words ranking, evaluation, metrics, h-index and impact factors have wreaked havoc in the world of higher education and research.”

So begins a new English edition of Bibliometrics and Research Evaluation: Uses and Abuses from Yves Gingras, professor and Canada Research Chair in history and sociology of science at the University of Quebec at Montreal. The book was originally published in French in 2014, and while its newest iteration, published by the MIT Press, includes some new content, it’s no friendlier to its subject. Ultimately, Bibliometrics concludes that the trend toward measuring anything and everything is a modern, academic version of “The Emperor’s New Clothes,” in which — quoting Hans Christian Andersen, via Gingras — “the lords of the bedchamber took greater pains than ever to appear holding up a train, although, in reality there was no train to hold.”

Gingras says, “The question is whether university leaders will behave like the emperor and continue to wear each year the ‘new clothes’ provided for them by sellers of university rankings (the scientific value of which most of them admit to be nonexistent), or if they will listen to the voice of reason and have the courage to explain to the few who still think they mean something that they are wrong, reminding them in passing that the first value in a university is truth and rigor, not cynicism and marketing.”

Although some bibliometric methods “are essential to go beyond local and anecdotal perceptions and to map comprehensively the state of research and identify trends at different levels (regional, national and global),” Gingras adds, “the proliferation of invalid indicators can only harm serious evaluations by peers, which are essential to the smooth running of any organization.”

And here is the heart of Gingras’s argument: that colleges and universities are often so eager to proclaim themselves “best in the world” — or region, state, province, etc. — that they don’t take care to identify “precisely what ‘the best’ means, by whom it is defined and on what basis the measurement is made.” Put another way, he says, paraphrasing another researcher, if the metric is the answer, what is the question?

Without such information, Gingras warns, “the university captains who steer their vessels using bad compasses and ill-calibrated barometers risk sinking first into the storm.” The book doesn’t rule out the use of indicators to “measure” science output or quality, but Gingras says they must first be validated and then interpreted in context.

‘Evaluating Is Not Ranking’

Those looking for a highly technical overview of — or even technical screed against — bibliometrics might best look elsewhere; Bibliometrics is admittedly heavier on opinion than on detailed information science. But Gingras is an expert on bibliometrics, as scientific director of his campus’s Observatory of Science and Technology, which measures science, technology and innovation — and he draws on that expertise throughout the book.

Bibliometrics begins, for example, with a basic history of its subject, from library management in the 1950s and 1960s to science policy in the 1970s to research evaluation in the 1980s. Although evaluation of research is of course at least as old as the emergence of scientific papers hundreds of years ago, Gingras says that new layers — such as the evaluation of grants, graduate programs and research laboratories — were added in the 20th century. And beginning several decades ago, he says, qualitative, peer-led means of evaluation began to be seen as “too subjective to be taken seriously according to positivist conceptions of ‘decision making.’”

While study of publication and citation patterns, “on the proper scale, provides a unique tool for analyzing global dynamics of science over time,” the book says, the “entrenchment” of ever more numerous (and often ill-defined) quantitative indicators in the formal evaluation of institutions and researchers opens the door to their abuse. Negative consequences of a bibliometrics-centered approach to science include the suppression of risk-taking research that might not necessarily get published, and even deliberate gaming of the system — such as paying authors to add new addresses to their papers in order to boost the position of certain institutions on various rankings.

Among other rankings systems, Gingras criticizes Nature Publishing Group’s Nature Index, which ranks countries and institutions on the basis of the papers they publish in what the index defines as high-quality science journals. Because 17 of the 68 journals included in the index are published by the Nature group, the ranking presents some conflict of interest concerns, he says — especially as institutions and researchers may begin to feel pressured into publishing in those journals. More fundamentally, measuring sheer numbers of papers published in a given group of journals — as opposed to, say, citations — “is not proof that the scientific community will find it useful or interesting.”

In a similar point that Gingras makes throughout the book, “evaluating is not ranking.”

A ‘Booming’ Evaluation Market

Faculty members at Rutgers University, for example, have cited the intended and unintended consequences of administrative overreliance on indicators from Academic Analytics, a productivity index and benchmarking firm that aggregates publicly available data from the web. The faculty union there has asked the university not to use information from the database in personnel and various other kinds of decisions, and to make faculty members’ profiles available to them.

David Hughes, professor of anthropology and president of the American Association of University Professors- and American Federation of Teachers-affiliated faculty union at Rutgers, wrote in an essay for AAUP’s “Academe” blog last year, for example, “What consequences might flow from such a warped set of metrics? I can easily imagine department chairs and their faculties attempting to ‘game the system,’ that is to publish in the journals, obtain the grants and collaborate in the ways that Academic Analytics counts. Indeed, a chair might feel obligated, for the good of her department, to push colleagues to compete in the race. If so, then we all lose.”

Hughes continued, “Faculty would put less energy into teaching, service and civic engagement — all activities ignored by the database. Scholarship would narrow to fit the seven quantifiable grooves. We would lose something of the diversity, heterodoxy and innovation that is, again, so characteristic of academia thus far. This firm creates incentives to encourage exactly that kind of decline.”

Faculty members at Rutgers also have cited concerns about errors in their profiles, which either overestimate or underestimate their scholarship records. Similar concerns about accuracy, raised by a study comparing faculty members’ curricula vitae with their system profiles, led Georgetown University to drop its subscription to Academic Analytics. In an announcement earlier this month, Robert Groves, the university’s provost, said the quality of coverage of the “scholarly products of those faculty studied are far from perfect.”

Even with perfect coverage, Groves said, “the data have differential value across fields that vary in book versus article production and in their cultural supports for citations of others’ work.” Without adequate coverage, “it seems best for us to seek other ways of comparing Georgetown to other universities.”

In response to such criticisms, Academic Analytics has said that it opposes using its data in faculty personnel decisions, and that it’s helpful to administrators as one tool among many in making decisions.

Other products have emerged as possible alternatives to Academic Analytics, including Lyterati. The latter doesn’t do institutional benchmarking — at least not yet, in part because it’s still building up its client base. But among other capabilities, it includes searchable faculty web profiles with embedded analytics that automatically update when faculty members enter data. Its creators say that it helps researchers and administrators identify connections and experts that might otherwise be hard to find. Dossiers for faculty annual reports, promotion or tenure can be generated from a central database, instead of from a variety of different sources, in different styles.

The program is transparent to faculty members because it relies on their participation to build web profiles. “We are really focused on telling universities, ‘Don’t let external people come in and let people assess how good or productive or respected your faculty is — gather the data yourself and assess it yourself,’” said Rumy Sen, CEO of Entigence Corporation, Lyterati’s parent company.

Dianne Martin, former vice provost for faculty affairs at George Washington University, said she used it as a vehicle for professors to file required annual reports and to populate a faculty expert finder web portal. With such data, “we can easily run reports about various aspects of faculty engagement,” such as which professors are engaged in activities related to the university’s strategic plan or international or local community activities, she said. Reports on teaching and research also can be generated for accreditation purposes.

But Lyterati, too, has been controversial on some campuses; Jonathan Rosenberg, Ruth M. Davis Professor of Mathematics at the University of Maryland at College Park, said that his institution dropped it last year after “tremendous faculty uproar.”

While making uniform faculty CVs had “reasonable objectives,” he said, “implementation was a major headache.” Particularly problematic was the electronic conversion process, he said, which ended up behind schedule and became frustrating when data — such as an invited talk for which one couldn’t remember an exact date — didn’t meet program input guidelines.

Gingras said in an interview that the recent debates at Rutgers and Georgetown show the dangers of using a centralized and especially private system “that is a kind of black box that cannot be analyzed to look at the quality of the content — ‘garbage in, garbage out.’”

Companies have identified a moneymaking niche, and some administrators think they can save money using such external systems, he said. But their use poses “grave ethical problems, for one cannot evaluate people on the basis of a proprietary system that cannot be checked for accuracy.”

The reason managers want centralization of faculty data is to “control scientists, who used to be the only ones to evaluate their peers,” Gingras added. “It is a kind of de-skilling of research evaluation. … In this new system, the paper is no [longer] a unit of knowledge and has become an accounting unit.”

Gingras said the push toward using “simplistic” indicators is probably worst in economics and biomedical sciences; history and the other social sciences are somewhat better in that they still have a handle on qualitative, peer-based evaluation. And contrary to beliefs held in some circles, he said, this process has always had some quantitative aspects.

The book criticizes the “booming” evaluation market, describing it as something of a Wild West in which invalid indicators are peddled alongside those with potential value. He says that most indicators, or variables that make up many rankings, are never explicitly tested for their validity before they are used to evaluate institutions and researchers.

Testing Indicators for Validity

To that point, Gingras proposes a set of criteria for testing indicators. The first criterion is adequacy, meaning that the indicator actually corresponds to the object or concept being evaluated. Sometimes this is relatively simple, he says — for instance, a country’s level of investment in scientific research and development is a pretty good indicator of the intensity of its research activity.

But things become more complicated when trying to measure “quality” or “impact” of research, as opposed to sheer quantity, he says. For example, the number of citations a given paper receives might best be a measure of “visibility” instead of “quality.” And a bad indicator is one based on the number of Nobel Prize winners associated with a given university, since it measures the quality and quantity of work of an individual researcher, typically over decades — not the current quality of the institution as a whole. Gingras criticizes the international Shanghai Index for making Nobels part of its “pretend” institutional ranking formula.

Gingras’s second indicator criterion is sensitivity to the “inertia” of the object being measured, “since different objects change with more or less difficulty (and rapidity) over time.” Just as a thermometer that gave wildly different readings for the temperature of a room over a short period would be deemed faulty, he says, so should ranking systems that allow institutions to move significantly up or down within a single year.

Universities are like supertankers, he says, and simply can’t change course so quickly. So ranking institutions every year or even every couple of years is folly — bad science — and largely a marketing strategy by the producers of such rankings. Gingras applauds the National Research Council, for example, for ranking doctoral departments in each discipline every 10 years, a much more valid interval that might just demonstrate actual change. (The research council’s rankings draw plenty of criticism of their own methodology, however.)
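A rough way to picture this “inertia” test, sketched below with hypothetical ranks and an assumed tolerance rather than anything proposed in the book: if an institution’s rank swings from year to year by more than the institution itself could plausibly change, the instrument, not the institution, is the likelier source of the variation.

```python
# Illustrative sketch (hypothetical data): a crude sanity check in the spirit
# of Gingras's "inertia" criterion. A ranking that lets an institution jump
# far more in one year than a university can plausibly change is, like a
# jittery thermometer, suspect as a measuring instrument.
ranks_by_year = {
    "University A": [12, 14, 13, 12],   # small year-to-year drift
    "University B": [12, 31, 9, 27],    # implausibly large swings
}

MAX_PLAUSIBLE_SHIFT = 5  # assumed tolerance for how fast an institution can move

for name, ranks in ranks_by_year.items():
    largest_jump = max(abs(later - earlier) for earlier, later in zip(ranks, ranks[1:]))
    verdict = "plausible" if largest_jump <= MAX_PLAUSIBLE_SHIFT else "suspect"
    print(f"{name}: largest one-year shift = {largest_jump} ({verdict})")
```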

A third, crucial property of indicators is their homogeneity. The number of articles published in leading scientific journals, for example, can measure research output at the national level. But if one were to combine the number of papers with a citation measure — as does the widely known h-index, which measures individual scholars’ productivity and citation impact — indicators become muddled, Gingras says.
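The h-index makes the muddle concrete: it is defined as the largest number h such that a scholar has h papers cited at least h times each. A minimal sketch with made-up citation counts shows how two very different publication profiles collapse into the same number, which is precisely the loss of information that worries Gingras.

```python
# Illustrative sketch (made-up citation counts): the h-index folds two
# heterogeneous quantities -- paper count and citations -- into one number.
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

prolific = [3, 3, 3, 3, 2, 2, 1, 1, 1, 1]   # many modestly cited papers
concentrated = [90, 45, 12]                  # a few highly cited papers
print(h_index(prolific))       # 3
print(h_index(concentrated))   # 3 -- same score, very different profiles
```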

“The fundamental problem with such heterogeneous composite indicators is that when they vary, it is impossible to have a clear idea of what the change really means, since it could be due to different factors related to each of its heterogeneous parts,” the book reads. “One should always keep each indicator separate and represent it on a spiderweb diagram, for example, in order to make visible the various components of the concept being measured.”
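To make the ambiguity concrete, here is a minimal numerical sketch (the indicators and weights are hypothetical, not taken from any real ranking): two entirely different underlying changes move a weighted composite score by exactly the same amount, so the score’s movement by itself says nothing about what actually changed.

```python
# Illustrative sketch (hypothetical weights and indicators): the same change
# in a composite score can arise from completely different causes.
weights = {"papers": 0.4, "citations": 0.4, "intl_faculty": 0.2}

def composite(indicators):
    """Weighted sum of heterogeneous indicators, as many rankings compute."""
    return sum(weights[key] * value for key, value in indicators.items())

year1 = {"papers": 50.0, "citations": 60.0, "intl_faculty": 30.0}
# Scenario A: paper output rises, internationalization unchanged.
year2_a = {"papers": 60.0, "citations": 60.0, "intl_faculty": 30.0}
# Scenario B: output flat, internationalization rises sharply.
year2_b = {"papers": 50.0, "citations": 60.0, "intl_faculty": 50.0}

print(composite(year2_a) - composite(year1))  # 4.0
print(composite(year2_b) - composite(year1))  # 4.0 -- identical change, different causes
```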

A fourth criterion, which Gingras says should go without saying but often doesn’t, is that a higher value of the underlying concept should translate into a higher value of the indicator. Some rankings treat the proportion of foreign faculty members as an indicator of success on the world stage, for example, but a 100 percent foreign faculty isn’t necessarily better than 20 percent foreign.

Academic ‘Moneyball’?

The AAUP earlier this year released a statement urging caution against the use of private metrics providers to gather data about faculty research.

Henry Reichman, a professor emeritus of history at California State University at East Bay who helped draft the document as chair of the association’s Committee A on Academic Freedom and Tenure, said faculty bibliometrics were a corollary to the interest in outcomes assessment, in that the goals of each are understandable but the means of measurement are often flawed. Faculty members aren’t necessarily opposed to the use of all bibliometrics, he added, but they should never replace nuanced processes of peer review by subject matter experts.

Bibliometrics in many ways represent a growing gap, or “gulf,” between administrators and faculty members, Reichman added; previously, many administrators were faculty members who eventually would return to the faculty. While that’s still true in many places, he said, university leaders increasingly have been administrators for many years or are drawn from other sectors, where “bottom lines” are much clearer than they are in higher education.

Brad Fenwick, vice president of global and academic research relations for Elsevier — which runs Scopus, a major competitor of Web of Science and the world’s largest citation database for peer-reviewed literature — said that as a former faculty member he understood some of the criticisms of bibliometrics.

No one metric is a sufficient measure of who’s best, he said, and any database must be “sensitive to the uniqueness of disciplines, which have different ways of communicating and sharing their scholarly work.” Elsevier’s answer was to come up with “lots of different ways of doing it, and being very, very transparent” about their processes, he said. That means faculty members have access to their profiles, for example.

Like Reichman, Fenwick said bibliometrics is not an alternative to peer review, but a complement. He compared administrators’ use of bibliometrics to baseball’s increasingly analytic approach made famous in Michael Lewis’s Moneyball: The Art of Winning an Unfair Game, in which human beings use a mix of their expertise, intuition and data to make “marginally better decisions.” And those decisions aren’t always or usually negative, he said; they might mean an administrator is able to funnel additional resources toward an emerging research focus he or she wouldn’t have otherwise noticed.

Cassidy Sugimoto, associate professor of informatics and computing at Indiana University at Bloomington and co-editor of Scholarly Metrics Under the Microscope: From Citation Analysis to Academic Auditing, said criticisms of bibliometrics for evaluating scholars and scholarship are nothing new, and the field has adjusted to them over time. Issues of explicit malpractice — such as citation stacking and citation “cartels” — are addressed by suppressing data for the offending individuals and journals in citation indicators, for example, she said. And various distortions in interpretation, such as those caused by the “skewness” of citation distributions and by wide variation across disciplines and scholars’ career ages, have been mitigated by the adoption of more sophisticated normalizations.
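One common family of such normalizations, sketched here with synthetic baseline figures, divides a paper’s citation count by the average for papers published in the same field and year, so that each paper is judged against its own reference set rather than across disciplines.

```python
# Illustrative sketch (synthetic baselines): a simple field- and year-normalized
# citation score. A value of 1.0 means exactly the average for that field-year.
field_year_baseline = {
    ("mathematics", 2014): 4.0,    # assumed average citations per 2014 math paper
    ("cell biology", 2014): 28.0,  # assumed average citations per 2014 cell-biology paper
}

def normalized_citation_score(citations, field, year):
    """Citations relative to the field-year average for comparable papers."""
    return citations / field_year_baseline[(field, year)]

# The same raw count of 8 citations means very different things by field:
print(normalized_citation_score(8, "mathematics", 2014))   # 2.0 -- twice the field average
print(normalized_citation_score(8, "cell biology", 2014))  # ~0.29 -- well below average
```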

“Just as with any other field of practice, scientometrics has self-corrected via organized skepticism, and bibliometrics continues to offer a productive lens to answer many questions about the structure, performance and trajectory of science,” she said.

Residual issues arising from the use of indicators for evaluation purposes are “as much sociological as scientometric problems, and are often caused not by the metric itself but in the application and interpretation of the metric,” Sugimoto added via email. “I would not call, therefore, for the cessation of the use of metrics, but rather to apply and interpret them more appropriately — at the right levels of analysis — to provide rigorous analyses of science.”

Summing up his own work, Gingras said that bibliometrics “work best at aggregate levels to indicate trends when indicators are well defined. At the level of individual evaluation they fluctuate too much to replace the usual peer review by experts who know the field and thus know what a productive scientist is.”