What constitutes “research data”? What is “citation”?

(a guest post from Cameron Neylon, Curtin University)

As research moves onto the web, many of us are publishing and sharing our research data. As soon as we think of online data as being published, researchers become concerned about how it is going to be cited. Some researchers are even concerned about how to cite other people’s data properly. Funders, institutional administrators and publishers are also increasingly interested in tracking and monitoring the sharing and usage of research data, and citation can clearly play a role here.

This focus on citation surfaces some interesting issues. What is it about citation that makes it a good way of linking publications, where we are used to citation practice, to data, where we are not? What are the risks of bringing this practice from documents into the data world? In fact, why is it that we are so keen on citation in the academy at all?

Defining data

If we’re talking about data citation we probably need to start with the question of “what is data”. In practice this question is better treated as a pragmatic issue of scope rather than something that can be absolutely determined. A practical approach is that data is any identifiable object that resides at a defined location (generally a URL or dereferenceable identifier) and that either sits in what is generally understood to be a data repository or, where it sits in a generic repository, comes with metadata that clearly asserts the object to be data. In general this post is interested in citable data: data for which a clearly identifiable object, or a location of canonical metadata, is defined. While such objects, or the location of canonical metadata, need not be digital, that is nonetheless the main focus.
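The pragmatic scoping rule above can be written down as a small decision function. This is only a sketch of the reasoning, not a standard: the `ResearchObject` type, the `KNOWN_DATA_REPOSITORIES` set, the host `data.example.org` and the `resource_type` metadata key are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical set of hosts that a community "generally understands"
# to be data repositories. A real list would be a matter of convention.
KNOWN_DATA_REPOSITORIES = {"data.example.org"}


@dataclass
class ResearchObject:
    location: str          # URL or resolvable identifier; empty if none is defined
    repository_host: str   # where the object (or its canonical metadata) lives
    metadata: dict = field(default_factory=dict)


def is_citable_data(obj, data_repositories=KNOWN_DATA_REPOSITORIES):
    """Apply the pragmatic scope rule: a defined location, plus either
    a recognised data repository or metadata asserting 'this is data'."""
    if not obj.location:
        return False  # no identifiable location, so nothing to cite
    if obj.repository_host in data_repositories:
        return True   # data repository: treated as data by default
    # Generic repository: rely on an explicit assertion in the metadata.
    # The key and value used here are assumptions, not a real schema.
    return obj.metadata.get("resource_type") == "dataset"
```

The point of the sketch is that the answer depends on convention (which repositories count) and on assertion (what the metadata says), not on any intrinsic property of the object.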

Defining citation

In examining citation practice and functionality for data it is worth addressing the question of what a citation is, what citation is for, and how it is carried out. First and foremost, citation is a practice and technology derived from formally published scholarly articles and books, developed in the print era and adapted and expanded for online publication. Data citation is a co-option of a set of tools and processes which were designed for a different context. This can be seen in two of the primary motivations discussed above: to expand the citation graph to objects traditionally not covered, and to co-opt the incentive system based around document citation to encourage data publication and sharing.

A citation is a normatively defined and formatted reference from a scholarly text (the citing article, using the SPAR ontology terms) to another scholarly text (the cited article). It generally has a context, the anchor for the citation in the text (the in-text reference pointer), and an entry in a list at the end of the citing article that gives the location of the cited material (the bibliographic reference) with sufficient detail to enable the relevant material to be identified and located. A citation is not a hyperlink, although in today’s online world we expect citations to be linked to the target. Nor is it a notification (or “pingback”) mechanism, although such mechanisms can now clearly be built for digital citations.
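To make the parts of that definition concrete, here is a minimal sketch of a citation as a data structure. The class and field names are my own shorthand for the roles described above, not the SPAR ontology's actual terms or any library's API, and the example values in the usage note are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BibliographicReference:
    """Reference-list entry: enough detail to identify and locate the cited work."""
    authors: Tuple[str, ...]
    year: int
    title: str
    location: str                     # e.g. journal and pages, or a repository name
    identifier: Optional[str] = None  # e.g. a DOI; optional, a citation is not a hyperlink


@dataclass(frozen=True)
class InTextReferencePointer:
    """The anchor in the citing text, together with its context."""
    marker: str    # e.g. "[1]" or "(Author, 2015)"
    context: str   # the sentence in which the marker appears


@dataclass(frozen=True)
class Citation:
    citing_work: str                   # label for the citing article
    pointer: InTextReferencePointer
    reference: BibliographicReference  # describes the cited work

    def is_linked(self) -> bool:
        # A citation exists without a link; an identifier only makes it resolvable.
        return self.reference.identifier is not None
```

For example, `Citation("Citing article", InTextReferencePointer("[1]", "We reused the survey data [1]."), BibliographicReference(("A. Researcher",), 2015, "Example dataset", "Example Repository"))` is a complete citation for which `is_linked()` returns `False`. Nothing in the structure cares whether the cited work is an article or a dataset, which is exactly why the practice is so easy to co-opt.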

Why do we cite?

Citations serve a range of purposes, both technical and social. The reasons for citation are contested and no comprehensive social theory of citation exists. Broadly there are two sets of theories around the motivations for citation: normative and social constructivist. Normative theories hold that citations are a means of expressing norms of the research community, primarily the norm of assigning credit where it is due, and the norm of showing evidence transparently. Normative theories focus on shared practice and purposes that are generally held to be transparent and well understood.

Social constructivist theories focus on reasons for citation that are social or cultural, often differing across communities, and that are not (necessarily) associated with the normative aspects of acknowledging intellectual debt. The Matthew Effect is an example of such a process, in which those articles with more citations get more citations, not due to intellectual merit but due to social reinforcement processes. A more cynical example is the addition of new references demanded by the referees of an article (generally assumed to be a tactic to increase their own citation count).

As a community defines a new field, specific works gain the status of being “formative” and therefore become a “canon” that is recited at the beginning of each new article. Such a process combines aspects of both normative (intellectual debt) and social constructivist (community definition and promotion) theories.

An important realisation therefore is that citations do different things for different people and are created with differing and changing motivations. Citations from articles to data are even more complex because they are a new practice that builds on norms that developed in a different context (document to document citations) and are promoted with differing motivations (improved quality of the scholarly object graph, encouraging data sharing, obtaining credit for “alternative” forms of research contribution). We should therefore expect confusion and differences of opinion in this space as this wide range of differing perspectives comes into play.

An aside on altmetrics

While the focus of this post is not exclusively on incentives or metrics, these are clearly a major motivating factor behind the current interest in data citation. Citations are not the only potential metric of interest or indeed the only metric being collected – many data repositories provide some form of download or usage counts and there is an interest in making these more consistent through initiatives such as COUNTER, one of the aims of the “other” arm of Jisc’s Research Data Metrics for Usage project. Social media mentions associated with DataCite DOIs (and data assigned Crossref DOIs) are collected by Altmetric, a commercial provider of social media and mainstream media attention to research outputs. Kratz and Strasser (2015) report survey data that shows that researchers and repository managers are primarily interested in citations and downloads as metrics of performance.

What does this mean for data citation in practice?

Data citation is evolving rapidly – with major efforts underway within publishers, data repositories and other stakeholders to improve community practice, capture more information on the graph of relationships between research outputs, and ultimately to change the culture of research publication towards greater availability and use of data in general.

At the same time the diversity of referenced objects is increasing. Change is being driven by authors on the ground seeking a way to express normative values of assigning credit and identifying evidence alongside a wide range of motivations that are more social and cultural. This includes, for instance, the political goal of normalising the use of material from outside the formal scholarly canon as relevant to scholarship. Practice, and its many differing motivations, will therefore collide with recommendations and proscriptions from publishers and data providers and conflict should be expected. Identifying these forms of conflict and the underlying differences of perspective behind them will be an important tool in building consensus on best practice.

There is real value in probing the meaning behind citation and using this in turn to guide the pragmatics of systems and technology design. There is a substantial gap between the theoretical work on the motivation behind citation, the work on the analysis of citations, and the work on systems and technology design. These theoretical questions may seem abstruse, or they may seem like issues for which the answers should be common sense. What we find when we dig deeper is that the answers are neither obvious, nor common. And that matters if we want to get the implementation of data citation right.

In the end it is important to remember that, regardless of our individual motivations, the real reason for adopting citation into the world of data is precisely the social and cultural baggage that it carries.
