All about the Data Citation Implementation Pilot

One of the recommendations from the meeting on data citation our Expert Group held this spring was that Jisc should engage more with the “Data Citation Implementation Pilot” (DCIP). This voluntary grouping runs as a sub-group of the Force11 collaboration, and is funded by the NIH as part of the ongoing BioCaddie data discovery index project.

There has been an appetite in numerous places to see the Joint Declaration of Data Citation Principles (JDDCP) move quickly from aspiration to implementation. If you read through the principles themselves (and if you are reading this blog and you haven’t, you might want to) you’ll note that they do not recommend specific technical implementations, metadata schemas or even identifiers. What the DCIP is trying to do is chart a path from the JDDCP’s ideals to reality.

Five expert groups have been established, covering advice and guidance on data citation (FAQ), identifiers, publisher early adopters, repository early adopters and JATS (the Journal Article Tag Suite). Of these, I’ve participated in a number of virtual meetings of the publisher early adopters group since becoming a member of the DCIP. This subgroup has focused on developing a roadmap of the publication process and identifying where the publication and citation of data need to be considered.

When one thinks of citation one perhaps thinks of referring to materials external to the paper in question. But the first task is to solve the issue of referring, from within a paper, to the data that underlies it. There are numerous ways of doing this, with practice fragmented and often hugely variable between subjects and journals. But as journals (note, in particular, last week’s announcement from Springer Nature) begin to require that underlying data is shared, clarity is needed on how and where this is done.

The repository early adopters group are also forging ahead, focusing on machine-readable landing pages and the use of persistent identifiers (along with the identifiers group). They also have a remit to explore community metadata standards. I’ve also seen drafts of certain parts of the FAQs aimed at repository early adopters.

The DCIP may feel like a lot of activity to keep an eye on, but it represents a drawing together of effort from many organisations, groups and individuals who have an interest in getting data citation sorted out.

(After my last post in May, the downloads end of the project continues to move forward. You’ve probably already seen that the IRUS for data pilot now works with Figshare, so the increasing number of institutions formally using Figshare as an institutional data repository can access COUNTER-compliant data download metrics. We’ve welcomed Loughborough and Cranfield on board via this route, with others to follow.)


Progress in research data metrics

It’s May! – and this may be the last post I write about our IRUS-based download project as an “alpha” as we edge ever closer to the “beta” stage.

But I wanted to start by highlighting the much earlier-stage citation work. Energised by a recent meeting, you may have seen two blog posts drawn from work we commissioned from Cameron Neylon – one on current international work on data citation over on the main Jisc RDM blog, the other here looking at the knotty and fascinating issues concerning the nature of citation. As a result I have joined the DCIP working group and look forward to their support on our proposed collection of use cases – and there are other areas of work under consideration.

Continued excitement about Cameron Neylon’s discussion paper on data citation aside, we’re still working hard on our IRUS-based service for research data repositories – there are now 15 test sites actively sending download data and accessing statistics. Later this month we’ll be drawing representatives from these sites together for the first of what I hope will be regular meetings – allowing us to understand at a very detailed level how the data our service produces is used within institutions and research data centres.

Knowing this will of course help us to improve our pilot as it eases gently into “beta”, but we’re also delighted to be able to feed in to the contemporaneous development of the COUNTER code of practice for research data. Long-time readers will recall that Project COUNTER helps our IRUS-based services to identify “real” downloads – filtering out things like multiple clicks and web spiders. This is a huge deal – of 172,416 “downloads” since the inception of our research data IRUS (at the time of writing), only 20,710 can be considered genuine once these rules are taken into account.
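To give a flavour of what this filtering involves, here is a deliberately simplified sketch – not the actual IRUS or COUNTER implementation, and all names are illustrative – of the two rules mentioned above: requests from known robot user agents are excluded, and repeated requests for the same file by the same user within a short window are collapsed into a single download.

```python
from datetime import datetime, timedelta

# Hypothetical window for collapsing "double-clicks"; the real COUNTER
# code of practice defines its own thresholds and rules.
DOUBLE_CLICK_WINDOW = timedelta(seconds=30)

def count_genuine_downloads(events, robot_agents):
    """events: list of (timestamp, user_id, file_id, user_agent) tuples,
    sorted by timestamp. robot_agents: set of known robot user agents.
    Returns the count of downloads surviving both filters."""
    last_counted = {}  # (user_id, file_id) -> timestamp of last counted download
    genuine = 0
    for ts, user, file_id, agent in events:
        if agent in robot_agents:        # rule 1: drop known robots/spiders
            continue
        key = (user, file_id)
        prev = last_counted.get(key)
        if prev is not None and ts - prev < DOUBLE_CLICK_WINDOW:
            continue                     # rule 2: double-click, don't count again
        last_counted[key] = ts
        genuine += 1
    return genuine
```

The real service applies considerably more nuance (shared robot lists, incomplete downloads, and so on), but the shape of the problem – raw hits in, a much smaller genuine count out – is as above.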

But are these rules (currently generalised for all repository contents) the right ones for research data? Already we’ve chosen to count downloads at the file rather than the item level, but what other changes should we make? How should we address – for example – the growing use of research robots to analyse multiple datasets? This is some of what our intrepid repository managers will be debating, and what COUNTER will eventually seek to codify.

We’re also pleased to report that test data has been successfully received by IRUS from both Figshare and Elsevier Pure. As one of the final stages in the integration process, this means we will be able to incorporate download data from both these services (used by numerous institutions and, in the former case, by individual researchers to share research data) in the very near future.

Those of you who have been following Jisc work on a research data shared service for the UK will note that our downloads service will be an integral component of the offer there. A benefit of using lightweight and widely-recognised standards is an ability to easily integrate across a range of platforms – so whatever you are using (other than, currently, Converis…) chances are we can get you set up to use the pilot service. Do get in touch if this sounds like something you would like to be involved in.

What constitutes “research data”? What is “citation”?

(a guest post from Cameron Neylon, Curtin University)

As research moves onto the web, many of us are publishing and sharing our research data. As soon as we think of online data as being published, researchers become concerned about how it is going to be cited. Some researchers are even concerned about how to cite other people’s data properly. Funders, institutional administrators and publishers are also increasingly interested in tracking and monitoring the sharing and usage of research data, and citation can clearly play a role here.

This focus on citation surfaces some interesting issues. What is it about citation that makes it a good way of linking publications, where we are used to citation practice, to data, where we are not? What are the risks of bringing this practice from documents into the data world? In fact, why is it that we are so keen on citation in the academy at all?

Defining data

If we’re talking about data citation we probably need to start with the question of “what is data”. In practice this question is better treated as a pragmatic issue of scope rather than something that can be absolutely determined. A practical approach is that data is any identifiable object that resides at a defined location (generally a URL or dereferenceable identifier) that is generally understood to be a data repository or, where it is in a generic repository, where metadata is provided that clearly asserts the object to be data. In general this post is interested in citable data: data for which a clearly identifiable object or location of canonical metadata is defined. While such objects, or the location of canonical metadata, need not be digital, digital objects are nonetheless the main focus.

Defining citation

In examining citation practice and functionality for data it is worth addressing the question of what a citation is, what citation is for, and how it is carried out. First and foremost citation is a practice and technology derived from formally published scholarly articles and books, developed in the print era and adapted and expanded for online publication. Data citation is a co-option of a set of tools and processes which were designed for a different context. This can be seen in two of the primary motivations discussed above, to expand the citation graph to objects traditionally not covered, and to co-opt the incentives system based around document citation to encourage data publication and sharing.

A citation is a normatively defined and formatted reference from a scholarly text (the citing article, using the SPAR ontology terms) to another scholarly text (the cited article). It generally has a context, the anchor for the citation in the text (the in-text reference pointer), and, in a list at the end of the citing article, it gives the location of the cited material (the bibliographic reference) with sufficient detail to enable the relevant material to be identified and located. A citation is not a hyperlink, although in today’s online world we expect citations to be linked to the target. Nor is it a notification (or “pingback”) mechanism, although such mechanisms can now clearly be built for digital citations.
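The anatomy described above can be made concrete with a minimal sketch. This is a hypothetical data model, not part of the SPAR ontologies themselves; the class and field names simply mirror the terms used in the paragraph.

```python
from dataclasses import dataclass

@dataclass
class InTextReferencePointer:
    """The anchor for the citation within the citing text."""
    section: str   # where in the citing article the anchor appears
    marker: str    # e.g. "[12]" or "(Smith et al., 2015)"

@dataclass
class BibliographicReference:
    """The entry in the reference list locating the cited material."""
    cited_entity: str    # identifier of the cited work, e.g. a DOI
    formatted_text: str  # the entry as rendered in the reference list

@dataclass
class Citation:
    """A reference from a citing article to a cited work."""
    citing_article: str                # identifier of the citing article
    pointer: InTextReferencePointer    # the anchor in the text
    reference: BibliographicReference  # the reference-list entry
```

Note that nothing here is a hyperlink or a notification mechanism: the model captures only the normative structure – anchor plus locating reference – that the paragraph describes.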

Why do we cite?

Citations serve a range of purposes, both technical and social. The reasons for citation are contested and no comprehensive social theory of citation exists. Broadly there are two sets of theories around the motivations for citation: normative and social-constructivist. Normative theories hold that citations are a means of expressing norms of the research community, primarily the norm of assigning credit where it is due and the norm of showing evidence transparently. Normative theories focus on shared practice and purposes that are generally held to be transparent and well understood.

Social constructivist theories focus on reasons for citation that are social or cultural, often differing across communities, that are not (necessarily) associated with the normative aspects of acknowledging intellectual debt. The Matthew Effect is an example of such a process in which those articles with more citations get more citations, not due to intellectual merit but due to social reinforcement processes. A more cynical example is the adding of new references required by the referees on an article (generally assumed to be a tactic to increase their own citation count).

As a community defines a new field, specific works gain the status of being “formative” and therefore become a “canon” that is recited at the beginning of each new article. Such a process combines aspects of both normative (intellectual debt) and social constructivist (community definition and promotion) theories.

An important realisation therefore is that citations do different things for different people and are created with differing and changing motivations. Citations from articles to data are even more complex because they are a new practice that builds on norms that developed in a different context (document to document citations) and are promoted with differing motivations (improved quality of the scholarly object graph, encouraging data sharing, obtaining credit for “alternative” forms of research contribution). We should therefore expect confusion and differences of opinion in this space as this wide range of differing perspectives comes into play.

An aside on altmetrics

While the focus of this post is not exclusively on incentives or metrics, these are clearly a major motivating factor behind the current interest in data citation. Citations are not the only potential metric of interest or indeed the only metric being collected – many data repositories provide some form of download or usage counts and there is an interest in making these more consistent through initiatives such as COUNTER, one of the aims of the “other” arm of Jisc’s Research Data Metrics for Usage project. Social media mentions associated with DataCite DOIs (and data assigned Crossref DOIs) are collected by Altmetric, a commercial provider of social media and mainstream media attention to research outputs. Kratz and Strasser (2015) report survey data that shows that researchers and repository managers are primarily interested in citations and downloads as metrics of performance.

What does this mean for data citation in practice?

Data citation is evolving rapidly – with major efforts underway within publishers, data repositories and other stakeholders to improve community practice, capture more information on the graph of relationships between research outputs, and ultimately to change the culture of research publication towards greater availability and use of data in general.

At the same time the diversity of referenced objects is increasing. Change is being driven by authors on the ground seeking a way to express normative values of assigning credit and identifying evidence, alongside a wide range of motivations that are more social and cultural. This includes, for instance, the political goal of normalising the use of material from outside the formal scholarly canon as relevant to scholarship. Practice, and its many differing motivations, will therefore collide with recommendations and prescriptions from publishers and data providers, and conflict should be expected. Identifying these forms of conflict and the underlying differences of perspective behind them will be an important tool in building consensus on best practice.

There is real value in probing the meaning behind citation and using this in turn to guide the pragmatics of systems and technology design. There is a substantial gap between the theoretical work on the motivation behind citation, the work on the analysis of citations, and the work on systems and technology design. These theoretical questions may seem abstruse, or they may seem like issues for which the answers should be common sense. What we find when we dig deeper is that the answers are neither obvious, nor common. And that matters if we want to get the implementation of data citation right.

In the end it is important to remember that, regardless of our individual motivations, the real reason for adopting citation into the world of data is precisely because of the social and cultural baggage that it carries.

Webinar on IRUSdata UK

Here are the slides from the webinar about IRUSdataUK on 24th Feb. A recording is also available – note that this will play via the Blackboard Collaborate platform.

A very similar session, with a live demo and more chance for questions and discussion, will be held at DigiFest16 on day 1.

If you are in the UK and run a data repository, we’d love to have you on board. Please drop David Kernohan at Jisc a note with details of your repository and we’ll get to work.


A note about IRUSdataUK

IRUS-UK has long been one of the stars in the Jisc service portfolio – offering a simple service (reliable COUNTER-compliant repository download metrics) to just under one hundred UK research repositories. IRUS is a platform-neutral service that allows reliable comparison and benchmarking between ePrints, DSpace and Fedora based repositories – with more platforms being added on an as-needed basis. The service has allowed institutions and repository managers to demonstrate the impact and value of their work, and to plan and grow their own activity. And that’s why we’ve been working to extend it to the emerging world of research data, via our fledgling pilot service IRUSdataUK.

There’s no reason you couldn’t plug IRUS-UK into any kind of repository – and indeed, the current landscape of mixed-use research repositories is represented amongst current users. But to optimise the IRUS platform for the emerging world of research data repositories, we’ve made one fundamental change: to measure the number of downloads at a file rather than an item level, helping the system make the best sense of the huge, multi-part datasets that can make up a single deposit. Other changes will follow as we gain a better understanding of the way statistics are being used.

Most repository platforms are happy to offer their own download statistics. What IRUS offers over and above these is the ability to filter out some of the “noise” that makes it harder to get a sense of what is actually happening: multiple clicks, incomplete downloads, and the growing range of web robots that repeatedly and randomly follow download links.

IRUS, via COUNTER, has a standardised way of dealing with these issues – for instance by sharing information on robots across numerous repositories around the world it can very easily identify and blacklist repeat offenders.

Of course, whereas IRUS-UK has focused on institutional repositories, data workflows in some disciplines involve depositing to subject specific repositories. For the IRUSdataUK pilot service, we’re delighted to be working with the UK Data Service (UKDS) – which stores and shares data relating to ESRC funded projects – as a test site. As of December 2015, their download data is being passed via our experimental IRUS instance. And, of course, where data has a DOI, we can combine records for instances in multiple places.
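Combining records across locations by DOI, as mentioned above, could be sketched roughly as follows. This is an illustrative simplification (the function and record shapes are hypothetical, not the IRUS data model): downloads of the same DOI from different repositories are merged, while records without a DOI are totalled per repository, since they cannot be matched safely.

```python
from collections import defaultdict

def combine_by_doi(records):
    """records: iterable of (repository, doi_or_none, downloads) tuples.
    Returns totals keyed by DOI where one exists; records without a DOI
    are keyed per repository instead, as they cannot be merged."""
    totals = defaultdict(int)
    for repo, doi, downloads in records:
        key = doi if doi else ("no-doi", repo)
        totals[key] += downloads
    return dict(totals)
```

The design choice here is simply that a persistent identifier is the only safe merge key: without one, two records could be the same dataset or two different ones, so they are left separate.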

We are already in discussion with a small number of universities (and subject repositories) to increase our range of IRUSdataUK pilot sites, and will be looking to expand this over the early part of 2016. So do keep an eye on this blog for opportunities to get involved.

And of course, Jisc subscribers can sign up to IRUS-UK simply by sending an email.