FAIRsharing and OpenAIRE de-duplicate repository descriptions

As part of our ongoing collaboration with OpenAIRE, we have aligned and de-duplicated data repository records across multiple registries. This work is an essential step in helping users find the repositories they need within OpenAIRE Explore, and it was made possible thanks to the curation efforts of our FAIRsharing team and our OpenAIRE collaborators! More information from OpenAIRE is available in their recent news item.

Specifically, we have checked our ~1900 records describing data repositories and found the corresponding records in other registries, namely re3data and SciCrunch. As the FAIRsharing records for these repositories are imported by the OpenAIRE Explore system, the user will see only one record per repository with links to the corresponding record in FAIRsharing and the other registries.

Links from OpenAIRE Explore to registries such as FAIRsharing are important because these registries offer additional information about the repositories. FAIRsharing, in particular, provides unique content by inter-linking repositories to the data and metadata standards they implement, and to the policies (of journals, funders and other organizations) that recommend them. Standards are key to FAIR data, so helping users to discover and use repositories that implement FAIR-enabling standards is essential!

Below we show an example of this de-duplication effort using the Dryad record.

Entry for Dryad in OpenAIRE Explore Portal
Entry for Dryad in re3data
Entry for Dryad in SciCrunch
Entry for Dryad in FAIRsharing

To discover matches between FAIRsharing records, records in the other registries, and records already in OpenAIRE Explore, we took the following steps during the mapping process:

  1. Automatically found candidate matches between the records of the two registries being compared, using edit distances over names, URLs and alternative names.
  2. Manually checked non-exact matches with our team of in-house curators.
  3. Automatically updated the FAIRsharing records using Ruby scripts.
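
The first two steps above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the code used in this work (the actual updates were done with Ruby scripts, which are not shown here). The record fields (`id`, `name`, `alt_names`, `url`), the 0.85 threshold, the sample identifiers and all function names are assumptions made for the example. Pairs that score above the threshold but are not exact matches are flagged for manual review, mirroring step 2.

```python
from urllib.parse import urlparse


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise edit distance to a 0..1 score (1.0 means identical)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def normalise_url(url: str) -> str:
    """Reduce a URL to host plus path so scheme and 'www.' do not count as edits."""
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def best_score(record_a: dict, record_b: dict) -> float:
    """Best similarity over names, alternative names and URLs."""
    names_a = [record_a["name"], *record_a.get("alt_names", [])]
    names_b = [record_b["name"], *record_b.get("alt_names", [])]
    name_score = max(similarity(x, y) for x in names_a for y in names_b)
    url_score = similarity(normalise_url(record_a["url"]),
                           normalise_url(record_b["url"]))
    return max(name_score, url_score)


def candidate_matches(ours, theirs, threshold=0.85):
    """Return (our_id, their_id, score, needs_review) for pairs above the threshold."""
    results = []
    for a in ours:
        for b in theirs:
            score = best_score(a, b)
            if score >= threshold:
                # Anything short of an exact match goes to a curator (step 2).
                results.append((a["id"], b["id"], round(score, 3), score < 1.0))
    return results


# Hypothetical sample records; the identifiers are made up for illustration.
ours = [{"id": "fs-1", "name": "Dryad Digital Repository",
         "alt_names": ["Dryad"], "url": "https://datadryad.org"}]
theirs = [{"id": "r3d-1", "name": "DRYAD", "url": "http://www.datadryad.org/"},
          {"id": "r3d-2", "name": "Zenodo", "url": "https://zenodo.org"}]

matches = candidate_matches(ours, theirs)
print(matches)
```

Taking the best score across names, alternative names and URLs keeps the candidate list generous, which suits a workflow where curators review every non-exact match. A stricter variant could require both the name and the URL to agree.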

Existing links from the other registries were used to help the de-duplication process, but they were manually checked because they may contain inconsistencies. We have also created software tools and protocols to perform future automated updates, with manual checks, at regular intervals.

As a result, we found that only about 43% of our records are also present in these other registries, while the rest are unique to FAIRsharing. The coverage of the mapping shows the difference in scope of each of the resources, as well as possible opportunities for future collaboration.

Percentage of matched repositories after the matching process was performed.

This work is reflected in the re3data and SciCrunch cross-references shown within each FAIRsharing record that maps to a record in another registry.

Details of Dryad page in FAIRsharing, with links to other registries.

This work, and the wider OpenAIRE and FAIRsharing collaboration, will also shape activities in the EOSC-Future project, which connects to all EOSC Science Clusters.