“Nothing is created, nothing is lost, everything changes”: measuring and visualising data quality in Europeana.

Presenters: Valentine Charles, Péter Király

Improving metadata quality is a challenge that large digital libraries such as Europeana are confronted with. Metadata mappings, guidelines, and schemas are created to ensure a minimal level of data quality, but they are often insufficient. Communicating about data quality to data providers or data re-users is not an easy task either. How can we assess data quality? Against which criteria should it be judged: conformance to metadata standards, data re-use, or data discovery? Finer-grained indicators are therefore required to convey measures such as the “completeness” or “degree of multilinguality” of the metadata, so that data providers can analyse and improve their metadata and increase its re-usability. Within the Europeana Data Quality Committee, metadata experts and software developers are working hand in hand to identify requirements and define measures for data quality.

The evaluation of the metadata schema is driven by specific discovery scenarios: which are the most important functionalities in the current Europeana services, how do particular metadata fields support them, and how can we score the fulfilment of a specific scenario? This evaluation effort results in several criteria used for measuring the completeness of Europeana metadata records. In addition, the Committee collects known metadata problems in a problem catalog (such as syntactical problems in a particular metadata field). Some of these problems are quite Europeana-specific, but many of them can also be found in other metadata schemas. Our goal is to make these measurements the core of a dedicated framework that will enable data providers and data re-users to assess and improve metadata quality.
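
To make the idea of scoring a scenario concrete, the following is a minimal sketch and not the Committee's actual implementation. It assumes that a record is a mapping from field names to lists of string values, and that each scenario is fulfilled to the degree that its supporting fields are filled. The scenario names, the field-to-scenario assignments, and the function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical scenarios: each lists the metadata fields assumed to support it.
# These groupings are illustrative, not the Committee's actual criteria.
SCENARIOS: Dict[str, List[str]] = {
    "searchability": ["dc:title", "dc:description", "dc:subject"],
    "browsing_by_place": ["dcterms:spatial", "dc:coverage"],
}

Record = Dict[str, List[str]]


def is_filled(record: Record, field: str) -> bool:
    """A field counts as filled if it has at least one non-blank value."""
    return any(value.strip() for value in record.get(field, []))


def scenario_scores(record: Record,
                    scenarios: Dict[str, List[str]] = SCENARIOS) -> Dict[str, float]:
    """Return, per scenario, the fraction of its supporting fields that are filled."""
    scores = {}
    for name, fields in scenarios.items():
        if not fields:
            scores[name] = 0.0  # a scenario without fields cannot be fulfilled
            continue
        filled = sum(is_filled(record, field) for field in fields)
        scores[name] = filled / len(fields)
    return scores


# Example record: the description is present but blank, dc:coverage is missing.
record: Record = {
    "dc:title": ["Map of Budapest"],
    "dc:description": [""],
    "dc:subject": ["maps"],
    "dcterms:spatial": ["Budapest"],
}
scores = scenario_scores(record)
```

Because the scenarios are plain data, another institution could in principle swap in its own field lists without changing the scoring function. A blank value is treated here as missing, which is one possible convention among several.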

On the implementation side we follow two principles: (i) because of the huge number of records, we use easily scalable Big Data technologies to analyse them, such as Apache Spark, the Hadoop Distributed File System, and NoSQL databases; (ii) our intention has always been to create reusable software that might help other institutions as well. The software is therefore designed to be flexible and easily adaptable to different metadata schemas.