The Library as a Centre of Expertise on Text and Data Mining


Presenter: Peter Verhaar

A growing number of researchers at Leiden University are experimenting with the possibilities offered by methodologies based on Text and Data Mining (TDM) and on Machine Learning. Such data-driven forms of research often allow researchers to observe patterns within data sets which are too large to be investigated via more traditional methods. While scholars increasingly recognise that the use of TDM can open up a range of opportunities, they sometimes lack the statistical knowledge or the technical skills which are necessary for these forms of research. Since July 2016, researchers who are interested in adopting such technologies can be supported by Leiden University’s Centre for Digital Scholarship (CDS). The CDS is located physically and organisationally within the university library. It offers support (training, consultancy and services) for Open Access, Data Management, Data Science and Digital Preservation. The support for Data Science includes the exploration and development of tools in support of specific research questions.

Text and Data Mining is an important instrument for Data Science, and an exceedingly broad subject, involving a wealth of different activities. To be able to develop an effective form of support for scholarship based on TDM, it is necessary to develop a clear understanding of the type of services that can be offered. The CDS is currently in the process of developing a roadmap, in which the services to be developed are organised according to the stages of the research lifecycle. The CDS can help scholars to acquire the data that are needed, and, if necessary, it can also help to clean and to enrich the data. Data values may be linked to ontologies, or they can be annotated through processes such as named-entity recognition, part-of-speech tagging or semantic tagging. The actual analysis of the data may be based on technologies such as topic modelling, support vector machines, word2vec or Naive Bayes classification. The CDS also wants to ensure that all the data that are associated with a study can be managed carefully, so that they can be made available for reuse. Where possible, the data need to be curated in full compliance with the FAIR principles.
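
To make one of the analysis techniques named above more concrete, the following is a minimal sketch of Naive Bayes classification in Python, using only the standard library. It is an illustration under stated assumptions, not a description of the tooling used by the CDS: the function names (tokenize, train, classify), the whitespace tokenisation, the add-one smoothing and the four toy training sentences are all invented for this example.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration only.
TRAINING_DOCS = [
    ("the army sent troops to the island", "military"),
    ("soldiers and troops crossed the border", "military"),
    ("the novel tells a family story", "literature"),
    ("her poems and stories were published", "literature"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberate simplification)."""
    return text.lower().split()


def train(labelled_docs):
    """Count documents per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING_DOCS)
print(classify("troops crossed the island", model))
```

In a real project the training set would be far larger, and an established library would normally replace this hand-written version. The sketch only shows the principle: word frequencies per category are turned into probabilities, and these are combined to assign a category to an unseen text.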

Staff of the CDS have expanded their knowledge of TDM and of Machine Learning in part by conducting a number of pilot projects, in close collaboration with researchers. One pilot focused on a project in the field of social psychology, a second concentrated on the computer-based analysis of a large corpus of Sino-Malaysian literature, and a third centred on historical data about the military intervention by the Dutch government in Indonesia after the Second World War.

This paper discusses the various services that the Leiden University Libraries aims to offer in the field of TDM, together with the conceptual model that is being developed for TDM support. These points will be illustrated using some of the main lessons that were learned during the pilot projects.