Improving access to digital collections by semantic enrichment

Presenters: Theo van Veen, Juliette Lonij

The National Library of the Netherlands' collection of digitized historical newspapers contains an abundance of information about events, people, concepts, and more. As a first step towards automatically extracting some of this information from the unstructured text, and using it to improve the findability and usability of our content, we have developed a method that recognizes named entities and links them to knowledge bases such as DBpedia and Wikidata.

Indexing relevant properties and identifiers from these knowledge bases along with the newspaper articles opens up new possibilities for searching articles, grouping search results by semantic relations, and exploring further based on context. We have implemented some of these possibilities in a demonstration web portal. Moreover, we are setting up a crowdsourcing application for manually correcting any remaining errors and obtaining additional training data.
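
To make the indexing idea concrete, the following is a minimal sketch, not the portal's actual implementation. It assumes each article has already been enriched with the identifiers of its linked entities and one property (here an entity type) copied from the knowledge base. The article records, the "kb:E1"-style identifiers, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical enriched articles: the identifiers and types below are
# placeholders, not real DBpedia or Wikidata values.
ARTICLES = [
    {"id": "a1", "title": "Royal visit to Amsterdam",
     "entities": [{"id": "kb:E1", "type": "Person"},
                  {"id": "kb:E2", "type": "Place"}]},
    {"id": "a2", "title": "Harbour expansion approved",
     "entities": [{"id": "kb:E2", "type": "Place"}]},
    {"id": "a3", "title": "Queen opens exhibition",
     "entities": [{"id": "kb:E1", "type": "Person"}]},
]


def build_entity_index(articles):
    """Map each entity identifier to the set of articles that mention it."""
    index = defaultdict(set)
    for article in articles:
        for entity in article["entities"]:
            index[entity["id"]].add(article["id"])
    return index


def group_by_type(articles, article_ids):
    """Group the given articles by the types of the entities they mention."""
    groups = defaultdict(list)
    for article in articles:
        if article["id"] not in article_ids:
            continue
        for entity in article["entities"]:
            bucket = groups[entity["type"]]
            if article["id"] not in bucket:
                bucket.append(article["id"])
    return dict(groups)


index = build_entity_index(ARTICLES)
hits = index["kb:E2"]                    # articles mentioning entity kb:E2
grouped = group_by_type(ARTICLES, hits)  # the same hits, grouped by entity type
print(sorted(hits), grouped)
```

Searching by identifier rather than by surface string means that spelling variants of the same entity resolve to the same articles, provided the linking step was correct.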

We are continuously working to increase the accuracy of the links by exploring new machine learning algorithms and adding new features. Our current focus is on incorporating word and entity embeddings, as we expect these representations to be a valuable addition to our existing, mostly hand-crafted features. They may also be useful for “more like this” search functionality and other information extraction tasks.
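
As an illustration of how embeddings could support “more like this” search, here is a minimal sketch that ranks articles by cosine similarity between their vectors. It assumes each article has already been reduced to a fixed-length embedding. The three-dimensional vectors, the article ids, and the `more_like_this` helper are made up for the example and do not come from the system described above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def more_like_this(query_id, vectors, top_k=2):
    """Return the top_k (article id, similarity) pairs closest to the query article."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy article embeddings (assumed values, for illustration only).
VECTORS = {
    "a1": [0.9, 0.1, 0.0],
    "a2": [0.8, 0.2, 0.1],
    "a3": [0.0, 0.1, 0.9],
}

similar = more_like_this("a1", VECTORS, top_k=2)
print(similar)  # a2 ranks above a3 for these toy vectors
```

For a collection of realistic size, the exhaustive comparison above would normally be replaced by an approximate nearest-neighbour index.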