Automatic NLP-based Classification of Jewish Studies Titles

Presenter: Maral Dadvar

Contributors: Rachel Heuberger, Annette Sasse, Kai Eckert

The FID Judaica is a specialized information service for the field of Jewish studies. The project is under way at the Frankfurt University Library Johann Christian Senckenberg in collaboration with Stuttgart Media University, and aims to create a central point of access by bringing together a wide range of data sources. As part of the project, a specialized LOD (linked open data) platform will be developed for researchers interested in literature relevant to Jewish studies.

In some of the data sources targeted for integration, the Jewish studies titles are already classified and indexed, so they can be added directly to the knowledge base used to enrich the data accessible through the platform. However, a large number of titles are scattered through a pool of data that mixes Jewish studies with many other fields, such as chemistry, biology, and linguistics. As part of the FID Judaica project, we have developed an NLP (natural language processing) based classification tool to identify the Jewish studies titles automatically. This tool is helping us add a large body of valuable literature to the platform, especially from the period 1600 to 1970.

Our approach employs a classification model that processes the limited metadata available for each entry, using word-level and syntax-level features. The word-level features make use of a reference dataset by which a wide range of prominent authors of Jewish literature can be identified. For the syntax-level features, a previously indexed dataset of Jewish studies titles was analyzed to extract the most informative elements of its titles; these elements are then used to identify Jewish studies titles in our non-indexed dataset. The reference datasets are updated frequently.
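To illustrate how the two kinds of features could work together, the following is a minimal sketch and not the project's actual model. The reference author list, the informative terms, the `Entry` type, and the threshold-based decision rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical reference data; the real reference datasets are far larger.
REFERENCE_AUTHORS = {"buber, martin", "scholem, gershom"}
INFORMATIVE_TERMS = {"talmud", "synagoge", "rabbiner", "judentum"}


@dataclass
class Entry:
    title: str
    author: str


def author_feature(entry: Entry) -> int:
    """Word-level feature: 1 if the author is in the reference list, else 0."""
    return int(entry.author.strip().lower() in REFERENCE_AUTHORS)


def title_term_feature(entry: Entry) -> int:
    """Count the informative terms that occur among the title tokens."""
    tokens = entry.title.lower().split()
    return sum(token.strip(".,;:()") in INFORMATIVE_TERMS for token in tokens)


def is_jewish_studies(entry: Entry, threshold: int = 1) -> bool:
    """Assumed decision rule: flag the entry once the combined score reaches the threshold."""
    return author_feature(entry) + title_term_feature(entry) >= threshold
```

With this sketch, `is_jewish_studies(Entry("Ich und Du", "Buber, Martin"))` is flagged through the author match alone, while a chemistry textbook by an unlisted author is not. A real system would presumably weight the features rather than sum them.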

Before the actual analysis, a pre-processing step removes stopwords and data noise and reduces words to their stems. Using our approach, we have analyzed 578,806 entries in the data pool, of which 22,140 have been identified as Jewish studies titles. Performance was evaluated on an independent, manually labeled dataset (n = 18,872). The precision (positive predictive value) was 0.97, the recall (probability of detection) was 0.91, and the F1 score (harmonic mean of precision and recall) was 0.94. The overall accuracy of the system was 89%. The tool was developed with reproducibility and adaptability in mind. Using language detection, it has the potential to support language-specific classification procedures and features. The tool and the reference datasets are open source and can be reused for similar purposes in other libraries.
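The pre-processing step and the F1 calculation described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions: the stopword list and the crude suffix-stripping stemmer are hypothetical stand-ins, and a real pipeline would more likely use language-specific stopword lists and an established stemmer.

```python
import re

# Hypothetical stopword list; a real one would be language-specific.
STOPWORDS = {"der", "die", "das", "und", "the", "of", "and"}
# Crude suffix list standing in for a proper stemmer (assumption).
SUFFIXES = ("ungen", "ung", "en", "er", "es", "e", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(title):
    """Lowercase, drop punctuation and digits (noise), remove stopwords, stem."""
    cleaned = re.sub(r"[^\w\s]|\d", " ", title.lower())
    return [stem(token) for token in cleaned.split() if token not in STOPWORDS]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For example, `preprocess("Die Geschichte der Juden, 1648!")` yields `["geschicht", "jud"]`, and `f1_score(0.97, 0.91)` rounds to 0.94, which matches the reported F1 score.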