KOLIMO: Building and Annotating a Corpus of Modern German Literature

Herrmann J Berenike
Georg-August-Universität Göttingen

Short paper
Corpora and corpus linguistics

This poster reports on philological and computational aspects of building and annotating a literary corpus. As part of the ongoing corpus-stylistic project Q-LIMO (Quantitative Analysis of Literary Modernity), KOLIMO (Korpus der Literarischen Moderne), a representative corpus of modern German narrative literature, is designed to enable quantitative-stylistic analyses across variables such as narrative genre, author, and time. KOLIMO is tagged for part of speech (POS) and enriched with selected types of metadata (e.g., author, date of publication, narrative genre). Although several repositories exist (such as the TextGrid Repository, the German Text Archive [DTA], Gutenberg.de, and Gutenberg.org), no representative digital corpus of German literary modernity (ca. 1880–1930) has been presented so far, much less one that carries consistent, high-quality linguistic annotation and relevant metadata. KOLIMO is hence a unique resource; it will be made publicly available.
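To make the role of the metadata concrete, the following is a minimal sketch of how records carrying author, date of publication, and narrative genre could be filtered to the target period and summarized by genre. The class `TextRecord`, its field names, the helper functions, and the toy records with their word counts are assumptions made for illustration; they do not reflect KOLIMO's actual data model or contents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    author: str
    title: str
    year: int      # date of publication
    genre: str     # narrative genre, e.g. "novel" or "novella"
    n_words: int   # size of the text in words


def in_period(record, start=1880, end=1930):
    """Return True if the text was published within the target period."""
    return start <= record.year <= end


def words_per_genre(records):
    """Sum word counts per narrative genre to inspect corpus balance."""
    totals = Counter()
    for record in records:
        totals[record.genre] += record.n_words
    return dict(totals)


# Toy records with invented values, for illustration only.
records = [
    TextRecord("Author A", "Text 1", 1895, "novel", 120_000),
    TextRecord("Author B", "Text 2", 1912, "novella", 30_000),
    TextRecord("Author C", "Text 3", 1925, "novel", 80_000),
    TextRecord("Author D", "Text 4", 1850, "novel", 90_000),  # outside the period
]

selected = [r for r in records if in_period(r)]
genre_totals = words_per_genre(selected)
```

A summary of this kind is one simple way to check how evenly a selection is distributed over genres before further texts are added.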

Building and annotating KOLIMO poses challenges typical of textual analysis in the Digital Humanities. (1) The first main task is a philological one: selecting the texts to be included in KOLIMO. Philological standards of corpus construction are especially high in terms of editorial detail and consideration of cultural, societal, and philosophical context. KOLIMO is hence balanced for factors such as canonicity, popularity, and (narrative) genre, and strives for clarity regarding the literary edition used. At the same time, striving for representativeness requires substantial amounts of ‘big literary data’ (ca. 30,000,000 words) that, in addition to some digitization (and OCR), need computational preprocessing and processing (such as producing parsed, clean, and consistently encoded texts). (2) The second main task is the reliable linguistic annotation for POS (with the STTS tagset for German), as well as for metadata. Although high-quality POS taggers are available (RFTagger, TreeTagger, MarMoT), these are trained on newspaper texts; for our purposes, their output therefore requires manual error correction (on a sample of ca. 40,000 words), followed by machine learning to facilitate annotation of the entire corpus. In addition to compiling the first representative corpus of modern German narrative literature, our project will thus offer a POS tagger able to cope with the intricacies of German narrative literary texts of the period 1880–1930.
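The comparison between automatic tagger output and a manually corrected sample can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual evaluation procedure: the function `evaluate_tagging` is hypothetical, both inputs are assumed to be aligned lists of (token, tag) pairs, and the example sentence with its "predicted" tags is invented. Only the tag labels themselves (PPER, VVFIN, ART, ADJA, NN, $.) come from the STTS tagset.

```python
from collections import Counter


def evaluate_tagging(gold, predicted):
    """Compare predicted POS tags against a manually corrected gold sample.

    Both arguments are lists of (token, tag) pairs over the same tokens.
    Returns the token-level accuracy and a Counter of (gold_tag, predicted_tag)
    confusions, which can guide error analysis.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    correct = 0
    confusions = Counter()
    for (gold_token, gold_tag), (pred_token, pred_tag) in zip(gold, predicted):
        if gold_token != pred_token:
            raise ValueError(f"token mismatch: {gold_token!r} vs {pred_token!r}")
        if gold_tag == pred_tag:
            correct += 1
        else:
            confusions[(gold_tag, pred_tag)] += 1

    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusions


# Invented toy example: "Er sah den alten Mann ."
gold = [("Er", "PPER"), ("sah", "VVFIN"), ("den", "ART"),
        ("alten", "ADJA"), ("Mann", "NN"), (".", "$.")]
# Hypothetical tagger output with one error (adjective tagged as noun).
predicted = [("Er", "PPER"), ("sah", "VVFIN"), ("den", "ART"),
             ("alten", "NN"), ("Mann", "NN"), (".", "$.")]

accuracy, confusions = evaluate_tagging(gold, predicted)
```

Counting confusions per tag pair, rather than reporting accuracy alone, shows which categories a tagger trained on newspaper text gets wrong most often in literary prose.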

The poster will report on the decisions made at the different levels: philological text selection (author, genre, popularity), computational pre- and post-processing (e.g., preparation of clean texts, semi-automatic annotation, error analysis, software and taggers used, and supervised machine learning), and the degree of accuracy achieved by the KOLIMO POS tagger. In all, our research shows that building and enriching a particular literary corpus is by no means a trivial task; it requires sound theoretical modeling of the phenomenon being constructed and an interdisciplinary method that does justice to both philological and computational criteria of high-quality corpus research.