Corpus-based Typological Comparison

DFG-funded project 2011-2015


There is an extensive body of research that uses corpora to investigate the structure of individual languages. However, there are few quantitative, corpus-based studies of a world-wide typological nature. This project will develop quantitative corpus-based methods for large-scale linguistic comparison. To reach this goal, we propose that it suffices to obtain a good approximation of the structure of each individual language using the same algorithmic procedures for all languages alike.

The goals of this project are threefold. First, we will prepare corpora of lesser-studied languages for typological comparisons. Because of the limited amount of research on these languages, these corpora will mainly be unannotated. To be able to investigate unannotated corpora, we will also prepare a smaller number of parallel corpora as a starting point for the automatic analysis. Second, this project will use existing algorithms and develop new ones to add (approximate) linguistic annotations and extract relevant statistics from the corpora, allowing for the automatic assessment of typological parameters concerning complex sentences. Finally, the main intrinsic goal of this project (to be pursued in the second phase of the Forschergruppe) is to investigate how much linguistic knowledge of a language is needed to establish a particular typological parameter.

  1. Official Title: Algorithmic corpus-based approaches to the typological comparison.

  2. Funding: DFG

  3. Duration: 2011-2014

  4. Principal Investigators: Michael Cysouw & Uwe Quasthoff

  5. Staff: Thomas Mayer & Dirk Goldhahn

  6. Budget: 234,625 EUR

  7. Application: Antrag_P6_Cysouw_Quasthoff.pdf

  8. Webpage: