Language Diversity in Translation.
On the Use of Translations for Investigating Worldwide Language Diversity

2nd Symposium of the Zentrum historische Sprachwissenschaften (ZhS)
Munich, 11 and 12 April 2014

Michael Cysouw / cysouw@uni-marburg.de
Thomas Mayer / thomas.mayer@uni-marburg.de

Using (parallel) texts
for language comparison

  • In recent years, linguists have become aware of the necessity to collect significant amounts of primary data for as many languages as possible (cf. Abney and Bird, 2010)
  • While parallel text corpora (bitexts) have been popular among computational linguists since the advent of statistical machine translation (Brown et al., 1988), there have also been some efforts to compile parallel texts in more than two languages.

Using (parallel) texts
for language comparison (cont'd)

  • The most widely used multilingual text is the Europarl corpus (http://www.statmt.org/europarl/), a collection of proceedings of the European Parliament, which includes versions in 21 European languages
  • There also exist parallel texts for literary works (e.g. Harry Potter, Le Petit Prince, Master i Margarita), mostly available for a set of closely related languages
  • However, only very few of them are freely available or can be regarded as massively parallel texts in the strict sense (Cysouw and Wälchli, 2007).

Why use parallel texts
for language comparison?

  • The parallel structure of the texts allows us to automatically infer differences between languages
  • Parallel texts are an indirect key to semantics by way of distribution
  • Parallel texts are thus an environment where it is possible to operationalize the notion of functional domain
  • According to Miestamo (2005: 293), a functional domain is "any domain of related (semantic or pragmatic) functions that (one or more) language(s) encode with the formal means they possess."

Massively parallel texts
collected in our project

Using (parallel) Bible texts

  • No other book has been translated into so many languages over such a long period of time as the Bible
  • Ever since its first translation, the so-called Septuagint, in 300 BC, the Bible has been the object of the most intense translation activity worldwide (Noss, 2007)
  • A growing number of Bible translations are now available in electronic form on the internet. Yet so far there has been no large-scale parallel Bible corpus that gives researchers easy access to Bible texts (but see Resnik et al., 1999 for an earlier effort to collect such a corpus)

Bible statistics

  • The Protestant biblical canon comprises 66 books of varying textual styles, ranging from poetry to prose literature and legal documents
  • The 66 books are divided into 1,189 chapters and 31,102 verses (The statistics are based on the 1769 edition of the 1611 King James Bible as presented on http://www.biblebelievers.com/believers-org/kjv-stats.html, accessed on April 22nd, 2013.)


Continent or Region                                     Portions  Testaments  Bibles  Total
Africa                                                       227         334     182    743
Asia                                                         207         265     146    618
Australia/New Zealand/Pacific Islands                        138         271      40    449
Europe                                                       107          41      62    210
North America                                                 41          30       8     79
Caribbean islands/Central America/Mexico/South America       101         299      36    436
Constructed Languages                                          2           0       1      3
TOTAL                                                        823       1,240     475  2,538

Statistical summary of languages in which at least one book of the Bible had been registered as of December 31, 2001 (Source: http://www.unitedbiblesocieties.org)

A short history of Bible translation

  • The history of Bible translation can be roughly divided into four major periods (Jinbachian, 2007)
    1. 532 BC - 700 AD: early translations
    2. 700 - 1500 AD: Arab Islamic empire, Slavonic translations
    3. 1500 - 1800: invention of the printing press, translations of the Protestant reformers
    4. 1800 - present: explosion of new Bible translations in different parts of the world
  • Stephen Langton, Archbishop of Canterbury from 1207 to 1228, divided the Bible into chapters
  • Robert Estienne created a system of verse numbering that was first used throughout in Pierre Robert Olivétan’s French Bible in 1553 (Ellingworth, 2007, p. 114)

Paralleltext.info


  • Current status of the Bible corpus
  • Formats used
    • File format
    • File name conventions (BCP 47)
  • Web interface
  • Collaboration

Current status of the Bible corpus

  • Version 1.0
  • Statistics on ISO codes and translations
    (different translations, different diachronic stages, different dialects)
  • Map of languages in the corpus
  • We have checked for duplicate translations
    (wrong language names/ISO codes, different formatting)

Statistics on the Bible corpus

  • 994 texts, 839 different language codes
  • Resource              Number of translations
    bible.is              372
    scriptureearth.org    197
    pngscriptures.org     188
    Dahl                  140
    Unboundbible           97
  • Average number of verses per translation:
    10,707 (SD: 7,727)
  • Total number of different verses: 41,964 (compare to 31,102 in KJV)

Languages in the Bible corpus

(839 different codes)

Language families in the corpus

Statistics on the Bible corpus II

  • translation with highest number of verses:
    eng-x-bible-engkj [36,986]
  • translation with lowest number of verses:
    wed-x-bible-wedau (Wedau) [677]
  • verse with highest number of translations:
    41001007 [976] (18 missing!)
    → There is not a single verse that is available for all translations!
  • average number of words per translation: 408,973 (SD: 367,572)
  • average number of types per translation: 21,176 (SD: 15,134)

Text Preparation

Texts are prepared with (automatic) linguistic analysis in mind (not theological use)
  • Bare base texts
  • No analysis included: all analysis will happen as stand-off
  • No headings, no footnotes, no cross-references
  • Unicode NFC and checks on character harmonization
  • Punctuation separated from words (problematic step!)
  • No harmonization of capitalization
  • Missing lines: checking for inconsistent encoding of originals
  • Combined translations: marked as empty verses
  • Collect metadata on translations and copyright

Example: Arifama-Miniafia [aai]

  • The right single quotation mark (U+2019) stands for the glottal stop (Wakefield, 1992).
  • 40008009: Anayabin ayu i roubabaruwen ana fair biyauumaim emaam , naatu baiyowayah etei ayu babumaim temaam , imih baiyowayan orot ta isan anao , ‘ Niimaim kwen , i boro nan , naatu orot ta isan anao , ‘ Iti imaim kuna , i boro nan , naatu au bowayan orot ta isan anao , iti kusinaf , i boro nasinaf , imih turawat kuo au orot boro nayawas . ”
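
Why separating punctuation is a problematic step can be illustrated with a minimal Python sketch. It is not the project's actual preparation script: the function name `separate_punctuation` and the `word_chars` parameter are hypothetical, and the strings with U+2019 are made-up placeholders, not Arifama-Miniafia words. The sketch normalizes to Unicode NFC and then puts spaces around every punctuation character, unless that character is declared to be a letter in the orthography at hand.

```python
import unicodedata


def separate_punctuation(text: str, word_chars: str = "") -> str:
    """Normalize to NFC and put spaces around punctuation characters.

    word_chars (hypothetical parameter) lists characters that Unicode
    classifies as punctuation but that act as letters in a given
    orthography, e.g. U+2019 used for the glottal stop.
    """
    text = unicodedata.normalize("NFC", text)
    pieces = []
    for ch in text:
        # Unicode general categories starting with "P" are punctuation.
        is_punct = unicodedata.category(ch).startswith("P")
        if is_punct and ch not in word_chars:
            pieces.append(f" {ch} ")
        else:
            pieces.append(ch)
    # Collapse the runs of whitespace introduced above.
    return " ".join("".join(pieces).split())


# Made-up placeholder string: a naive split tears the word apart ...
print(separate_punctuation("a\u2019u, b"))
# ... while declaring U+2019 a word character keeps it intact.
print(separate_punctuation("a\u2019u, b", word_chars="\u2019"))
```

The per-language `word_chars` list is the design choice that matters here: the same code point may be a quotation mark in one translation and a consonant in another, so the decision cannot be made once for the whole corpus.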

File format

  • The information about the book, chapter and verse number is structured as follows
    (e.g. line 3 below: 40-001-003)
    • the first two digits represent the number of the book (e.g., 40 refers to the first book in the New Testament, the Gospel according to Matthew).
    • the next three digits indicate the chapter (e.g., 001 refers to the first chapter in the book)
    • the last three digits show the verse number (e.g., 003 refers to the third verse in the chapter)
40001001\tThe book of the generations of Jesus Chris...\n
40001002\tThe son of Abraham was Isaac ; and the so...\n
40001003\tAnd the sons of Judah were Perez and Zerah...\n
...
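
The line layout above (eight-digit verse ID, tab, verse text) can be read with a few lines of Python. This is a minimal sketch under the assumption that every line follows exactly this layout; the names `Verse` and `parse_line` are illustrative and not part of the corpus tools.

```python
from typing import NamedTuple


class Verse(NamedTuple):
    book: int      # first two digits, e.g. 40 = Matthew
    chapter: int   # next three digits
    verse: int     # last three digits
    text: str


def parse_line(line: str) -> Verse:
    """Split one corpus line into its verse ID fields and its text."""
    verse_id, text = line.rstrip("\n").split("\t", 1)
    if len(verse_id) != 8 or not verse_id.isdigit():
        raise ValueError(f"unexpected verse ID: {verse_id!r}")
    return Verse(int(verse_id[:2]), int(verse_id[2:5]), int(verse_id[5:]), text)


# The third example line from the slide, truncated as in the original.
print(parse_line("40001003\tAnd the sons of Judah were Perez and Zerah...\n"))
```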

Books of the Bible
with their two-digit codes

File name conventions

(according to language-naming convention of BCP 47)

  • ISO-x-bible-TRANSLATION-VERSION
    • ISO 639-3 code
    • x: separator for private codes in BCP 47
    • bible: tag for texts in the parallel Bible corpus
    • TRANSLATION: tag for the specific translation
      e.g. "wosera" vs. "maprik" (dialects of Ambulas), "elberfelder" (specific German translation)
    • VERSION: version number of our corpus
  • Each verse can be referenced by its URL: e.g., http://paralleltext.info/data/eng-x-bible-engkj-v1/41/001/001/
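
The naming scheme and the verse URL can be handled programmatically along the following lines. This Python sketch makes two assumptions that the slides do not state: that the TRANSLATION tag itself contains no hyphen, and that the version part may be absent (as in names like eng-x-bible-engkj). The function names are hypothetical; the URL layout is taken from the example above.

```python
def parse_text_name(name: str) -> dict:
    """Split a name of the form ISO-x-bible-TRANSLATION[-VERSION]."""
    iso, marker, rest = name.partition("-x-bible-")
    if not marker or len(iso) != 3 or not rest:
        raise ValueError(f"not a corpus text name: {name!r}")
    # Assumption: the translation tag contains no hyphen, so the first
    # hyphen (if any) separates it from the version.
    translation, _, version = rest.partition("-")
    return {"iso": iso, "translation": translation, "version": version or None}


def verse_url(name: str, verse_id: str,
              base: str = "http://paralleltext.info/data") -> str:
    """Build the URL of a verse from a text name and an 8-digit verse ID."""
    if len(verse_id) != 8 or not verse_id.isdigit():
        raise ValueError(f"unexpected verse ID: {verse_id!r}")
    book, chapter, verse = verse_id[:2], verse_id[2:5], verse_id[5:]
    return f"{base}/{name}/{book}/{chapter}/{verse}/"


print(parse_text_name("eng-x-bible-engkj-v1"))
print(verse_url("eng-x-bible-engkj-v1", "41001001"))
```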

Web application on

http://paralleltext.info/data/

  • Basic functionalities
    • Browse translations (restricted to Book of Mark)
    • Search text in translations and get parallel verses
    • Download word lists (with frequencies)
    • Download sparse matrix of Words x Verses
      (i.e. scrambled words per verse, no copyright)
    • Download complete texts (password protected due to copyright)
  • Alignment demo

Word and sparse matrix file
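
How a Words x Verses matrix relates to a text file can be sketched in Python as follows. The download format used on the website is not shown in these slides, so the coordinate-list output and the function names below are assumptions for illustration only. Because only counts per verse are kept, the word order inside each verse is lost, which is what makes the matrix a "scrambled" version of the text.

```python
from collections import Counter


def word_verse_counts(lines):
    """Count how often each word occurs in each verse."""
    counts = Counter()
    for line in lines:
        verse_id, _, text = line.rstrip("\n").partition("\t")
        for word in text.split():
            counts[(word, verse_id)] += 1
    return counts


def to_coordinate_lists(counts):
    """Return row labels, column labels and (row, column, count) triples."""
    words = sorted({word for word, _ in counts})
    verses = sorted({verse for _, verse in counts})
    row = {word: i for i, word in enumerate(words)}
    col = {verse: j for j, verse in enumerate(verses)}
    triples = sorted((row[w], col[v], n) for (w, v), n in counts.items())
    return words, verses, triples


# Toy input with placeholder words, not real corpus data.
toy = ["40001001\ta b a\n", "40001002\tb c\n"]
words, verses, triples = to_coordinate_lists(word_verse_counts(toy))
print(words, verses, triples)
```

Only the non-zero cells are stored, which keeps the representation small even though most words occur in very few verses.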


Collaboration on base texts

  • We offer to be the central repository for the base texts
    (adding new versions, correcting mistakes, updating metadata)
  • Do you have any corrections? Please just leave a comment on the website
  • We welcome help with cleaning and preparing the texts: please contact us!

Collaboration on analysis

  • Linguistic annotation should be added as stand-off annotation
    • automatic: stemming, morpheme segmentation, named-entity recognition, etc.
    • manual linguistic: glossing, construction identification, etc.
  • Basic form: CSV file with five columns using character counts:
    File name, verse number, start character, end character, annotation
  • File Name               Verse No    Start   End     Annotation          
    abc-x-bible-text-1.0    40030015     26      33      Reflexive              
  • Such files can be distributed independently of our central repository
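
Reading such a five-column file and recovering the annotated span can be sketched in Python as below. The slides do not say whether the character positions are 0-based or 1-based, or whether the end position is inclusive; the sketch assumes 1-based, inclusive positions and a file without a header row. All names are hypothetical, and the verse text in the demo is a placeholder.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Annotation(NamedTuple):
    file_name: str
    verse_id: str
    start: int
    end: int
    label: str


def read_annotations(handle) -> Iterator[Annotation]:
    """Read five-column stand-off annotations from an open CSV file."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        file_name, verse_id, start, end, label = (cell.strip() for cell in row)
        yield Annotation(file_name, verse_id, int(start), int(end), label)


def annotated_span(verse_text: str, ann: Annotation) -> str:
    """Return the annotated substring.

    Assumption: start and end are 1-based and inclusive.
    """
    return verse_text[ann.start - 1:ann.end]


# The example row from the slide, written as CSV.
rows = io.StringIO("abc-x-bible-text-1.0,40030015,26,33,Reflexive\n")
ann = next(read_annotations(rows))
placeholder_text = "x" * 25 + "ABCDEFGH" + "y" * 5  # not real verse text
print(ann.label, annotated_span(placeholder_text, ann))
```

Because the annotation refers to the base text only through file name, verse number and character positions, the base text itself never has to be modified or redistributed.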

Part II

Thank you for your attention!