Jeff Good, Max Planck Institute for Evolutionary Anthropology / The Rosetta Project
Calvin Hendryx-Parker, Six Feet Up

Modeling Contested Categorization in Linguistic Databases

Problem: A fundamental problem in the design of linguistic databases is that much of the information that needs to be encoded is contested; perhaps the best example is the grouping of languages into genetic units. On the one hand, one does not want to exclude contested information from a database, since it could be very useful for certain research purposes. On the other hand, such information must be clearly separated from less contested content so that (i) the database's structure does not unduly rely on contested content and (ii) a given user can choose which kinds of contested content are valuable to them at a given time.

Proposal: The present paper proposes a general database model for encoding contested and non-contested content in linguistic databases. Fundamental to the model is giving each class of data a distinct conceptualization: non-contested content is conceptualized as ``stable'' objects, and contested content is encoded as ``ephemeral'' links to annotations on the objects. For example, an entity like Arabic would be treated as an object with which resources classified as describing Arabic could be associated. However, contested information about Arabic would be encoded as an annotation link to the Arabic object, indicating, for example, whether Arabic should be considered a language with divergent dialects or a small family. In fact, many such classifications of Arabic could be included in the database, each with associated provenance information, and users could choose for themselves which links they consider valid.
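The object/link distinction can be illustrated with a minimal Python sketch. It is not the authors' schema: the class names (StableObject, AnnotationLink), the field names, the helper links_for, and the provenance labels are all hypothetical, chosen only to show a stable object carrying no contested content while competing classifications sit in separate, provenance-tagged links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StableObject:
    """Non-contested entity (hypothetical class), e.g. the label 'Arabic'."""
    identifier: str
    name: str


@dataclass(frozen=True)
class AnnotationLink:
    """Contested claim about an object, stored apart from it (hypothetical class)."""
    target: str      # identifier of the StableObject the claim is about
    predicate: str   # kind of claim, e.g. "classified_as"
    value: str       # content of the claim
    provenance: str  # who made the claim (placeholder labels below)


arabic = StableObject("obj:arabic", "Arabic")

# Two competing classifications of the same object; both are kept.
links = [
    AnnotationLink("obj:arabic", "classified_as",
                   "language with divergent dialects", "source-A"),
    AnnotationLink("obj:arabic", "classified_as",
                   "small family", "source-B"),
]


def links_for(obj, all_links, trusted_sources):
    """Return only the links about obj whose provenance the user accepts."""
    return [
        link for link in all_links
        if link.target == obj.identifier and link.provenance in trusted_sources
    ]


# A user who accepts only source-B sees Arabic as a small family.
chosen = links_for(arabic, links, trusted_sources={"source-B"})
print([link.value for link in chosen])
```

Because the links never modify the object, adding or discarding a classification leaves the resources associated with the Arabic object untouched.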

Implementation: Our proposed implementation of this database structure uses a file-system style database for content conceptualized as object-like (in particular, we have used the Zope Object Database (ZODB)) and an RDF database for content conceptualized as link-like (in particular, we have used an SQL database interacting with a Python RDF library). For a given research purpose, the user extracts the objects, and the annotation links referring to those objects, that are useful to them. The objects and links are then merged to create a rich data set for linguistic analysis.
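The extract-and-merge step might look roughly like the following sketch. To stay self-contained it uses a plain dictionary in place of the ZODB object store and a list of tuples in place of the RDF store; these stand-ins, the function name build_dataset, the predicate names, and all sample records are assumptions made for illustration, not the implementation described above.

```python
# Stand-in for the object store (a plain dict here, not ZODB).
objects = {
    "arabic": {"name": "Arabic", "resources": ["resource-1", "resource-2"]},
    "variety-x": {"name": "Variety X", "resources": ["resource-3"]},
}

# Stand-in for the link store: (subject, predicate, object, source) tuples,
# i.e. RDF-like triples plus a provenance field. All values are invented.
links = [
    ("arabic", "hasStatus", "language", "analyst-1"),
    ("arabic", "hasStatus", "family", "analyst-2"),
    ("variety-x", "memberOf", "arabic", "analyst-2"),
]


def build_dataset(objects, links, wanted_objects, accepted_sources):
    """Merge the selected objects with the links the user accepts."""
    dataset = {}
    for key in wanted_objects:
        record = dict(objects[key])  # shallow copy; the store is not modified
        record["annotations"] = [
            (predicate, obj, source)
            for (subject, predicate, obj, source) in links
            if subject == key and source in accepted_sources
        ]
        dataset[key] = record
    return dataset


# A user who accepts only analyst-2's links gets the "family" view.
data = build_dataset(objects, links, ["arabic", "variety-x"], {"analyst-2"})
for key, record in data.items():
    print(key, record["annotations"])
```

The merged data set is built fresh for each request, so a different choice of accepted sources yields a different view over the same stable objects.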

The problem space we have focused on is the genetic and areal classification of dialects, languages, and language families. While this problem is not typically central to the task of language documentation and description, we believe our general database model could be straightforwardly extended to basic problems of grammatical analysis. In particular, typical documentary products, like transcribed texts, could be conceptualized as objects and typical descriptive products, like the grammatical analysis of a sentence, could be conceived of as linking to those objects. The database could then store competing analyses, with users choosing which analysis is most appropriate for their needs. Moreover, if our particular implementation decision to use an RDF database were replicated, this would greatly facilitate connections between databases of annotations and emerging linguistic ontologies.