Archives, morphological analysis, & XML encoding: interdisciplinary methods in the creation of a digital text explorer for Colonial Zapotec manuscripts

Brook Danielle Lillehaugen
Haverford College
George Aaron Broadwell
University at Albany
Michel R. Oudijk
Laurie Allen
Haverford College
Enrique Valdivia
University of Michigan

Zapotec is a language family indigenous to southern Mexico. Today, there are over 50 different Zapotec languages, most of them endangered, spoken primarily in the state of Oaxaca, Mexico, by a total of approximately 425,000 people (INEGI 2010) within a much larger Zapotec ethnic community. The Zapotec language family is on par with the Romance language family in terms of time depth and diversity of member languages. The Zapotecs are one of the major civilizations of Mesoamerica, with cultural traditions going back to 500 B.C. and distinct from those of the better-known Nahua (Aztec) and Maya.

With the arrival of the Spanish in 1519, the colonial period began. Alphabetic writing was introduced and quickly adopted by indigenous peoples. Zapotec has one of the longest records of alphabetic written documents for any indigenous language of the Americas. Over 900 documents written in Zapotec by native scribes have been identified, the earliest from 1565 (Oudijk 2008a: 230). The corpus includes an extensive dictionary (Cordova 1578b), a grammar (Cordova 1578a), a doctrine (Feria 1567), and hundreds of handwritten administrative documents, such as wills and bills of sale. Apart from the work of a handful of ethnohistorians (e.g. Oudijk 2008b, Tavárez 2010, Schrader-Kniffki & Yannakakis 2014), archaeologists (Marcus & Flannery 1994, Zulauf 2013), and linguists (Smith Stark 2003, Rojas Torres 2009), these indigenous Zapotec writings have been largely ignored.

Reading and translating these documents is extremely difficult. The language in these documents is significantly different from modern Zapotec languages. The orthography of such texts is very inconsistent, and there is no fully adequate Zapotec-Spanish or Zapotec-English dictionary for the Colonial varieties of the language (Broadwell and Lillehaugen 2013). The syntax and grammar of these documents also differ from those of modern Zapotec languages. Thus potential users of such documents cannot read them without extensive training. The specialized knowledge required to read these Zapotec documents is not currently available in any centralized source, print or otherwise.

To address this problem, we have developed a digital text explorer, Ticha. This project takes advantage of modern, digital modes of publication to make Colonial Zapotec texts accessible to members of the Zapotec community, the general public, and scholars. Behind the scenes, it utilizes TEI and XML digital encoding practices, combining extant tools in novel ways.

In this talk we describe the Ticha interface and the interdisciplinary and international collaboration involved in the creation and expansion of this digital resource. The project relies on expertise in diverse but overlapping disciplines, including linguistics, ethnohistory, philology, and digital scholarship. Because the site's users are diverse in terms of language and education level, we also consider how to communicate understanding that results from detailed technical analysis to a lay audience, and how to do so on a multilingual platform.

The Ticha interface relies on methodological innovations in combining the encoding from two commonly used standards that are not currently being used in connected ways: the widely used TEI and the XML exported from Fieldworks Language Explorer (FLEx), a database used by linguists that connects a lexicon with morphologically analyzed texts. TEI and FLEx independently allow for the encoding and tagging of data on many structural and semantic levels. However, each also facilitates certain types of data analysis and, therefore, particular ways of searching and navigating patterns in the data and the diverse layers of information in these documents. For example, marking GPS coordinates of landmarks mentioned in the text and providing historical context for covert references to indigenous culture are more easily done in TEI, while representing the morphological complexity of verb forms is better done in FLEx. While there are a handful of other projects utilizing TEI for detailed linguistic analysis, we are unaware of any other project that seeks to connect these two commonly used standards.
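
To make the idea of connecting the two encodings concrete, the following is a minimal sketch in Python, not a description of how Ticha itself is implemented. It assumes that each word in the TEI transcription is a <w> element with an xml:id, and that a simplified FLEx-style interlinear export carries a matching key on each <word> element. That key (the `ref` attribute), the function names, and both sample fragments are hypothetical and exist only for illustration; the word forms and glosses are placeholders, not real Zapotec data.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical TEI fragment: each word token carries an xml:id.
# The forms are placeholders, not real Zapotec words.
TEI_SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><w xml:id="w1">prefroot</w> <w xml:id="w2">other</w></p>
  </body></text>
</TEI>
"""

# Simplified FLEx-style interlinear fragment. The "ref" attribute used as a
# linking key is an assumption made for this sketch.
FLEX_SAMPLE = """
<document>
  <interlinear-text>
    <word ref="w1">
      <item type="txt">prefroot</item>
      <morphemes>
        <morph><item type="txt">pref-</item><item type="gls">GLOSS.A</item></morph>
        <morph><item type="txt">root</item><item type="gls">GLOSS.B</item></morph>
      </morphemes>
    </word>
  </interlinear-text>
</document>
"""


def load_morphology(flex_xml):
    """Map each linking key to a list of (morpheme form, gloss) pairs."""
    analyses = {}
    for word in ET.fromstring(flex_xml).iter("word"):
        key = word.get("ref")
        if key is None:
            continue  # skip words that carry no linking key
        morphs = []
        for morph in word.iter("morph"):
            items = {i.get("type"): (i.text or "") for i in morph.findall("item")}
            morphs.append((items.get("txt", ""), items.get("gls", "")))
        analyses[key] = morphs
    return analyses


def merge_layers(tei_xml, flex_xml):
    """Attach the morphological analysis to each TEI word, in text order."""
    analyses = load_morphology(flex_xml)
    merged = []
    for w in ET.fromstring(tei_xml).iter(f"{{{TEI_NS}}}w"):
        word_id = w.get(f"{{{XML_NS}}}id")
        merged.append({
            "id": word_id,
            "form": (w.text or "").strip(),
            # Words without an analysis get an empty list.
            "morphemes": analyses.get(word_id, []),
        })
    return merged


if __name__ == "__main__":
    for entry in merge_layers(TEI_SAMPLE, FLEX_SAMPLE):
        print(entry)
```

In this sketch the TEI file remains the authority for the text's structure and annotations, while the FLEx export is consulted only for the morpheme breakdown, so each standard keeps doing the job it is better suited to.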

The resulting digital edition presents the Colonial Zapotec texts in multiple connected layers: images of manuscripts are connected with transcriptions, translations, detailed morphological analysis, and cultural and historical notes. The materials themselves contribute to larger global conversations of indigeneity, post-colonialism, and academic responsibility to make findings accessible to the stakeholding communities.

