Support for Arabic Texts in the Archimedes Project

Challenges

The Archimedes project requires a general strategy for incorporating documents in the Arabic language into our digital research library. These documents pose a number of special challenges, many of which arise from the complexity of the Arabic script. In order to work with these texts, we have had to extend our tools to support Arabic script, and to develop means for extracting documents from a variety of legacy formats and bringing them into a single standardized representation.

Software that renders Arabic must be capable of dealing with the complexity of the script. A letter can appear in up to four different forms (glyphs) depending on its context: in isolation, or in word-initial, -medial, or -final position. Further, Arabic text is written from right to left; software that renders Arabic together with text in a left-to-right script (such as English or Greek) must work out the direction of each run of text from its context.
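Both properties can be seen in the Unicode character database. The following is a minimal sketch using Python's standard unicodedata module; it is an illustration only and not part of the Archimedes tools. The letter beh is used as the example, and the helper names base_letter and bidi_classes are assumptions made for this sketch.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter as stored in a text

# Unicode's Arabic Presentation Forms-B block assigns one code point
# to each contextual glyph of the letter.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

def base_letter(glyph):
    """Fold a presentation-form glyph back to the letter it represents."""
    return unicodedata.normalize("NFKC", glyph)

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

for position, glyph in BEH_FORMS.items():
    # All four glyphs fold back to the same underlying letter.
    print(position, unicodedata.name(glyph), base_letter(glyph) == BEH)

# Latin "a", Greek alpha, Arabic beh: a renderer uses these classes
# to decide the direction of each run of text.
print(bidi_classes("a\u03B1\u0628"))
```

The stored text normally contains only the abstract letter; choosing the glyph and ordering the runs is left to the rendering software.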

Prior to the development of Unicode, a large variety of incompatible schemes existed for the encoding of Arabic text. Many of these were proprietary and tied to the ad hoc arrangement of a particular font. Documentation of such schemes is limited at best.

Strategy

In order to deal with these challenges, we have developed software for converting data into our standardized format. Further, we have extended our core software tools to deal with the Arabic language and writing system.

Data Format

All textual data in the Archimedes project are stored as XML. For encoding Arabic, we use the Buckwalter transliteration, in which standard Arabic orthography is represented using only ASCII characters. Each Buckwalter symbol corresponds to exactly one character in the Arabic code block of Unicode, so text can be converted in either direction without loss.
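The one-to-one correspondence means that conversion reduces to a table lookup in each direction. The sketch below uses the standard published Buckwalter table. It is an assumption that the Archimedes files follow this table exactly; a variant may be needed for symbols such as <, > and &, which are special in XML. The function names are hypothetical.

```python
# Standard Buckwalter symbols, listed in the order of the code points below.
# Assumption: the variant used in Archimedes XML may differ for < > and &.
SYMBOLS = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "_fqklmnhwYyFNKaui~o" + "`{"
CODE_POINTS = (
    list(range(0x0621, 0x063B))    # hamza through ghain
    + list(range(0x0640, 0x0653))  # tatweel, feh through yeh, vowel marks
    + [0x0670, 0x0671]             # superscript alef, alef wasla
)
assert len(SYMBOLS) == len(CODE_POINTS)

TO_ARABIC = {sym: chr(cp) for sym, cp in zip(SYMBOLS, CODE_POINTS)}
TO_BUCKWALTER = {ar: sym for sym, ar in TO_ARABIC.items()}

def to_arabic(text):
    """Convert Buckwalter ASCII to Arabic script; other characters pass through."""
    return "".join(TO_ARABIC.get(ch, ch) for ch in text)

def to_buckwalter(text):
    """Convert Arabic script to Buckwalter ASCII; other characters pass through."""
    return "".join(TO_BUCKWALTER.get(ch, ch) for ch in text)

word = "quw~ap"  # one fully voweled spelling of quwwa "power"
arabic = to_arabic(word)
print(arabic)
print(to_buckwalter(arabic) == word)  # the round trip is lossless
```

Because many Buckwalter symbols are ordinary Latin letters, a converter of this kind should be applied only to spans known to contain Arabic, not to a whole mixed-language document.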

We have developed a system for converting legacy documents from the 1980s-era Macintosh program al-Kaatib (which no longer works reliably on modern Macintosh systems). In the near future, we intend to make this software available for public use over the World Wide Web.

Core Tools

The Arboreal XML browser is a powerful and general tool developed by the Archimedes Project for content-based access to, and annotation of, XML texts. Arboreal provides built-in support for a number of languages, including Greek, Arabic, and several languages written in cuneiform script.

Arabic language support in Arboreal allows the user to toggle immediately between different views of the text: Arabic script, a standard Romanization, and the native document format (Buckwalter). In the Arabic-script view, the user may choose whether the text is shown fully voweled or unvoweled.
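An unvoweled view can be derived from fully voweled text by removing the vowel marks. The following is a minimal sketch of that idea, not Arboreal's actual implementation. It assumes that "unvoweled" means dropping the marks U+064B through U+0652 (tanwin, short vowels, shadda, sukun) and the superscript alef U+0670; the function name unvoweled is hypothetical.

```python
import re

# Assumption: the marks removed for the unvoweled view are
# U+064B..U+0652 (tanwin, fatha, damma, kasra, shadda, sukun) and U+0670.
VOWEL_MARKS = re.compile("[\u064B-\u0652\u0670]")

def unvoweled(text):
    """Return the text with Arabic vowel marks removed."""
    return VOWEL_MARKS.sub("", text)

voweled = "\u0642\u064F\u0648\u0651\u064E\u0629"  # quwwa, fully voweled
plain = unvoweled(voweled)
print(plain)
print(plain == "\u0642\u0648\u0629")  # qaf, waw, teh marbuta remain
```

The conversion works in one direction only: the voweled form cannot be recovered from the unvoweled one, so the stored document must carry the vowels if both views are to be offered.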

A key feature of Arboreal is dynamic access to morphological and terminological data. Each word in the text is automatically linked to a morphological analysis and to entries in one or more dictionaries. These facilities are already available for six languages, and we anticipate providing them for Arabic within the next two months. We are presently integrating Arabic morphological analysis into our unified framework, Donatus.

For dictionary support, we have acquired a machine-readable text of H. Anthony Salmoné's An Advanced Learner's Arabic-English Dictionary (1889), and we are in the process of digitizing the three-volume Arabic-English Lexicon of Edward William Lane (1863-1893). Both of these dictionaries will be integrated into our Pollux system, which offers direct access to dictionary entries, either through a web browser or within Arboreal. (See a screenshot of the Salmoné dictionary from within Arboreal.)

In addition to the automatic morphology described above, Arboreal also allows the user to annotate technical terms or phrases. These may be grouped into different lists and highlighted in the document with various colors. In this screenshot of Hero's Mechanics the term quwwa "power" is boxed in blue. By clicking on the highlighted term, the user can open Arboreal's term editor, which provides further information and the opportunity for annotation. The term editor, of course, may be used with any of the views (Arabic script, Romanization, etc.). (See another screenshot of the term editor on OS X.)

Arboreal provides several methods for typing Arabic text.


webmaster@archimedes.fas.harvard.edu
Last modified: Sun Jul 6 18:28:13 EDT 2003
