Next: The Donatus Morphology Document Up: The Donatus XML-RPC Interface Previous: API Methods

The Donatus WTAG Document Type

Linguistic data sent to Donatus for analysis via the XML-RPC interface must be submitted in the WTAG format. This format is designed (1) to abstract from arbitrary document structure, (2) to provide data that will be needed for various statistical information processing purposes, (3) to allow for explicit segmentation (tokenization at the word level), and (4) to allow for orthographic normalization of the underlying data.

A WTAG document has four levels:

  1. The root is the element <wtag>. This element takes an obligatory attribute, locator, which specifies a canonical URI for the source document. The scheme of locators (or abstract URIs) employed in the Archimedes project is described in the document "The Arboreal Catalog System."

  2. The language section level: the document contains one section for each natural language that occurs in it. Thus, a document that contains Latin text has the second-level element <section lang="la">.

  3. The container level: the container is the primary semantic unit of interest to users of a particular document type. This unit is discussed in section 2 of the document "The Arboreal Docspecs System." For most documents and users, the container will be a sentence or sentence-like unit. The most frequent container element in the Archimedes DTD is <s>, which tags a sentence, but any container allowed for the relevant document type may appear at the container level. The container element may optionally take an id attribute whose value is an XML ID, unique within the document. These IDs may be useful to information processing tools other than Donatus. Donatus itself uses them with backends that support contextual morphological identification: there it needs an ID in order to construct an XPointer expression that refers to the container within which a context-sensitive form is identified.

  4. The word level: beneath the container are any number of words, orthographically normalized, and tagged as <w>.
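The four levels can be assembled programmatically. The following is a minimal sketch in Python using the standard xml.etree.ElementTree module; the function name build_wtag, the shape of its arguments, and the locator value are assumptions made for illustration and are not part of the Donatus interface. It always uses <s> as the container, although other container elements are allowed.

```python
import xml.etree.ElementTree as ET


def build_wtag(locator, sections):
    """Build a WTAG tree (hypothetical helper, not a Donatus API).

    `sections` maps a language code to a list of (container_id, words)
    pairs. `container_id` may be None, since the id attribute is optional.
    """
    root = ET.Element("wtag", {"locator": locator})                # level 1: root
    for lang, containers in sections.items():
        section = ET.SubElement(root, "section", {"lang": lang})   # level 2: language section
        for container_id, words in containers:
            attrs = {"id": container_id} if container_id else {}
            s = ET.SubElement(section, "s", attrs)                 # level 3: container
            for word in words:
                ET.SubElement(s, "w").text = word                  # level 4: word
    return root


doc = build_wtag(
    "urn:example:source-document",  # placeholder; real locators follow the Archimedes scheme
    {"la": [("Lucr.1.1", ["Aeneadum", "genetrix", "hominum", "divomque", "voluptas"])]},
)
print(ET.tostring(doc, encoding="unicode"))
```

The words passed in are assumed to be already normalized; the sketch only arranges them into the required nesting.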

Example: the sentence

<s id="Lucr.1.1">Aeneadum genetrix hominum divomque voluptas ...</s>

is represented in WTAG as:

<s id="Lucr.1.1">
<w>Aeneadum</w>
<w>genetrix</w>
<w>hominum</w>
<w>divomque</w>
<w>voluptas</w>
...
</s>
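A conversion of this kind might be sketched as follows. This is an illustration only: the function segment_container is hypothetical, and it tokenizes on whitespace alone, whereas real input would also need punctuation handling and the orthographic normalization that WTAG expects. The elided remainder of the verse is left out of the sample input.

```python
import xml.etree.ElementTree as ET


def segment_container(source_xml):
    """Return a copy of a container element with its text split into <w> children.

    Hypothetical helper: splits on whitespace only and performs no
    orthographic normalization.
    """
    src = ET.fromstring(source_xml)
    out = ET.Element(src.tag, dict(src.attrib))   # keep the container tag and its id
    text = "".join(src.itertext())                # flatten any inline markup to plain text
    for token in text.split():
        ET.SubElement(out, "w").text = token
    return out


s = segment_container('<s id="Lucr.1.1">Aeneadum genetrix hominum divomque voluptas</s>')
print(ET.tostring(s, encoding="unicode"))
```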

If a container has embedded text in another language, that material must appear at the word level under the section for that language, inside the appropriate container elements and with the appropriate IDs (possibly repeated from another language section).
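One way to picture this rule is as a routing step: each word goes to the section for its own language, and the container ID is repeated in every section that receives words from that container. The sketch below assumes words arrive already tagged with a language code. The function name build_sections, the sample words, the container ID "ex.1", the locator, and the code "grc" for Greek are all invented for illustration and are not taken from the Donatus documentation.

```python
import xml.etree.ElementTree as ET


def build_sections(locator, containers):
    """Distribute language-tagged words into per-language sections.

    `containers` is a list of (container_id, [(lang, word), ...]) pairs.
    Hypothetical helper, not part of the Donatus interface.
    """
    root = ET.Element("wtag", {"locator": locator})
    sections = {}                       # lang -> <section> element
    for container_id, tokens in containers:
        per_lang = {}                   # lang -> this container's <s> within that section
        for lang, word in tokens:
            if lang not in sections:
                sections[lang] = ET.SubElement(root, "section", {"lang": lang})
            if lang not in per_lang:
                # The same container ID is repeated in each language section.
                per_lang[lang] = ET.SubElement(sections[lang], "s", {"id": container_id})
            ET.SubElement(per_lang[lang], "w").text = word
    return root


mixed = build_sections(
    "urn:example:source-document",      # placeholder locator
    [("ex.1", [("la", "dicitur"), ("grc", "λόγος"), ("la", "ratio")])],
)
print(ET.tostring(mixed, encoding="unicode"))
```

Note that word order across languages is not preserved by this layout; only the shared container ID ties the embedded material back to its original sentence.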


Malcolm D. Hyman 2004-04-07