Next: The Donatus Morphology Document
Up: The Donatus XML-RPC Interface
Previous: API Methods
Linguistic data sent to Donatus for analysis via the XML-RPC interface
must be submitted in the WTAG format. This format is designed
(1) to abstract from arbitrary document structure, (2) to provide the
data needed for statistical information processing,
(3) to allow for explicit segmentation (tokenization at the
word level), and (4) to allow for orthographic normalization of the
underlying data.
A WTAG document has four levels:
- The root is the element <wtag>. This element takes an
obligatory attribute, locator, which specifies a canonical
URI for the source document. The scheme of locators (or
abstract URIs) employed in the Archimedes project is described in
the document "The Arboreal Catalog System."
- The language section level: the document contains one
section for each natural language that occurs in the document. Thus,
a document that contains Latin text will possess the second-level
tag <section lang="la">.
- The container level: the container is a concept that
represents the primary semantic unit of interest to users of a
particular document type. This unit is discussed in section 2 of the
document "The Arboreal Docspecs System." For most documents and
users, the container will be a sentence or
sentence-like unit. The most frequent container element in
the Archimedes DTD is <s>, which tags a sentence. Any
allowable container for the relevant document type may appear at the
container level. The container element may optionally take an
id attribute whose value is an XML ID, unique within the
document. These IDs may be useful to information processing tools
other than Donatus. Donatus itself uses them with backends that
support contextual morphological identification: there it needs an ID
in order to construct an XPointer expression that refers to the
container within which a context-sensitive form is identified.
- The word level: beneath the container are any number of
words, orthographically normalized, and tagged as <w>.
Example: <s id="Lucr.1.1">Aeneadum genetrix hominum divomque voluptas ...</s>
is represented in WTAG as:
<s id="Lucr.1.1">
<w>Aeneadum</w>
<w>genetrix</w>
<w>hominum</w>
<w>divomque</w>
<w>voluptas</w>
...
</s>
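The four levels described above can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration under stated assumptions: the function name build_wtag and the locator value are hypothetical, and plain whitespace splitting stands in for the real tokenization and orthographic normalization that WTAG expects.

```python
import xml.etree.ElementTree as ET


def build_wtag(locator, lang, containers):
    """Build a single-language WTAG tree (hypothetical helper).

    containers: iterable of (container_id, text) pairs;
    container_id may be None, since the id attribute is optional.
    """
    # Level 1: the root element with its obligatory locator attribute.
    root = ET.Element("wtag", {"locator": locator})
    # Level 2: one section per natural language.
    section = ET.SubElement(root, "section", {"lang": lang})
    for container_id, text in containers:
        # Level 3: the container, here always <s>.
        attrs = {"id": container_id} if container_id else {}
        s = ET.SubElement(section, "s", attrs)
        # Level 4: one <w> per word.
        # Assumption: whitespace splitting replaces real tokenization
        # and orthographic normalization.
        for token in text.split():
            ET.SubElement(s, "w").text = token
    return root


doc = build_wtag(
    "urn:example:lucretius",  # hypothetical locator, not an Archimedes URI
    "la",
    [("Lucr.1.1", "Aeneadum genetrix hominum divomque voluptas")],
)
wtag_xml = ET.tostring(doc, encoding="unicode")
print(wtag_xml)
```

The resulting string is the kind of payload that would be passed to Donatus; the XML-RPC call itself is omitted here because it depends on the API methods documented elsewhere.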
If a container has embedded text in another language, that material
must appear at the word level under the section for that language,
with the appropriate container elements and IDs (possibly repeated
from another language section).
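One way to picture this rule is the following Python sketch, which distributes the words of each container across per-language sections and repeats the container ID in every section where that container has words. The function name, the locator, the sample words, the container ID "ex.1", and the language code "grc" are all hypothetical; the sketch also assumes each word already carries a language label.

```python
import xml.etree.ElementTree as ET


def build_multilingual_wtag(locator, containers):
    """Hypothetical helper.

    containers: list of (container_id, [(lang, word), ...]) pairs.
    """
    root = ET.Element("wtag", {"locator": locator})
    sections = {}  # language code -> its <section> element
    for container_id, tagged_words in containers:
        per_lang = {}  # language code -> <s> for this container
        for lang, word in tagged_words:
            if lang not in sections:
                sections[lang] = ET.SubElement(root, "section", {"lang": lang})
            if lang not in per_lang:
                # The container ID is repeated in each language section.
                per_lang[lang] = ET.SubElement(
                    sections[lang], "s", {"id": container_id}
                )
            ET.SubElement(per_lang[lang], "w").text = word
    return root


# Illustrative data only: a Latin container with one embedded Greek word.
root = build_multilingual_wtag(
    "urn:example:mixed",
    [("ex.1", [("la", "dixit"), ("grc", "λόγος"), ("la", "est")])],
)
print(ET.tostring(root, encoding="unicode"))
```

Here the Greek word ends up in its own section under a container with the same ID as the Latin one, so a consumer can relate the two by ID.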
Malcolm D. Hyman
2004-04-07