The current paradigm of the web — in which the user browses, leaving behind a clicktrail that is of interest primarily to marketers — falls far short of the needs of scientists and scholars. Browsing the web is scarcely more interactive than surfing television channels. True interactivity — which will allow the web finally to achieve its potential as a medium for scholarly, political, and social dialogue — demands something other than the current browser/server paradigm. New tools will be needed, whose developers recognize that information consumers are also information producers. Scholarship is an inherently recursive activity, in that the scholar uses existing scholarship to produce new scholarship. Knowledge undergoes a process of accretion, akin to the formation of a pearl; one exemplary model is a page of the Talmud, on which there is a hierarchical arrangement of commentary, super-commentary, annotation, and cross-reference that spreads from center to margin.
Information production is possible within the web browser of today — using Wikis, or content management systems such as Zope. But these tools seem like primitive intruders in an environment that was engineered primarily for publication. True interactivity demands a new tool: not a browser, but an interagent. With these ideas in mind, for the past few years, the Max-Planck-Institut für Wissenschaftsgeschichte has been developing, in collaboration with Harvard University, a prototype interagent called Arboreal. Arboreal allows for flexible, non-linear navigation of arbitrary XML documents and for granular annotation of these documents down to the word- or term-level. Annotations themselves are XML data, which can be shared, published, and further annotated in turn.
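The idea that annotations are themselves XML data, and can therefore be annotated in turn, can be illustrated with a minimal sketch. The element names (`annotation`, `body`), the `target` attribute, and the `#id` reference style are assumptions made for this example; they are not Arboreal's actual annotation format.

```python
import xml.etree.ElementTree as ET

def make_annotation(ann_id, target, body):
    """Build one annotation element pointing at `target` (hypothetical schema)."""
    ann = ET.Element("annotation", {"id": ann_id, "target": target})
    ET.SubElement(ann, "body").text = body
    return ann

# A tiny source document in which every word carries an identifier,
# so that annotations can attach at the word level.
doc = ET.fromstring(
    '<s><w id="w1">arma</w> <w id="w2">virumque</w> <w id="w3">cano</w></s>'
)

annotations = ET.Element("annotations")
# First-order annotation: a comment on a single word of the text.
annotations.append(make_annotation("a1", "#w2", "virum + enclitic -que"))
# Second-order annotation: a comment on the first annotation.
annotations.append(make_annotation("a2", "#a1", "Compare the same construction elsewhere."))

def resolve(target):
    """Find the element a target refers to, in the text or among the annotations."""
    ref = target.lstrip("#")
    for root in (doc, annotations):
        for element in root.iter():
            if element.get("id") == ref:
                return element
    return None

# The annotation layer is ordinary XML and can be serialized and shared as such.
serialized = ET.tostring(annotations, encoding="unicode")
```

Because a target may name either a word or another annotation, the same mechanism covers commentary and super-commentary alike.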
Natural language is the primary means by which humans communicate, though it is supplemented, of course, by formal languages and other symbolic systems and by pictures and other audio-visual media. Yet today's web browsers provide only the crudest tools to support natural language documents. Most linguistic support in browsers is focused on the visual presentation of text in some writing system. Even in this area, the technology comes up short: which browsers can properly render Chinese or Mongolian in their traditional vertical layouts, or adequately deal with Japanese ruby? [1] Beyond display, browsers typically also allow for the searching of text, but again only in an unsophisticated and inflexible way, which is of limited value even for most western European languages and is thoroughly inadequate for highly inflected languages or languages written in complex scripts.
Tomorrow's interagents must provide more sophisticated linguistic capabilities: language technology must be available from within the interagent. Yet this is not to say that language technology should be built into the interagent; such a monolithic approach can only fail users who come from complex and diverging linguistic, ethnic, and professional backgrounds and who have equally heterogeneous needs and interests. Rather, a services-oriented architecture is needed, in which the interagent can communicate with linguistic web services (via, for example, XML-RPC or SOAP) and dynamically acquire new linguistic behaviors (via the dynamic class-loading mechanisms offered by platforms such as Java or .NET). Again, Arboreal has been designed to implement these techniques: via web services it can acquire morphological data that allow for lemmatized searching, lexicon lookup, and other language-based functions. In addition, it supports pluggable language behaviors that allow for dynamic transliteration of writing systems such as Arabic, Greek, and Chinese and for orthographically normalized searching that renders the spelling peculiarities of (e.g.) early modern texts transparent to the user. These are critical and basic functions that will serve in the future as the foundation for a richer set of facilities, including term and keyword discovery, language-neutral searching based on concepts rather than words, automatic summarization, and sophisticated semantic linking.
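The notion of pluggable language behaviors can be made concrete with a small sketch. It is an illustration under stated assumptions and not Arboreal's implementation: an in-process registry stands in for dynamic class loading, the behavior names and function names are hypothetical, the Greek table covers only a handful of letters, and the early modern folding rules (long s, u/v, i/j) are a deliberately crude simplification.

```python
import unicodedata
from typing import Callable, Dict, List

# Registry of language behaviors, keyed by name. In a real interagent these
# would be acquired at run time; here they are simply registered in-process.
BEHAVIORS: Dict[str, Callable[[str], str]] = {}

def behavior(name: str):
    """Decorator that registers a text-to-text function under `name`."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        BEHAVIORS[name] = fn
        return fn
    return register

# Partial, illustrative table; a usable one would cover the whole alphabet.
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ι": "i", "κ": "k",
    "λ": "l", "μ": "m", "ν": "n", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t",
}

@behavior("translit-greek")
def transliterate_greek(text: str) -> str:
    """Strip accents via NFD decomposition, then map letter by letter."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(GREEK_TO_LATIN.get(c, c) for c in bare)

@behavior("normalize-early-modern")
def normalize_early_modern(text: str) -> str:
    """Fold spelling variants so that 'vniuerſe' and 'universe' compare equal."""
    folded = text.lower().replace("ſ", "s")
    return folded.translate(str.maketrans({"v": "u", "j": "i"}))

def normalized_search(tokens: List[str], query: str, name: str) -> List[int]:
    """Return the positions of tokens that match `query` under the named behavior."""
    fold = BEHAVIORS[name]
    key = fold(query)
    return [i for i, token in enumerate(tokens) if fold(token) == key]

tokens = ["the", "vniuerſe", "and", "Iustice"]
hits = normalized_search(tokens, "universe", "normalize-early-modern")
```

The search function knows nothing about any particular language: it looks up a behavior by name and applies it to both the query and the text, so supporting a new orthography means registering one more function.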
[1] Unicode et typographie: un amour impossible?, Document Numérique 3/4 (2002), 105-37. Ruby annotation is dealt with in a W3C recommendation: http://www.w3.org/TR/ruby.