We are currently exploring various automatic approaches to the analysis of document content. We have defined a test corpus of 16 texts in mechanics from medieval and early Renaissance times, including the entire contents of the Filemaker database of Latin texts, Galileo's De Motu Antiquiora, Guidobaldo del Monte's Liber Mechanicorum (1577) and In Duos Archimedis Aequeponderantium libros paraphrasis (1588), and Biancani's Aristotelis Loca Mathematica (1615). The texts are first minimally tagged in XML and then run through the orthographic normalization modules of Arboreal. This yields a body of orthographically normalized word lists, which can then be run through the morphological analyzer, yielding frequency counts for all lexical (i.e. dictionary) forms in all the texts.
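To make the frequency-counting step concrete, here is a minimal sketch in Python. It assumes only that the normalized, morphologically analyzed output is available as a list of lexical forms per document; the names lemmas_by_doc and term_frequencies are illustrative and do not reflect Arboreal's actual interface.

    from collections import Counter

    def term_frequencies(lemmas_by_doc):
        """Map each document name to a Counter of lexical-form frequencies."""
        return {doc: Counter(lemmas) for doc, lemmas in lemmas_by_doc.items()}

    # Toy data standing in for the normalized output of the pipeline.
    lemmas_by_doc = {
        "DeMotu": ["motus", "gravis", "motus", "centrum"],
        "MonteComment": ["centrum", "gravitas", "aequilibrium"],
    }
    tf = term_frequencies(lemmas_by_doc)
    print(tf["DeMotu"]["motus"])  # 2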
We next compute the tf*idf score for all of these lexical forms in all the texts. This score is defined as follows (view slide). For each term i in a document j, we compute the term frequency tf_ij, i.e. the number of occurrences of i in j. We then compute the inverse document frequency of term i as idf_i = log(N / df_i), where N is the number of documents in our corpus (16 in this case) and df_i is the document frequency of i, i.e. the number of documents in which term i occurs. The tf*idf score is then the product tf_ij * idf_i. A term which occurs in all documents has inverse document frequency 0. A term which occurs very often in one document but in very few documents of the corpus has a high inverse document frequency and thus a high tf*idf score, and is therefore a strong candidate for being a term that characterizes the content of the document.
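The sketch below, continuing the one above, computes the scores exactly as just described: the raw count tf_ij multiplied by log(N / df_i). The base of the logarithm is left unspecified here; the natural logarithm is used in the sketch, which only rescales the scores.

    import math
    from collections import Counter

    def tfidf_scores(tf):
        """Turn {doc: {term: count}} into {doc: {term: tf*idf score}}."""
        N = len(tf)
        df = Counter()  # document frequency: number of documents containing each term
        for counts in tf.values():
            df.update(counts.keys())
        return {
            doc: {term: count * math.log(N / df[term])
                  for term, count in counts.items()}
            for doc, counts in tf.items()
        }

    scores = tfidf_scores(tf)
    # A term occurring in every document gets idf = log(N / N) = 0.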
We next represent each document by a vector made up of the tf*idf scores w_i for each term i. (view slide) By comparing these vectors, we can arrive at a quantitative measure of the similarity of two texts to one another. The most interesting metric for these purposes seems to be the cosine similarity measure, defined as the dot product of the two vectors divided by the product of their lengths. (view slide) The cosine similarity of two texts ranges from 0 (the texts have no terms in common) to 1 (the two vectors are proportional, i.e. every term carries the same relative weight in both texts).
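The following sketch shows the cosine similarity computation on these sparse tf*idf vectors, again continuing the scores dictionary from above; the function and document names are illustrative.

    import math

    def cosine_similarity(vec_a, vec_b):
        """Dot product of two sparse vectors divided by the product of their lengths."""
        dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
        norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
        norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)

    print(cosine_similarity(scores["DeMotu"], scores["MonteComment"]))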
Though still extremely crude, this approach has yielded interesting results. For example, in the first slide, one can see that Archimedes' De Centris Gravium is least similar to the Mechanical Problems of Pseudo-Aristotle (in the Tomeo translation) according to the cosine similarity measure. This is in accord with the traditional division of mechanics into an Aristotelian tradition concerned with dynamics and an Archimedean tradition concerned with statics. Furthermore, it is notable that Biancani's text, which contains exegesis of many passages from the Mechanical Problems, is most similar to Pseudo-Aristotle, followed closely by Galileo's De Motu Antiquiora. In the second slide, we see that Archimedes' De Centris Gravium is in fact not very similar to any other text in our corpus; but it is notable that the text most similar to it is, as we might expect, Guidobaldo del Monte's In Duos Archimedis Aequeponderantium libros paraphrasis (1588) (designated as "MonteComment" in the graph). In the final slide, we may observe that the Jordanus texts form a relatively coherent group to the right.
We hope that refinement of these methods will enable us to produce a meaningful classification of documents in the Archimedes Project automatically. Three areas in particular call for further exploration: (1) We should evaluate the effect of including only a select number of lexical forms in the document vectors, say the 10 or 20 most frequently occurring forms in each text. (2) We need to construct vectors for terms as well as for documents, in order to determine which terms tend to be characteristic of the same documents; this is a first step towards automatic thesaurus acquisition (both of these first two ideas are sketched below). (3) We need to apply the method to a large corpus of hand-tagged terms, which we already have for the Latin and Arabic texts in the Filemaker databases.
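As an illustration only, the sketch below shows one way points (1) and (2) might be realized on the data structures used above: restricting a document vector to its top-scoring forms, and transposing the document-term scores into term vectors so that the same cosine measure can compare terms. None of this reflects a settled design.

    def top_k(doc_vector, k=20):
        """Keep only the k highest-scoring lexical forms of a document vector."""
        best = sorted(doc_vector.items(), key=lambda item: item[1], reverse=True)[:k]
        return dict(best)

    def term_vectors(scores):
        """Transpose {doc: {term: score}} into {term: {doc: score}}."""
        vectors = {}
        for doc, terms in scores.items():
            for term, score in terms.items():
                vectors.setdefault(term, {})[doc] = score
        return vectors

    # Terms characteristic of the same documents get a high cosine similarity, e.g.:
    # cosine_similarity(term_vectors(scores)["centrum"], term_vectors(scores)["gravitas"])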
Mark Schiefsky
Malcolm Hyman