Over the years, the members of the Xhosa Department have been involved in various research projects; see the Staff pages and the Products page for some examples. In addition to their own projects, all members are currently involved in one overarching project: the eXe-Files. This project is presented here.
Project title: The eXe-Files: An innovative electronic Xhosa corpus to boost new research and the creation of modern tools
Auspices: Supported by UWC's Department of Research Development, and co-sponsored by an anonymous Publisher and TshwaneDJe HLT.
Period: Start: mid-2006 / Stop: end-2009
Principal researcher: G-M de Schryver (Prof.)
Other project leader: S.J. Neethling (Prof.)
Other co-researchers: T.V. Mabeqa (Ms.); L.K. Mletshe (Mr.); N.L. Mpolweni (Ms.); T. Ntwana Mgijima (Ms.); N. Skade (Mr.); A. van Huyssteen (Mrs.)
Research assistant: S. Dlamini (Mrs.)
Introduction: Computational Linguistics
Computational linguistics is all around us today, although we do not always realise this: a web search engine such as Google, spellcheckers, electronic dictionaries, automated translation from one language into another, and so on, all make use of the results from this field. In nearly all instances, an electronic corpus of language data is one of the core components. In this project, the intention is to build an innovative Xhosa corpus, the eXe-Files, which will then make it possible to research the language in a new way and to produce the first ICT products for and in Xhosa.
'X' stands for the language being worked on and with, Xhosa, while 'exe-files' stands for the tangible outcomes, metaphorically seen as 'executables'.*
* From Wikipedia: ".exe is the common filename extension for denoting an executable ... An executable or executable file, in computer science, is a file whose contents are meant to be interpreted as a program by a computer."
Feasibility: Other South African Examples
Over the past few years, the feasibility of both corpus-based research and corpus-based products has already been proven for a number of South African languages. In 2006, for example, the corpus-based study 'Locative trigrams in Northern Sotho, preceded by analyses of formative bigrams' was published in Linguistics (44/1: 135-193), the top journal of the field, by G-M de Schryver and E. Taljard. This clearly indicates that this new approach to fundamental studies in linguistics can truly reach the international scene, and as a by-product, the African languages are given the widest possible international exposure.
As an example of products, in 2003, corpus-based spellcheckers, commissioned by the Department of Arts and Culture (DAC), were compiled and released by D.J. Prinsloo and G-M de Schryver for all official South African languages. The release of those tools was accompanied by a number of research articles published in the local accredited journals.
Goal: Uplift Xhosa
Over the past decade, the focus in South Africa has primarily been on Northern Sotho and Zulu. The time has now come to uplift Xhosa within the broad field of computational linguistics. In this regard the project team is very fortunate indeed, as it consists of no fewer than five mother-tongue speakers, some of whom have published novels in Xhosa with top publishers such as Oxford University Press. Three language specialists complete the team.
The eXe-Files: General Outline
The first step in the project is to bring the following types of data together: (1) existing electronic files in Xhosa that are available from the project members, (2) freely available Internet data in Xhosa, and (3) selected sections from existing published material in Xhosa. These data are processed according to the very latest principles in corpus building, with the aim of producing a corpus that is both balanced and representative of the language. Although the aim is to reach 'ten million running words of text' (the 'tokens'), these data are only processed computationally: the original format of the material is transformed into what one can view as mostly a set of language statistics, among them the much smaller number of different (unique) orthographic words in the corpus (known as 'types').
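The distinction between tokens and types can be illustrated with a minimal sketch. This is not the project's actual processing pipeline: the tokenisation rule (a word is a run of letters, optionally joined by internal apostrophes or hyphens), the function names, and the short sample text are all assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased orthographic words.

    Assumption (not from the project): a word is a run of letters,
    optionally containing internal apostrophes or hyphens.
    """
    return re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())


def corpus_statistics(text):
    """Return the token count, the type count, and the word frequencies."""
    tokens = tokenize(text)
    frequencies = Counter(tokens)
    return {
        "tokens": len(tokens),        # running words of text
        "types": len(frequencies),    # different (unique) orthographic words
        "frequencies": frequencies,
    }


# Hypothetical sample text, for illustration only.
sample = "Molo! Unjani? Ndiphilile, enkosi. Unjani wena? Ndiphilile nam, enkosi."
stats = corpus_statistics(sample)
print(stats["tokens"], stats["types"])
```

In this toy sample the nine running words reduce to six distinct words, which mirrors, on a very small scale, how a corpus of millions of tokens yields a much smaller list of types.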
In the second step these statistics and the small contexts around them are used to study the Xhosa language in a new way, within a corpus linguistics framework. Basically, all traditional language fields are eligible for study, consequently becoming 'corpus-based literature studies', 'corpus-based education studies', 'corpus-based translation studies', and so on.
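The 'small contexts' around a word are commonly inspected as keyword-in-context (KWIC) lines. The following is a minimal sketch of that general technique, not the tool used by the project; the function name `kwic`, the window size, and the sample tokens are hypothetical.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every hit.

    `window` is the number of tokens kept on each side of the keyword.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Hypothetical, already tokenised sample, for illustration only.
tokens = "molo unjani ndiphilile enkosi unjani wena ndiphilile nam enkosi".split()
hits = kwic(tokens, "unjani", window=2)
for left, word, right in hits:
    print(f"{left:>20} [{word}] {right}")
```

Each line shows one occurrence of the search word with a few neighbouring words on either side, which is the kind of context a corpus-based study would examine alongside the frequency statistics.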
The fields just mentioned are indeed also envisioned as research fields. Other fields that are already being researched are the principles of localisation into Xhosa (from both a linguistic and a cultural perspective), automated part-of-speech tagging (contrasting, among others, machine-learning techniques with finite-state morphological analyses), and lemmatisation approaches (needed in lexicography).
Linking research with tools, all of the following are envisioned as practical outcomes: a new approach to comparative Xhosa literature, the proposal of new (corpus-based) techniques for outcomes-based and task-oriented Xhosa education, automatic Xhosa term extractors, localised Xhosa software, supervised and unsupervised POS-taggers for Xhosa, and finally a new type of corpus-based Xhosa dictionary.
Publications in each of those fields, as well as on the corpus-building process itself, are being prepared. The eXe-Files, in short, provides the team with the building blocks to become a respected international player in computational linguistics, while placing the spotlight on Xhosa.
The eXe-Files: Sub-fields
From 2007 onwards, the project has been split along the lines of the following sub-fields, each with its own manager:
Corpus Building: G-M de Schryver
Lemmatisation and POS-tagging: S.J. Neethling
Localisation and Terminology: N. Skade
Spellchecking and (Machine) Translation: T. Ntwana
Lexicography: A. van Huyssteen