Raymond Hickey, Department of English, University of Essen, Universitaetsstr. 12, D-45117 ESSEN Germany Tel. +49 201 183 3441 Fax. +49 201 183 3437 E-mail. R.Hickey@hrz.uni-essen.de
Since the distribution of the Lexa corpus processing software began in late 1992 a number of changes have been made by the present author. It has been an ongoing project and has benefited from many suggestions of colleagues who have been kind enough to communicate their reactions, frequently in the form of long lists of improvements which they envisaged. I hope that the present version incorporates the majority of these proposals and other changes besides so that those looking for corpus processing software feel they can turn in confidence to the present package.
The main organizational change made for Version 6.0 of Lexa involves memory management. There is now no restriction on the size of files which can be processed or on the number of programmes which can be nested in each other, provided that the computer you are running has enough physical memory available, say 8 MB. This reorganization has been made possible by employing the same memory management technique as that used by Microsoft Windows, which has the additional advantage of making Lexa compatible with this environment. Any or all of the programmes of the present suite can be started from the Programme Manager of Windows and still have enough memory at their disposal for the processing of large files.
Apart from this change in organization, most of the programmes of the Lexa suite have been thoroughly revised. This can be seen when using them: there are additional options, improved interfaces, better data handling. If you compare for instance the previous version of Lexa Context, the programme for finding syntactic contexts, with the current one then this will be immediately obvious. A full description of all additions is given in the update documentation below. This consists of sections on the programmes which have been enhanced. Users can hopefully recognize at a glance what new options are available.
Once again it is a pleasure for me to acknowledge the assistance and encouragement which I have received from the Norwegian Computing Centre for the Humanities which has been kind enough to publish this documentation and distribute the software. In particular my thanks go to Knut Hofland at the Computing Centre who has always been helpful and patient during the realization of this project.
Raymond Hickey Essen,
December 1994
Before describing in detail the alterations to the programmes of the Lexa suite it would seem opportune to offer a general characterization of the changes made.
Co-existence with Windows. The new version of Lexa works effortlessly together with Windows. It can have several megabytes of working memory for texts (the limit is your hardware not the Lexa software). Programmes can be nested one within the other without the slightest difficulty. All can be called from a (revised) single unified desktop, Lexa Desk. This is first loaded from Windows by specifying ldesk (by activating File and Run in the Programme Manager of Windows) and the system takes off from there.
Database management. A number of major changes affect the area of database management. The database manager of the suite now comes in two flavours: a full form, DbStat, for statistical computation on the basis of databases with numerical fields (as before), and a form, DbTxt, with all functions except the statistical options. The generation of new databases has been assigned to a separate module, DbMake, which is loaded automatically from the desktop of both database managers.
The database managers now allow the generation of customized numerical and/or subject indexes. An index can be created from a single database and combined with others to form a composite index.
Interfacing. The interface between data processing programmes and users' word processors has been expanded so that any data processed with a Lexa programme can be transferred to WordPerfect or Word without any difficulty.
The text-database interface has been considerably enlarged. The report form generator ReportDb takes care of this and the additional features are available in the database managers and the text editors.
A special text interface has been included in the database managers which permits the entry of free text of any length associated with the records of a database. This text can be formatted in DbStat/DbTxt, printed separately or transferred to your word processor, such as WordPerfect or Word, with little or no extra editing. Database document files can be linked together and the data in them is transferable to other databases. A typical application of this option would be to create a commented bibliography via a database with associated document file. If you want to work with bibliographical data, you should try the new programme DbBib which facilitates the processing of such data.
Customized fonts and data display. There are many revisions concerning fonts, especially in the retrieval software. Of particular interest for users of the Helsinki corpus is the addition of user-specifiable routines for sorting and searching. Test files are supplied which show how you can view, sort and search for Old English words/phrases with real Old English characters. This is done by converting the escape characters of the Helsinki corpus into real symbols and then using the supplied sort and search files. These put ash between 'ad' and 'af', put thorn and eth in that order after 't', treat yogh and 'g' as equal for retrieval purposes, etc.
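The ordering rules mentioned for Old English characters can be illustrated with a short Python sketch. It is not the format of the sort and search files supplied with Lexa, which is not described here; the names ALPHABET, normalize, sort_key and matches are hypothetical. The sketch assumes that ash is placed by expanding it to 'ae', and that yogh is folded into 'g' for sorting as well as for searching.

```python
# Hypothetical collation sketch; NOT the format of Lexa's own sort/search files.
# Thorn and eth are placed, in that order, directly after 't'.
ALPHABET = list("abcdefghijklmnopqrst") + ["þ", "ð"] + list("uvwxyz")
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def normalize(word):
    """Lower-case a word, expand ash to 'ae' and fold yogh into 'g'."""
    return word.lower().replace("æ", "ae").replace("ȝ", "g")


def sort_key(word):
    """Rank each character; unknown characters sort after the alphabet."""
    return [RANK.get(ch, len(ALPHABET)) for ch in normalize(word)]


def matches(query, word):
    """Substring search in which yogh and 'g' count as the same letter."""
    return normalize(query) in normalize(word)


words = ["af", "æ", "ad", "u", "ð", "þ", "t"]
print(sorted(words, key=sort_key))
```

Expanding ash to 'ae' is only one way of obtaining the position between 'ad' and 'af'; a sort file could equally assign ash a rank of its own.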
Special character processing has been further enhanced by the option of specifying a user font with both text editors and database managers which allows the full ASCII set and a further 256 symbols of your own. There is a sample user font LTXT_CHR.VGA included in the suite with which you can see for yourself how this option works. User fonts can be created and processed with the LinguaFont software by the author of Lexa which is also available from the Norwegian Computing Centre for the Humanities.
All programmes which involve data processing or display can
be run in a 50-line VGA mode. You switch by simply pressing
New initialization files. Because of the increased number of features,
many programmes have initialization files with more parameters than
in the original version of the Lexa suite. If you work with Version
6.0 please ensure that you use the newer versions of initialization
files (the easiest way to be certain that this is the case is to install the
entire suite on your hard disk afresh). There are one or two new
programmes like TxtLook, a flexible browser for ASCII files (like
those of a corpus).
Let me remind users of the present software that the first matter they
have to decide before using the programmes is just what goal they
are trying to attain with the corpus data they intend to process. The
task of tagging a corpus or portion of it is carried out by the main
programme, called simply Lexa. It may well be, however, that this is
not what one wishes to start with. Equally, if not more common, is
the wish to retrieve information from an available corpus, tagged or
not. In this case one should proceed to the retrieval software of the
package (see menu two in the programme Lexa Desk).
You may wish to convert textual data into a database for more
accurate statistical processing. In this case one should look at the
database management software. If you have not acquainted yourself
with database management you might try doing so with the database
processing software of Lexa. By using this kind of programme you
can arrive at results not possible with text editing software.
It should be mentioned at this point that the Lexa suite does not
constitute natural language processing software in the sense of
programmes which make decisions on the grammatical status of input
data. This must be done by the user of Lexa in advance (for instance
when tagging a corpus or part of one). Lexa can only recognize
categories which have been conveyed to it by the user.
What is a text and how is it processed? The primary input data for
Lexa are texts. At its simplest, a text is a series of lines, each of
which ends with a pre-defined pair of characters. Corpora such as
those of the ICAME Collection of English Language Corpora
distributed by the Norwegian Computing Centre for the Humanities
at Bergen are shipped in simple text form. Such texts can be
processed directly by virtually all programmes of the Lexa group; the
only exceptions to this rule are the database managers. Tasks such as
tagging (automatic, manual or a combination of both), concordance
generation, generating lexical density profiles and transferring text to
a database environment are achieved by the main programme Lexa.
Other programmes allow one to edit texts directly, to compare
versions with one another, to gain relevant information on the
structure of texts, or to view a supplied sample corpus from within
a comfortable interface.
If you just want to see what type of tasks the Lexa suite can be used
for you might care to try out the demonstration programme, Lexa
Demo (choose option 'Run demonstration' in first menu of Lexa Desk
while ensuring that you are located in the home directory of Lexa).
What kinds of information retrieval are there? In the Lexa suite three
main types of information retrieval are possible. The first is the
simplest: you are looking for a string and want to know what files it
occurs in (use: Lexa Search). The second is much more powerful:
you wish to find all strings which match a partially specified input
where you may also set a number of search parameters (use: Lexa
Pat). Thirdly there is a programme which will allow you to search for
syntactic contexts (use: Lexa Context). Here you specify a frame by
entering an initial string (or part of one) and a terminal string and set
various parameters such as where the frame is expected (line or
sentence), how large it can be maximally, whether potential finds
require manual confirmation, etc.
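The idea of a frame search can be sketched in Python as follows. This is an illustration under stated assumptions and not the algorithm used by Lexa Context: the function find_frames and its parameter max_gap are hypothetical, the frame is looked for within single lines only, and its size is limited by the number of characters allowed between the initial and the terminal string.

```python
import re


def find_frames(lines, initial, terminal, max_gap=40):
    """Return (line number, matched text) for each frame found in a line.

    A frame is the initial string, then at most max_gap characters,
    then the terminal string. The shortest possible frame is preferred.
    """
    pattern = re.compile(
        re.escape(initial) + r".{0,%d}?" % max_gap + re.escape(terminal)
    )
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            hits.append((number, match.group(0)))
    return hits


sample = ["he was going to sing", "they were singing"]
print(find_frames(sample, "was", "ing", max_gap=10))
```

A search by sentence rather than by line would require the text to be split at sentence boundaries first; manual confirmation could be added by asking the user about each hit before it is accepted.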
Retrieval involves texts in the main. However it is equally
possible to carry out similar operations with databases. Furthermore
retrieval can include the option of replacing finds with some
user-specified string or strings.
Information gleaned from texts can be sensitive to the
parameters and their values contained in Cocoa-style text file headers
as provided for by the Helsinki corpus, for example.
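How header values might be used to select texts can be sketched in Python. The sketch assumes that each header line has the form <X value>, where X is a short parameter code, and that the header lines stand together at the top of a file; the codes and values in the sample data are invented for illustration, and the functions read_header and select_texts are hypothetical.

```python
import re

# Assumed shape of a header line: <CODE value>
HEADER_LINE = re.compile(r"^<(\w+)\s+([^>]*)>\s*$")


def read_header(lines):
    """Collect code/value pairs from the header lines at the top of a text."""
    params = {}
    for line in lines:
        match = HEADER_LINE.match(line)
        if match is None:
            break  # first non-header line: the text proper begins here
        params[match.group(1)] = match.group(2).strip()
    return params


def select_texts(texts, code, value):
    """Return the names of the texts whose header gives `value` for `code`."""
    return [
        name for name, lines in texts.items()
        if read_header(lines).get(code) == value
    ]


# Invented sample data, not taken from any corpus.
texts = {
    "a.txt": ["<B SAMPLE1>", "<X PROSE>", "first line of text"],
    "b.txt": ["<B SAMPLE2>", "<X VERSE>", "first line of text"],
}
print(select_texts(texts, "X", "VERSE"))
```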
How to make the most of databases. Most linguists involved in
corpus processing will have experience with a word processor and
hence be oriented towards data in textual form. The potential to
manipulate data within the confines of a text is quite limited, so
it is strongly recommended that users of corpora consider the option
of transferring part if not all of their data to a database environment
for further processing.
Transfer to a database can be achieved most easily by availing
of the inbuilt options in the programme Lexa (first option in first
menu of Lexa Desk). It allows you to extract all unique word forms
from a set of texts, keep a track of their frequencies, generate a
reverse dictionary and carry out a variety of grammatical analyses
specified by the user on the basis of formal criteria applying to the
data contained in the output database.
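The first three of these operations can be illustrated with a short Python sketch. It does not reproduce the behaviour of the programme Lexa: the functions word_frequencies and reverse_dictionary are hypothetical, a word form is simply assumed to be an unbroken run of letters, and a reverse dictionary is taken to be a list of forms sorted from the end of the word.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count every unique word form (lower-cased) in a set of texts."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including non-ASCII letters
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


def reverse_dictionary(forms):
    """Sort word forms by their spelling read from right to left."""
    return sorted(forms, key=lambda form: form[::-1])


counts = word_frequencies(["The king sang", "the kings sing"])
print(counts["the"])
print(reverse_dictionary(counts))
```

Sorting from the end of the word groups forms with the same ending together, which is what makes a reverse dictionary useful for examining inflections and suffixes.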
Bear in mind the following rule of thumb: if you have large
quantities of structurally similar data which you wish to expand
while still being able to access and manipulate them quickly and
easily, then you should maintain this data in a database environment.
You can sort, extract, combine and filter data selectively in database
form. And you can always re-export data back into a text
environment if you want to (use database managers: DbStat or
DbTxt).
Looking after your files and your hard disk. Last but not least, users
of corpora should pay attention to the maintenance of their data in
file form on the hard disk of the computer. This is of particular
relevance if one makes temporary copies of files (as one should)
when experimenting with them or indeed if one is engaged actively
in the task of corpus compilation. You should store different kinds of
data in different directories on the hard disk and be careful to name
files in a sensible manner so that you can later recognize from their
operating system names what their contents are.
For your convenience, the Lexa suite puts at your disposal a
number of powerful tools for file management. First and foremost one
should mention the file manager Lexa File, one of the major
programmes of the group. You can use it for all file maintenance and
as a launching pad for processing tasks if you wish. To move around
directories use Lexa Dirs and to gain a quick overview of the
contents of a disk, use Lexa Browse. When locating files is an urgent
need, use Lexa Find. Lastly remember that there is a configurable
shell in the Lexa suite with which you can customize the type of
operations you wish to carry out on your corpus data, the
programmes you want to use, etc. (use: Lexa Shell).
1.1 General remarks on using the Lexa suite
1.2 Primary data processing
1.3 Retrieving information
1.4 Using databases
1.5 File management
================================================================
ORDER FORM (June 1995)
I hereby order (please tick one):
_____ Lexa 6.0 : Corpus Processing Software (for MS-DOS, with manuals)
Vol 1. Lexical Analysis and Information Retrieval, 303 pages
Vol 2. Database and Corpus Management, 246 pages
Vol 3. Utility Library, 210 pages
Update Documentation, 223 pages
7 diskettes (3,5" HD)
Price NOK 850
+ postage (within Europe: NOK 110.-,
Rest of the World: NOK 200.-) NOK
-----------
Total price NOK
===========
_____ Lexa 6.0 : Corpus Processing Software (for MS-DOS, with update manual)
Update Documentation, 223 pages
7 diskettes (3,5" HD)
Price (postage included) NOK 350
===========
_____ Lexa 6.0 : Corpus Processing Software (only update manual)
Update Documentation, 223 pages
Price (postage included) NOK 300
===========
Pre-payment is required
Method of payment:
__ Cheque (bank draft, cashier's cheque) made out in Norwegian kroner (NOK)
(or the equivalent in English pounds or U.S. dollars) to:
The Norwegian Computing Centre for the Humanities, Bergen, Norway.
(1 USD is approximately 6.5 NOK, use the current exchange rate)
or
__ Bank account no. 7874 06 32077. SWIFT code: DNBANOKK
Return to:
Norwegian Computing Centre for the Humanities,
Allégt. 27,
N-5007 Bergen,
Norway
Fax: +47 55 58 94 70
Email: icame@hit.uib.no
Name/
Institution:___________________________________________________________
___________________________________________________________
Address: ___________________________________________________________
___________________________________________________________
___________________________________________________________
Tel.: __________________________
Fax: __________________________
E-mail: ___________________________________________________________