Raymond Hickey,
Department of English,
University of Essen,
Universitaetsstr. 12,
D-45117 ESSEN
Germany
Tel.    +49 201 183 3441
Fax.    +49 201 183 3437
E-mail. R.Hickey@hrz.uni-essen.de

LEXA: Corpus Processing Software

Update Documentation: Foreword and Introduction

Foreword

Since the distribution of the Lexa corpus processing software began in late 1992 a number of changes have been made by the present author. It has been an ongoing project and has benefited from many suggestions of colleagues who have been kind enough to communicate their reactions, frequently in the form of long lists of improvements which they envisaged. I hope that the present version incorporates the majority of these proposals and other changes besides so that those looking for corpus processing software feel they can turn in confidence to the present package.

The main organizational change made for Version 6.0 of Lexa involves memory management. There is now no restriction on the size of files which can be processed or the number of programmes which can be nested in each other, assuming that the computer you are running has enough physical memory available, say 8 MB. This re-organization has been made possible by employing the same memory management technique as that used by Microsoft Windows, which has the additional advantage of making Lexa compatible with this environment. Any or all of the programmes of the present suite can be started from the programme manager of Windows and still have enough memory at their disposal for the processing of large files.

Apart from this change in organization, most of the programmes of the Lexa suite have been thoroughly revised. This can be seen when using them: there are additional options, improved interfaces, better data handling. If you compare for instance the previous version of Lexa Context, the programme for finding syntactic contexts, with the current one then this will be immediately obvious. A full description of all additions is given in the update documentation below. This consists of sections on the programmes which have been enhanced. Users can hopefully recognize at a glance what new options are available.

Once again it is a pleasure for me to acknowledge the assistance and encouragement which I have received from the Norwegian Computing Centre for the Humanities which has been kind enough to publish this documentation and distribute the software. In particular my thanks go to Knut Hofland at the Computing Centre who has always been helpful and patient during the realization of this project.

Raymond Hickey                              Essen,
                                     December 1994

1 Introduction

Before describing in detail the alterations to the programmes of the Lexa suite it would seem opportune to offer a general characterization of the changes made.

Co-existence with Windows. The new version of Lexa works effortlessly together with Windows. It can have several megabytes of working memory for texts (the limit is your hardware not the Lexa software). Programmes can be nested one within the other without the slightest difficulty. All can be called from a (revised) single unified desktop, Lexa Desk. This is first loaded from Windows by specifying ldesk (by activating File and Run in the Programme Manager of Windows) and the system takes off from there.

Database management. A number of major changes affect the area of database management. The database manager of the suite now comes in two flavours: a full form, DbStat, for statistical computation on the basis of databases with numerical fields (as before), and a reduced form, DbTxt, with all functions except the statistical options. The generation of new databases has been assigned to a separate module, DbMake, which is loaded automatically from the desktop of both database managers.

The database managers now allow the generation of customized numerical and/or subject indexes. An index can be created from a single database and combined with others to form a composite index.

Interfacing. The interface between data processing programmes and users' word processors has been expanded so that any data processed with a Lexa programme can be transferred to WordPerfect or Word without any difficulty.

The text-database interface has been considerably enlarged. The report form generator ReportDb takes care of this and the additional features are available in the database managers and the text editors.

A special text interface has been included in the database managers which permits the entry of free text of any length associated with the records of a database. This text can be formatted in DbStat/DbTxt, printed separately or transferred to your word processor, such as WordPerfect or Word, with little or no extra editing. Database document files can be linked together and the data in them is transferable to other databases. A typical application of this option would be to create a commented bibliography via a database with associated document file. If you want to work with bibliographical data, you should try the new programme DbBib which facilitates the processing of such data.

Customized fonts and data display. There are many revisions concerning fonts, especially in the retrieval software. Of particular interest for users of the Helsinki corpus is the addition of user-specifiable routines for sorting and searching. Test files are supplied which show how you can view, sort and search for Old English words/phrases with real Old English characters. This is done by converting the escape characters of the Helsinki corpus into real symbols and then using the supplied sort and search files. These put ash between 'ad' and 'af', put thorn and eth in that order after 't', and treat yogh and 'g' as equal for retrieval purposes, etc.
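The ordering rules just mentioned can be illustrated with a short Python sketch. It is not taken from the Lexa sort and search files, whose format is not described here; the names ALPHABET, sort_key, fold_for_search and find_words are hypothetical, and treating ash as the sequence 'ae' is simply one assumed way of placing it between 'ad' and 'af'.

```python
# Assumed collation order: thorn and eth follow 't', in that order.
ALPHABET = "abcdefghijklmnopqrstþðuvwxyz"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def sort_key(word):
    """Build a sort key: ash counts as 'ae', yogh as 'g'."""
    expanded = word.lower().replace("æ", "ae").replace("ȝ", "g")
    # Characters missing from the table sort after all known letters.
    return [RANK.get(ch, len(ALPHABET)) for ch in expanded]


def fold_for_search(text):
    """Treat yogh and 'g' as the same letter when matching."""
    return text.lower().replace("ȝ", "g")


def find_words(pattern, words):
    """Return the words containing the pattern after folding."""
    folded = fold_for_search(pattern)
    return [w for w in words if folded in fold_for_search(w)]


words = ["tun", "ðær", "æfter", "þing", "adl", "afaran", "tid"]
print(sorted(words, key=sort_key))
# ['adl', 'æfter', 'afaran', 'tid', 'tun', 'þing', 'ðær']
```

Keeping the sort key and the search folding in separate functions mirrors the separate sort and search files mentioned above, so that each can be changed without affecting the other.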

Special character processing has been further enhanced by the option of specifying a user font with both text editors and database managers which allows the full ASCII set and a further 256 symbols of your own. There is a sample user font LTXT_CHR.VGA included in the suite with which you can see for yourself how this option works. User fonts can be created and processed with the LinguaFont software by the author of Lexa which is also available from the Norwegian Computing Centre for the Humanities.

All programmes which involve data processing or display can be run in a 50-line VGA mode. You switch by simply pressing (or in the text editors).

New initialization files. Because of the increased number of features, many programmes have initialization files with more parameters than in the original version of the Lexa suite. If you work with Version 6.0 please ensure that you use the newer versions of initialization files (the easiest way to be certain that this is the case is to install the entire suite on your hard disk afresh). There are one or two new programmes like TxtLook, a flexible browser for ASCII files (like those of a corpus).

1.1 General remarks on using the Lexa suite

Let me remind users of the present software that the first matter they have to decide before using the programmes is just what goal they are trying to attain with the corpus data they intend to process. The task of tagging a corpus or portion of it is carried out by the main programme, called simply Lexa. It may well be, however, that this is not what one wishes to start with. Equally, if not more common, is the wish to retrieve information from an available corpus, tagged or not. In this case one should proceed to the retrieval software of the package (see menu two in the programme Lexa Desk).

You may wish to convert textual data into a database for more accurate statistical processing. In this case one should look at the database management software. If you have not acquainted yourself with database management you might try doing so with the database processing software of Lexa. By using this kind of programme you can arrive at results not possible with text editing software.

It should be mentioned at this point that the Lexa suite does not constitute natural language processing software in the sense of programmes which make decisions on the grammatical status of input data. This must be done by the user of Lexa in advance (for instance when tagging a corpus or part of one). Lexa can only recognize categories which have been conveyed to it by the user.

1.2 Primary data processing

What is a text and how is it processed? The primary input data for Lexa are texts. At its simplest, a text is a series of lines, each of which ends with a pre-defined pair of characters. Corpora such as those of the ICAME Collection of English Language Corpora distributed by the Norwegian Computing Centre for the Humanities at Bergen are shipped in simple text form. Such texts can be processed directly by virtually all programmes of the Lexa group; the only exceptions to this rule are the database managers. Tasks such as tagging (automatic, manual or a combination of both), concordance generation, the generation of lexical density profiles and the transfer of text to a database environment are carried out by the main programme Lexa.

Other programmes allow one to edit texts directly, to compare versions with one another, to gain relevant information on the structure of texts, or to view a supplied sample corpus from within a comfortable interface.

If you just want to see what type of tasks the Lexa suite can be used for you might care to try out the demonstration programme, Lexa Demo (choose option 'Run demonstration' in first menu of Lexa Desk while ensuring that you are located in the home directory of Lexa).

1.3 Retrieving information

What kinds of information retrieval are there? In the Lexa suite three main types of information retrieval are possible. The first is the simplest: you are looking for a string and want to know what files it occurs in (use: Lexa Search). The second is much more powerful: you wish to find all strings which match a partially specified input where you may also set a number of search parameters (use: Lexa Pat). Thirdly there is a programme which will allow you to search for syntactic contexts (use: Lexa Context). Here you specify a frame by entering an initial string (or part of one) and a terminal string and set various parameters such as where the frame is expected (line or sentence), how large it can be maximally, manual confirmation of potential finds, etc.
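To give an idea of what a search for a syntactic frame involves, here is a minimal Python sketch. It does not reproduce how Lexa Context works internally; the function find_frames and its parameter max_words are hypothetical, the frame is looked for within a single line only, and words are assumed to be separated by blanks.

```python
def find_frames(lines, initial, terminal, max_words=5):
    """Return (line_number, frame) pairs for spans which begin with a word
    starting with `initial` and end with a word starting with `terminal`,
    with at most `max_words` words in between."""
    hits = []
    for number, line in enumerate(lines, start=1):
        words = line.split()
        for i, first in enumerate(words):
            if not first.lower().startswith(initial):
                continue
            # Look ahead for the terminal word within the allowed distance.
            last = min(i + 2 + max_words, len(words))
            for j in range(i + 1, last):
                if words[j].lower().startswith(terminal):
                    hits.append((number, " ".join(words[i:j + 1])))
                    break
    return hits


lines = ["She has never been to Bergen", "They have been there"]
print(find_frames(lines, "ha", "been", max_words=1))
# [(1, 'has never been'), (2, 'have been')]
```

Setting max_words to 0 in this example would only return the second line, as no intervening word is then allowed between the two ends of the frame.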

Retrieval mainly involves texts. However, it is equally possible to carry out similar operations with databases. Furthermore, retrieval can include the option of replacing finds with some user-specified string or strings.

Information gleaned from texts can be sensitive to the parameters and their values contained in Cocoa-style text file headers as provided for by the Helsinki corpus, for example.
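As an illustration of how header parameters might be read before retrieval, consider the following Python sketch. It assumes that each header line has the Cocoa form of a one-character code followed by a value in angle brackets; the function read_header and the sample lines are invented for the example and are not taken from Lexa or from a corpus file.

```python
import re

# Assumed header line shape: <X value>, X being a one-character code.
HEADER = re.compile(r"^<(\w)\s+([^>]*)>\s*$")


def read_header(lines):
    """Collect code/value pairs from the leading header lines of a text."""
    params = {}
    for line in lines:
        match = HEADER.match(line)
        if not match:
            break  # the first non-header line ends the header
        params[match.group(1)] = match.group(2).strip()
    return params


# Invented sample text, for illustration only.
sample = ["<A AUTHOR X>", "<O 850-950>", "Her onginneð seo boc"]
print(read_header(sample))
# {'A': 'AUTHOR X', 'O': '850-950'}
```

A retrieval routine could then skip or include a file depending on whether a given parameter has the value the user is interested in.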

1.4 Using databases

How to make the most of databases. Most linguists involved in corpus processing will have experience with a word processor and hence be oriented towards data in textual form. The potential to manipulate data within the confines of a text is quite limited, so it is strongly recommended that users of corpora consider the option of transferring part if not all of their data to a database environment for further processing.

Transfer to a database can be achieved most easily by availing of the inbuilt options in the programme Lexa (first option in first menu of Lexa Desk). It allows you to extract all unique word forms from a set of texts, keep a track of their frequencies, generate a reverse dictionary and carry out a variety of grammatical analyses specified by the user on the basis of formal criteria applying to the data contained in the output database.
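The kind of extraction described here can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a word form is any unbroken run of letters, upper and lower case are not distinguished); the function names word_frequencies and reverse_dictionary are hypothetical and do not correspond to options of the programme Lexa.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count every unique word form in a set of texts."""
    counts = Counter()
    for text in texts:
        # A word form is taken to be a run of letters (no digits or '_').
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


def reverse_dictionary(forms):
    """Sort word forms by their spelling read from right to left."""
    return sorted(forms, key=lambda form: form[::-1])


texts = ["The king rode.", "The kings sing."]
counts = word_frequencies(texts)
print(counts["the"])
# 2
print(reverse_dictionary(counts))
# ['rode', 'the', 'king', 'sing', 'kings']
```

A reverse dictionary of this kind brings forms with the same ending together, which is useful when looking at inflectional endings.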

Bear in mind the following rule of thumb: if you have large quantities of data which are structurally similar and which you wish to expand on and still have quick access to and to manipulate it easily, then you should maintain this data in a database environment. You can sort, extract, combine and filter data selectively in database form. And you can always re-export data back into a text environment if you want to (use database managers: DbStat or DbTxt).

1.5 File management

Looking after your files and your hard disk. Last but not least, users of corpora should pay attention to the maintenance of their data in file form on the hard disk of the computer. This is of particular relevance if you make temporary copies of files (as you should) when experimenting with them or indeed if you are engaged actively in the task of corpus compilation. You should store different kinds of data in different directories on the hard disk and be careful to name files in a sensible manner so that you can later recognize from their operating system name what their contents are.

For your convenience, the Lexa suite puts at your disposal a number of powerful tools for file management. First and foremost one should mention the file manager Lexa File, one of the major programmes of the group. You can use it for all file maintenance and as a launching pad for processing tasks if you wish. To move around directories use Lexa Dirs and to gain a quick overview of the contents of a disk, use Lexa Browse. When locating files is an urgent need, use Lexa Find. Lastly remember that there is a configurable shell in the Lexa suite with which you can customize the type of operations you wish to carry out on your corpus data, the programmes you want to use, etc. (use: Lexa Shell).

================================================================

ORDER FORM (June 1995)

I hereby order (please tick one):

_____ Lexa 6.0 : Corpus Processing Software (for MS-DOS, with manuals) 

  Vol 1. Lexical Analysis and Information Retrieval, 303 pages
  Vol 2. Database and Corpus Management, 246 pages
  Vol 3. Utility Library, 210 pages
  Update Documentation, 223 pages

  7 diskettes (3,5" HD)

  Price                                                   NOK  850
  + postage (within Europe: NOK 110.-,
  Rest of the World: NOK 200.-)                           NOK
                                                          -----------           
  Total price                                             NOK    
                                                          ===========

_____ Lexa 6.0 : Corpus Processing Software (for MS-DOS, with update manual) 

  Update Documentation, 223 pages

  7 diskettes (3,5" HD)

  Price (postage included)                                NOK  350
                                                          ===========

_____ Lexa 6.0 : Corpus Processing Software (only update manual) 

  Update Documentation, 223 pages

  Price (postage included)                                NOK  300
                                                          ===========


Pre-payment is required

Way of payment:

__ Cheque (bank draft, cashier's cheque) made out in Norwegian kroner (NOK)
   (or the equivalent in English pounds or U.S. dollars) to:
   The Norwegian Computing Centre for the Humanities, Bergen, Norway.
   (1 USD is approximately 6.5 NOK, use the current exchange rate)
or

__  Bank account no. 7874 06 32077. SWIFT code: DNBANOKK

Return to:

  Norwegian Computing Centre for the Humanities,
  Allégt. 27,
  N-5007 Bergen,
  Norway 

  Fax: +47 55 58 94 70
  Email: icame@hit.uib.no


Name/
Institution:___________________________________________________________	

            ___________________________________________________________                	

Address:    ___________________________________________________________	

            ___________________________________________________________	

            ___________________________________________________________	

Tel.:       __________________________	

Fax:        __________________________	

E-mail:     ___________________________________________________________