Standardization and Automation in the Production and Management of Electronic Text Resources

Milena Slavcheva and Boyanka Zaharieva
Bulgarian Academy of Sciences
Linguistic Modelling Department
25A, Acad. G. Bonchev St.
1113 Sofia, Bulgaria
milena@lml.acad.bg
boby@bgcict.acad.bg


The existence of a collection of electronic texts appropriately represented and organised is a necessary condition for a given natural language to join the global network of use and interchange of language resources.

Faced with the task of creating a large text corpus of Bulgarian, we had to settle two important issues: 1) the standardized representation of the electronic texts, and 2) the automation of the production of text documents in a standard format.

Adherence to internationally established standards for encoding text data has become a requirement in the construction of text databases. SGML (ISO 8879:1986, Information Processing - Text and Office Systems - Standard Generalized Markup Language) has become the generally preferred standard for defining device-independent and system-independent methods of representing texts in electronic form [1]. In the Bulgarian corpus, we specify the markup scheme for different types of texts using an application of SGML, namely the Corpus Encoding Standard (CES) [2], which both draws on and extends the Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative (TEI P3) [3]. CES was developed specifically to serve the needs of corpus-based work in Natural Language Processing.

To facilitate and speed up the complex and time-consuming process of producing encoded texts, we developed TENCO (Text ENCOder), a software system for automatic text markup and for the management of an electronic text database.

A special typology of texts has been worked out so that the structure of a text can be recognized automatically and the appropriate tags assigned to its elements. On entering the database of the system, each text is given a type label determined by its literary form and its specific structural variant.
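As an illustration only, a type label of this kind could be modelled as a pair (literary form, structural variant) that selects a tagging rule. The sketch below is not TENCO's implementation: the class `TextType`, the table `BODY_ELEMENT`, the form and variant values, and the element names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextType:
    """Hypothetical type label: literary form plus structural variant."""
    form: str     # e.g. "prose", "poetry", "drama" (assumed values)
    variant: str  # e.g. "titled", "untitled" (assumed values)


# Assumed rule table: the element given to each body unit of a text,
# keyed by literary form. The real typology is not described in detail.
BODY_ELEMENT = {
    "prose": "p",
    "poetry": "l",
    "drama": "sp",
}


def element_for(text_type: TextType) -> str:
    """Return the element name used for body units of the given type."""
    try:
        return BODY_ELEMENT[text_type.form]
    except KeyError:
        raise ValueError(f"no tagging rule for form {text_type.form!r}") from None


def expects_head(text_type: TextType) -> bool:
    """In this sketch, only the 'titled' variant starts with a head element."""
    return text_type.variant == "titled"
```

Keeping the rules in a lookup table means that a new literary form or variant only requires a new table entry, which is one plausible way to keep such a typology extensible.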

The encoding conforms to CES Level 1, i.e., the gross structure of texts is marked, including division head elements (opener, head, byline, docAuthor, etc.), division closer elements (closer, byline), and paragraph-level elements (paragraph, spoken paragraph, quotation, line, line group, list, note, stage, speaker, etc.). The document type declaration of some sorts of texts is expanded to include several elements from the TEI inventory. This expansion is necessitated by text features that CES does not provide for.
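A minimal sketch of gross-structure markup for a simple prose text is given below. It assumes that blank lines separate structural units and that the first unit of a titled text is its heading; the function name `encode_level1` and the element names `div`, `head`, and `p` are illustrative assumptions rather than TENCO's actual output.

```python
import re
from xml.sax.saxutils import escape


def encode_level1(raw: str, titled: bool = True) -> str:
    """Wrap blank-line-separated blocks of plain text in gross-structure tags."""
    # Split on blank lines, then collapse line breaks inside each block.
    blocks = [" ".join(block.split()) for block in re.split(r"\n\s*\n", raw)]
    blocks = [block for block in blocks if block]

    lines = ["<div>"]
    if titled and blocks:
        # Assumption: the first block of a titled text is its heading.
        lines.append(f"<head>{escape(blocks[0])}</head>")
        blocks = blocks[1:]
    # Escape markup-significant characters so the text cannot break the tags.
    lines.extend(f"<p>{escape(block)}</p>" for block in blocks)
    lines.append("</div>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(encode_level1("A Title\n\nFirst paragraph,\nwrapped.\n\nSecond paragraph."))
```

Verse, drama, and the other paragraph-level elements listed above would need their own recognition rules; the sketch covers only the heading and paragraph case.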

The metatextual section of the text document, the header, is predefined in the form of templates which are instantiated during the operation of the different components of the system. Predefining the header in this way considerably eases the work of the human encoder.
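Template instantiation of this kind can be sketched as follows. The template text, the field names, and the element names are simplified placeholders chosen for illustration; they are not the actual CES header inventory, and `instantiate_header` is a hypothetical helper.

```python
from string import Template
from xml.sax.saxutils import escape

# Placeholder header template; the element names are illustrative only.
HEADER_TEMPLATE = Template(
    "<header>\n"
    "  <title>$title</title>\n"
    "  <author>$author</author>\n"
    "  <textType>$text_type</textType>\n"
    "</header>"
)


def instantiate_header(title: str, author: str, text_type: str) -> str:
    """Fill the predefined header template with values for one text."""
    # substitute() raises KeyError if a field is missing, so an incomplete
    # header is reported instead of being written out silently.
    return HEADER_TEMPLATE.substitute(
        title=escape(title),
        author=escape(author),
        text_type=escape(text_type),
    )


if __name__ == "__main__":
    print(instantiate_header("Sample Title", "Sample Author", "prose"))
```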

The database management module of the system provides a user-friendly interactive mode for entering identifying and bibliographic information about each electronic text, and for querying the database about qualitative and quantitative properties of the textual data.
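The kind of quantitative query mentioned here can be sketched with a small relational catalogue. The schema, the column names, and the functions below are assumptions made for illustration; nothing is implied about how TENCO actually stores its data.

```python
import sqlite3


def open_catalogue() -> sqlite3.Connection:
    """Create an in-memory catalogue with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE texts ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " author TEXT,"
        " text_type TEXT NOT NULL,"
        " word_count INTEGER NOT NULL)"
    )
    return conn


def add_text(conn, title, author, text_type, word_count):
    """Record bibliographic data for one text and return its row id."""
    cur = conn.execute(
        "INSERT INTO texts (title, author, text_type, word_count)"
        " VALUES (?, ?, ?, ?)",
        (title, author, text_type, word_count),
    )
    conn.commit()
    return cur.lastrowid


def totals_by_type(conn):
    """Map each text type to (number of texts, total number of words)."""
    rows = conn.execute(
        "SELECT text_type, COUNT(*), SUM(word_count)"
        " FROM texts GROUP BY text_type ORDER BY text_type"
    )
    return {text_type: (n, words) for text_type, n, words in rows}
```

A call such as `totals_by_type(conn)` answers a quantitative question (how much text of each type the corpus holds), while qualitative queries would filter on bibliographic columns such as `author`.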

TENCO produces text documents that conform to current, well-established encoding practice in Europe and the USA. This is evidenced by the mapping of the markup scheme of our encoded texts to the results of the validation of language corpora carried out by the ELRA Corpus Validation Group [4].

The system fulfills the task, beneficial for Bulgarian, of producing a text corpus that adheres to the internationally defined requirements for corpus construction and representation. With slight modifications, TENCO can be applied to other languages as well, thus becoming a multilingual tool for the production of language resources.

The system will be developed further to mark up texts at CES Levels 2 and 3, i.e., at the sub-paragraph level and at the level of morphosyntactic annotation. The results will be linked to other implementations in the field of language technology.



 
References
[1] A Gentle Introduction to SGML. (http://etext.virginia.edu/TEI.html)

[2] Corpus Encoding Standard. (http://www.cs.vassar.edu/CES/)

[3] TEI Guidelines for Electronic Text Encoding and Interchange. (http://etext.virginia.edu/TEI.html)

[4] Baker P., L. Burnard, A. McEnery, A. Wilson. An analytic framework for validation of language corpora. Report for the ELRA Corpus Validation Group, 1997. (http://www.icp.grenet.fr/ELRA/valid/wman/)