MANUAL OF INFORMATION
TO ACCOMPANY
THE KOLHAPUR CORPUS OF
INDIAN ENGLISH, FOR USE WITH
DIGITAL COMPUTERS


BY
S. V. SHASTRI
IN COLLABORATION WITH
C. T. PATILKULKARNI
GEETA S. SHASTRI

DEPARTMENT OF ENGLISH
SHIVAJI UNIVERSITY, KOLHAPUR
416004 INDIA

1986

TO
PROFESSOR S. K. VERMA

PREFACE

The present corpus of Indian English was conceived in Lancaster in 1978 when the main author of this work was researching under the supervision of Professor G.N. Leech. On his return to India he started the project with an initial grant from Shivaji University in 1980, and carried it forward with substantial financial assistance from the U.G.C., supplemented by support from various other sources including personal funds.
We gratefully acknowledge the financial assistance of Shivaji University and the University Grants Commission. We would like to acknowledge the support given by Chhatrapati Shahu Institute’s Vasantdada Professional Computer School in the form of access to their computer system at nominal charges, thanks to the kindness of their EDP manager, S.M. Kori. We would also like to thank the large number of people who have worked on the project at various stages. Among them are V.V. Badve and P.R. Kher, who were associated with the project in the initial stages of collection of samples, and J.A. Shinde and D.N. Kulkarni, who assisted in proofreading some of the texts.
We thank Geeta S. Shastri for her secretarial assistance and, together with C.T. Patilkulkarni, for sharing the bulk of the pre-editing and proofreading of texts.
We also thank Sumati Salunkhe, S.T. Shingate, B.N. Patil, Raju Chougule, A.Y. Shinde and Ganesh Surve, who helped with secretarial and proofreading assistance at various stages.
Special thanks are due to Professor S.K. Desai, Head, Department of English, Shivaji University, for his encouragement and guidance, also in his capacity as member of the Advisory Committee. We thank the other members of the Committee, Professor Birje-Patil of M.S. University of Baroda, Professor C.J. Daswani of the University of Poona and Professor K. Subramanian of the Central Institute of English and Foreign Languages, for their unfailing support.
We are also grateful to Professor G.N. Leech, University of Lancaster, Professor W. Nelson Francis and Professor Henry Kucera, both of Brown University and Professor Stig Johansson, University of Oslo for their guidance from time to time throughout the duration of the project, and to Prof. Ramesh Mohan the former Director of the CIEFL and Professor R.N. Ghosh also of the CIEFL, for their continued support and help.
And in conclusion, we would like to thank all the copyright holders who allowed their texts to be included, free of charge, in the corpus.

Kolhapur December, 1986. S. V. Shastri

CONTENTS

Introduction
Sources and Sampling Techniques

Distribution of the material

The American, British and Indian Corpora compared

Coding key

The material and its organization

Basic Technical Information

Note on copyright

References

List of Text Extracts

Introduction

A systematic and comprehensive description of Indian English is now overdue. Of the major national varieties of English, only American and British English have so far been described in some detail, though several other varieties have already been identified among the native speaker varieties. Side by side, some non-native varieties of English have also been tacitly recognized, among which Indian English is a major one.

Studies of Indian English so far have been confined mainly to aspects of spoken English such as by Bansal of CIEFL, Bansal (1969), and methodological considerations such as by Kachru of Illinois University Kachru (1961, 1979), although Kachru himself has written on many isolated areas of Indian English, Kachru (1965, 1975, 1981)1. Descriptions of aspects of Indian English wherever they have appeared have been based on selected or ‘available’ samples such as Desai (1974) and Kachru (1965, 1975), the largest one so far being Nihalani et al (1979). There is no gainsaying that a comprehensive description will have to be based on a standard corpus.

The present corpus of Indian Written English2 is comparable to the Brown and the LOB corpora. It is intended to serve as source material for comparative studies of American, British and Indian English, which in turn are expected to lead to a comprehensive description of Indian English.

Although the Indian Corpus is planned to be comparable to the Brown and the LOB corpora, there are some important differences dictated mainly by logistic and practical considerations. Firstly, as far as synchronicity is concerned, there is a major departure: while the Brown and LOB corpora draw their samples from materials published in the calendar year 1961, the Indian corpus is drawn from materials published in the year 1978. This decision was made after consultation with the authors of the earlier corpora to make sure that comparability would not suffer much as a result of this.3 On the other hand, it is felt that the value of the Indian corpus is immensely enhanced in general, and in particular as a source for the description of Indian English, as the Indianness of Indian English is a post-Independence phenomenon and may have reached a discernible stage in the thirty years after Independence. It is argued in theory that in the same thirty years American and British English may not have undergone such changes. The number of texts, the weightage given to different genres of material and the sampling procedures are kept very close to those of the other two corpora. However, in one part, that of Imaginative Prose, there are differences in respect of the kinds of fiction and the proportion of texts representing books and those representing periodicals. It is not surprising to find that the amount and kind of imaginative writing in a second language situation such as that in India is very different from that in a first language one such as the American or the British situation. In spite of including samples from all the available full length novels, the proportion could not come anywhere near those of the LOB or the Brown Corpus. Some of us felt that the weightage should be reduced in order to reflect at least in part the real situation. But it was argued that this might adversely affect comparability and the value of the corpus for its projected purpose.

Sources and Sampling Techniques

The Indian Corpus is intended to be a representative corpus of sample texts printed and published in 1978. The texts were largely selected by a stratified random sampling process. The composition of texts in the Indian corpus as compared to that in the other two corpora is given in Table No. 1.

Table 1: The Basic Composition of American, British and Indian Corpora.

Text Categories                             No. of texts in each category
                                            American   British    Indian
                                            Corpus     Corpus     Corpus

A Press: reportage                          44         44         44
B Press: editorial                          27         27         27
C Press: reviews                            17         17         17
D Religion                                  17         17         17
E Skills, Trades and Hobbies                36         38         38
F Popular lore                              48         44         44
G Belles Lettres                            75         77         70
H Miscellaneous (Govt. Documents,           30         30         37
  foundation reports, industry reports,
  College catalogue, industry house organ)
J Learned and scientific writings           80         80         80
K General fiction                           29         29         58
L Mystery and detective fiction             24         24         24
M Science fiction                           6          6          2
N Adventure (Western fiction)               29         29         15
P Romance and love story                    29         29         18
R Humour                                    9          9          9

Total                                       500        500        500
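The arithmetic behind Table 1 can be checked mechanically. The following is only a minimal sketch: the figures are transcribed from the table, while the names `TABLE_1`, `column_totals` and `differing_categories` are illustrative and not part of the corpus distribution. It confirms that each corpus sums to 500 texts and lists the categories in which the Indian corpus departs from the British one.

```python
# Counts per category transcribed from Table 1: (American, British, Indian).
TABLE_1 = {
    "A": (44, 44, 44), "B": (27, 27, 27), "C": (17, 17, 17),
    "D": (17, 17, 17), "E": (36, 38, 38), "F": (48, 44, 44),
    "G": (75, 77, 70), "H": (30, 30, 37), "J": (80, 80, 80),
    "K": (29, 29, 58), "L": (24, 24, 24), "M": (6, 6, 2),
    "N": (29, 29, 15), "P": (29, 29, 18), "R": (9, 9, 9),
}


def column_totals(table):
    """Sum each corpus column across all categories."""
    return tuple(sum(column) for column in zip(*table.values()))


def differing_categories(table, a=1, b=2):
    """Return the categories whose counts differ between two corpus columns."""
    return [cat for cat, counts in table.items() if counts[a] != counts[b]]


print(column_totals(TABLE_1))         # one total per corpus
print(differing_categories(TABLE_1))  # British (column 1) against Indian (column 2)
```

The second call singles out G, H and the fiction categories, which are the ones discussed in the comparison of the three corpora.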

The sampling procedure followed is described below:

Books: While the compilers of the Brown and the LOB corpora had at their disposal ready bibliographies from which they could sample, we were handicapped in this regard. The Indian National Bibliography - INB monthly - lists all the publications received in the National Library under the provisions of the Delivery of Books (Public Libraries) Act of 1954 as amended by Act No. 99 of 1956. But these issues take a long time to appear, and in fact those for 1978 had not appeared even by the end of 1979. So it was decided to compile a bibliography of our own of books printed in English in 1978 which had already been received in the National Library, Calcutta, up to December 1979, when the work actually began.4 A list of all such books was compiled from the Inward Register of the National Library and this was used as the source bibliography for the purpose of sampling. Again, we could not have recourse to stratification by reference to the Dewey Decimal classification, as many of the books had not been processed by the library. So we stratified the publications by manual inspection of titles initially. It must be mentioned that such a procedure was possible because of the limited number of publications in the second language situation as compared to that in a native language situation such as the American and the British.
It was found by inspection of earlier entries that about 75% of the publications of a particular year were received by the end of the following year and the remainder kept trickling in for 3 or more years. In order to make up for whatever deficiency might have been caused by our listing at the end of 1979, we repeated the exercise at the end of 1980 and 1984 and found that only about 10% more publications had arrived. It may be mentioned in passing that our 140 texts from books covering all the categories were sampled from over 1200 titles which amounts to about 8%. The sampling was done from the lists separately with the help of a random number table. Needless to say, whenever a selected book was not accessible the next available on the shelf was selected. Whenever a sufficient number of texts did not turn up in this process, other texts were deliberately chosen to fill the category.

Government Documents: As in the case of books, no catalogue of Government Publications in 1978 was available up to the end of 1980 and therefore the same procedure of listing from the Inward5 Register and random sampling was followed. In this case the job was simplified because most of the publications of the Govt. of India are in English. Random sampling and filling out of the texts required for various sub-categories were carried out on the lines followed in the case of books.
It will be noticed that we have 37 texts in category H as compared to 30 of the American and British. The decision to alter the number was influenced by the fact that there are in India two types of Govt. Documents, the Union and the State ones. We have included 26 texts from the Union Government Documents and 7 texts from State Government Documents. It may be mentioned also that the bulk of book publication of proportional or even greater weightage could not be seriously considered in view of our commitment to the original design.

Press Materials: The sampling of newspapers and of particular issues of newspapers was carried out first. The 53 English newspapers received in the Central Library6, Bombay were considered to be the universe, out of which all the 6 national papers were retained and 15 regional papers were selected to represent the different regions of India. From each of these newspapers, 16 daily issues and 4 Sunday editions were sampled with the help of a random number table. Of the required issues several were not in the files of the library, so as usual the next number was taken. From these the actual texts required were identified and the categories filled out.
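The issue-level sampling described above can be sketched in a few lines. This is a hedged illustration, not the procedure actually used: a seeded pseudo-random generator stands in for the printed random number table, "daily issues" is assumed to mean Monday to Saturday issues, and the function names and the example of a missing issue are hypothetical.

```python
import random
from datetime import date, timedelta


def issues_in_year(year):
    """Split every date of the year into weekday issues and Sunday editions."""
    day, dailies, sundays = date(year, 1, 1), [], []
    while day.year == year:
        (sundays if day.weekday() == 6 else dailies).append(day)
        day += timedelta(days=1)
    return dailies, sundays


def sample_issues(year, n_daily, n_sunday, available, rng):
    """Draw issues at random; if one is not on file, take the next one that is."""
    dailies, sundays = issues_in_year(year)
    chosen = []
    for pool, n in ((dailies, n_daily), (sundays, n_sunday)):
        for pick in rng.sample(pool, n):
            idx = pool.index(pick)
            # Step forward through the same pool until an unused issue is on file.
            for _ in range(len(pool)):
                if pool[idx] in available and pool[idx] not in chosen:
                    break
                idx = (idx + 1) % len(pool)
            else:
                raise ValueError("not enough issues on file")
            chosen.append(pool[idx])
    return chosen


dailies, sundays = issues_in_year(1978)
on_file = set(dailies + sundays) - {date(1978, 3, 14)}  # pretend one issue is missing
picked = sample_issues(1978, 16, 4, on_file, random.Random(1978))
print(len(picked), "issues selected for one newspaper")
```

Fixing the seed makes the draw repeatable, which plays the same role as recording the starting point in a random number table.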

Periodicals: Initially an attempt was made to compile sample lists of periodicals from the Source Book Press in India 1977 and follow the same procedure for sampling texts as we had done in the case of Books and Press Materials. However, it was discovered that such a procedure tended not only to exclude most or all of the most ‘popular’ and well circulated periodicals but threw up those that were either ‘unheard of’ or ‘unavailable’ or had ceased to be published. In the circumstances we took recourse to the other procedure, i.e. that of treating the holdings of the Central Library in Bombay and the National Library in Calcutta as the universe from which to draw the samples. This proved to be wise as far as ‘popular periodicals’ were concerned; but in the case of learned journals we had to follow a different procedure.
On an impressionistic basis, it was decided to pick on the richest known libraries. The holdings of the Tata Institute of Fundamental Research and the Indian Institute of Technology, Bombay were used for sampling materials pertaining to Sciences and Technology, and the holdings of the Tata Institute of Social Sciences for materials pertaining to social sciences, in addition to the holdings of the University libraries of Bombay, Poona, Shivaji and Baroda. The actual procedure followed was to compile category-wise select bibliographies and then sample the texts required to fill out the categories.
From the foregoing description of selection procedures followed in building the corpus it is clear that the corpus cannot be claimed to be a "stratified random sample" of Indian English in the strict statistical sense. We, like the builders of the LOB corpus were guided by our aims - of ensuring maximum comparability with the other two corpora and creating a truly representative sample of edited and published Indian English writings.

Distribution of the material

The distribution of texts over different categories and the matching of individual texts have been kept closer to the LOB than to the Brown corpus. The widest difference is to be found in the weighting given to categories in the section Imaginative Prose. This was inevitable as the available texts in the categories L to P were short of even the number required. The weighting given to short stories as against full length novels is also the result of the same handicap.
However, in the case of all the other categories the differences are very marginal, as can be seen from the following break-up:

Category A (Press: reportage)

A01-06 National daily Political
A07-08 " " Sports
A09-11 " " Society
A12-15 " " Spot news
A16-17 " " Financial
A18-19 National weekly Political
A20-21 " Sunday Sports
A22 " " Spot news
A23 " " Financial
A24-26 " " Social/cultural
A27-31 Regional daily Political
A32-33 " " Sports
A34-37 " " Spot news
A38 " " Financial
A39-40 " " Cultural
A41 " weekly Sports
A42 " " Society
A43 " " Spot news
A44 " " Cultural

Category B (Press: editorial)

B01-06 National daily Institutional editorial
B07-08 " " Personal editorial
B09-11 " " Letters to the editor
B12-14 " Sunday Institutional editorial
B15 " " Personal editorial
B16 " " Letters to the editor
B17-22 Regional daily Institutional editorial
B23-24 " " Letters to the editor
B25-26 " Sunday Institutional editorial
B27 " " Letters to the editor

Category C (Press: reviews)

C01-06 National daily Book, music, cinema, painting, folk art etc.
C07-13 " Sunday  
C14 " weekly  
C15-16 Regional daily  
C17 " Sunday  

The names of the newspapers and the details of texts drawn from each are shown in Table No. 2.

Category D (Religion)

D01-08 Books
D09-17 Periodical and Journals

Category E (Skills, trades and hobbies)

E01-05 Homecraft, handyman
E06-10 Hobbies
E11-13 Music, dance
E14 Pets
E15-18 Sports
E19-20 Food
E21-22 Travel
E23-26 Miscellaneous
E27-35 Trade, professional journals
E36-38 Agriculture, farming

Category F (Popular lore)

F01-22 Popular Politics, psychology, sociology
F23-30 Popular History
F31-33 Popular Health, medicine
F34-37 "Culture"  
F38-44 Miscellaneous  

Category G (Belles lettres, biography, essays)

G01-35 Biography, memoirs
G36-41 Literary essays and criticism
G42-50 Arts
G51-70 General essays

Category H (Miscellaneous)

H01-26 Central Government document
  A H01-12 Reports, department publications
  B H13-14 Acts
  C H15-20 Proceedings, debates
  D H21-26 Other Government documents
H27-32 State Government documents
H33-37 Industry reports, house organ, University catalogue.

Category J (Learned and scientific writings)

J01-12 Natural and physical sciences
J13-17 Medicine
J18-21 Mathematics
J22-35 Social, behavioural sciences
  A J22-25 Psychology
  B J26-30 Sociology
  C J31 Demography
  D J32-35 Linguistics
J36-50 Political Science, law, education, commerce
  A J36-39 Education
  B J40-47 Politics, economics, commerce
  C J48-50 Law
J51-68 Humanities
  A J51-55 Philosophy
  B J56-59 History
  C J60-63 Literary criticism
  D J64-66 Art
  E J67-68 Music
J69-80 Engineering and technology

Category K (General fiction)

K01-12 Novels
K13-58 Short stories

Category L (Mystery and detective fiction)

L01-03 Novels
L04-06 Short stories
L07-11 Novels
L12-23 Short stories
L24 Novel

Category M (Science fiction)

M01-02 Short stories

Category N (Adventure)

N01 Short story
N02-04 Novels
N05-13 Short stories
N14 Novel
N15 Short story

Category P (Romance and love story)

P01-02 Short stories
P03 Novel
P04-07 Short stories
P08 Novel
P09-15 Short stories
P16 Novel
P17-18 Short stories

Category R (Humour)

R01-05 Short stories
R06 Book
R07-09 Articles from periodicals
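The range notation used in the break-up above (for example K13-58) can be expanded into individual text numbers and counted. The sketch below is illustrative only; `expand` and `count` are hypothetical helper names, and the two dictionaries simply transcribe the entries given for categories K and N.

```python
import re

# One category letter, a two-digit text number, and an optional end of range.
RANGE = re.compile(r"^([A-Z])(\d{2})(?:-(\d{2}))?$")


def expand(spec):
    """Expand 'K13-58' into ['K13', ..., 'K58']; a single ID such as 'N01' yields one item."""
    match = RANGE.match(spec)
    if match is None:
        raise ValueError(f"not a text range: {spec!r}")
    cat = match.group(1)
    first = int(match.group(2))
    last = int(match.group(3) or match.group(2))
    return [f"{cat}{n:02d}" for n in range(first, last + 1)]


def count(breakdown):
    """Count the texts listed under each label of a category break-up."""
    return {label: sum(len(expand(s)) for s in specs)
            for label, specs in breakdown.items()}


# Entries transcribed from the break-up of categories K and N.
K = {"Novels": ["K01-12"], "Short stories": ["K13-58"]}
N = {"Novels": ["N02-04", "N14"], "Short stories": ["N01", "N05-13", "N15"]}

print(count(K))
print(count(N))
```

Counts obtained this way can be set against the per-category figures in the comparative tables.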

The American (Brown), British (LOB) and Indian Corpora compared

Categories A-C: In terms of weighting between national and regional newspapers the Indian corpus conforms more closely to the British. This has been deliberate more so because, like the British situation, Indian newspapers have a clear-cut distinction between the national and regional on the basis of both distribution and circulation figures. The proportion of texts drawn from the National and the Regional papers is 62% to 38% in the Indian corpus as compared to the British 60% to 40% (see table 2). As to the sub-categories of texts and their distribution over dailies, Sundays and weeklies, there is even more correspondence between the two (see table 3), except that the two sub-categories society and cultural had to be collapsed into one as no such hard and fast distinction could be observed in the newspaper reportage of the Indian Press. The other very marginal difference is that there are fewer personal editorials in the Indian corpus.

Category D: In terms of subcategories Religious prose is not classified either in Brown or in LOB corpus; but while sampling texts the builders of LOB corpus have by inspecting the Brown texts arrived at a decision to include ‘stylistically heterogenous texts ranging from learned to popular committed writing’. The same procedure was followed in selecting texts for the Indian corpus except that the sub-division ‘tracts’ is unrepresented. The distribution of texts from books and periodicals is nearly maintained (see table 4).

Categories E-J: Sub-categories of texts in the Indian corpus have been matched almost perfectly with the LOB corpus. However, it was not always possible to match the individual texts in terms of the type of source, book or periodical for materials in the three categories E, F and G. While books are over-represented in the case of category E, they are somewhat under-represented in the case of categories F and G (see table 4). As already stated earlier we deliberately altered the weighting between category G and H. G has been reduced by seven texts and H increased by the same number. This was done to reflect the Indian situation in which, firstly, Government documents divide themselves into Central and State govt. documents and the bulk of those far exceeds any other printed material in English except Press materials. This has been reflected in the greater representation of government documents in the Indian corpus. And foundation report is unrepresented (see table 4). J texts have been matched very closely in all the three corpora. This is the only category which has one to one correspondence of weighting to sub-categories (see table 4).

Categories K-R: This section of the corpus, that is Imaginative Prose, is maximally mismatched. As already stated, a sufficient number of texts was simply not available; the possible consequences of this were discussed with the experts in the field and it was felt that comparability would suffer only marginally. We repeat some of the points here. The sub-categories K, L, M, N and P, all representing fiction, form a sort of cline running through general fiction (K), mystery and detective (L), science fiction (M), adventure and Western fiction (N) and romance and love story (P). The classification is based on the theme/treatment and very often is bound to be overlapping; and the interest of the corpus compiler is ‘style’. In the process of sampling, it is quite possible for the portion of the work selected as text to run wide of the mark, especially in the case of novels. In view of all this, firstly, the sub-categories were defined negatively, that is, whatever was not broadly speaking L, M, N or P was considered to be K; and the selected texts, especially from novels, were inspected and placed in the categories so defined. In the case of short stories the question did not arise. The fact remains that the number of texts is matched only in the categories L and R. In the case of K we have double the number, i.e. 58 in place of 29; science fiction only 2 as against 6; adventure only 15 in place of 29; and romance and love story only 18 in place of 29. Again, mystery and detective is for the West largely detective and mystery surrounding death, murder etc., but for the Indian it includes other kinds of mystery in the sense of ‘mysterious’ or miraculous. Similarly in the case of adventure and Western fiction, there is nothing at all corresponding to ‘western fiction’ in India. So the sub-category consists wholly of ‘adventure’.

Now it must be stated that if the Imaginative Prose section in the Indian corpus is not on the face of it quite comparable to that of Brown or LOB, it is designed to be truly representative of Indian English.

Table No. 2 - Details of number of texts drawn from different newspapers.

No.  Name of the newspaper                   Number of texts drawn
                                             daily   Sunday/weekly   Total

(National newspapers)
1    The Hindu, Madras                       9       2               11
2    Economic Times, New Delhi               1       2               3
3    The Statesman, Calcutta                 5       1               6
4    The Hindustan Times, New Delhi          4       4               8
5    The Times of India, Bombay              11      8               19
6    The Indian Express (various editions)   3       3               6
     Sub-totals                              33      20              53

(Regional newspapers)
1    Business Standard, Calcutta             3       -               3
2    Deccan Herald, Bangalore                2       1               3
3    The Tribune, Chandigarh                 3       1               4
4    National Herald, Lucknow                1       1.5             2.5
5    Searchlight, Patna                      1       1               2
6    The Assam Tribune, Gauhati              2       -               2
7    Amrit Bazar Patrika, Calcutta           1       -               1
8    Deccan Chronicle, Secunderabad          3       -               3
9    The Western Times, Ahmedabad            1       -               1
10   Madhya Pradesh Chronicle, Bhopal        2       -               2
11   Nagpur Times, Nagpur                    2       0.5             2.5
12   Navhind Times, Panaji                   1       0.5             1.5
13   Northern India Patrika, Allahabad       -       1.5             1.5
14   Poona Herald, Poona                     3       -               3
15   Blitz Weekly, Bombay                    3       -               3
     Sub-totals                              28      7               35

     Grand Totals                            61      27              88

Table 3. Categories A-C: The American, British and Indian corpora compared

 

                        American corpus               British corpus                                    Indian corpus
                                                      National            Provincial                    National            Regional
A. Press: Reportage     daily     weekly    Total     daily     Sunday    daily     weekly 1) Total     daily     Sunday    daily     Sunday    Total
Political               10        4         14        6         2         5         -         13        6         2         5         -         13
Sports                  5         2         7         2         2         2         1         7         2         2         2         1         7
Society                 3         -         3         2         -         -         1         3         3         3         2         2         10 2)
Spot news               7         2         9         4         1         4         1         10        4         1         4         1         10
Financial               3         1         4         2         1         1         -         4         2         1         1         -         4
Cultural                5         2         7         3         1         2         1         7                                                 -
Total                                       44                                                44                                                44

B. Press: Editorial
Institutional           7         3         10        4         2         3         1         10        6         3         6         1         16
Personal                7         3         10        4         2         3         1         10        2         1         -         1         4
Letters to the editor   5         2         7         3         1         2         1         7         3         1         2         1         7
Total                                       27                                                27                                                27

C. Press: reviews       14        3         17        6         5 3)      2         1         17        5         9         2         1         17
Total                                       17                                                17                                                17

1) Including Provincial Sunday

2) Including "cultural"

3) The Times Literary Supplement and The Times Educational Supplement

Table 4- Categories D-J: The American, British and Indian Corpora compared

   

  American corpus   British corpus   Indian corpus

D. Religion      
  Books 7 9 8
  Periodicals 6 7 9
  Tracts 4 1 -
E. Skills, Trades and Hobbies      
  Books 2 5 9
  Periodicals 34 33 29
F. Popular Lore      
  Books 23 16 10
  Periodicals 25 28 34
G. Belles Lettres etc.      
  Books 38 41 29
  Periodicals 37 36 41
H. Miscellaneous      
  Govt. Documents 24 24 32
  Foundation Reports 2 2 -
  Industry Reports 2 2 2
  Univ. catalogue 1 1 1
  Ind. House Organ 1 1 2
J. Learned      
  Natural Sciences 12 12 12
  Medicine 5 5 5
  Mathematics 4 4 4
  Soc. Sciences 14 14 14
  Pol. Science, Law, Education 15 15 15
  Humanities 18 18 18
  Technology and Engineering 12 12 12

Table 5 - Categories K-R: The American, British and Indian corpora compared

   

  American corpus   British corpus   Indian corpus

K. General Fiction      
  Novels 20 20 12
  Short stories 9 9 46
L. Mystery and Detective Fiction      
  Novels 20 21 9
  Short stories 4 3 16
M. Science Fiction      
  Novels 3 3 -
  Short stories 3 3 2
N. Adventure and Western      
  Novels 15 15 4
  Short stories 14 14 11
P. Romance and Love Story      
  Novels 14 16 3
  Short stories 15 13 15
R. Humour      
  Novels 3 3 -
  Short stories - - 5
  Essays, etc. 6 6 3
  Books - - 1

Coding Key:

Alphanumeric characters represent themselves. In the case of alphabet symbols, the letter represented is lower case unless otherwise specified:

A = a B = b C = c Etc.
1 = 1 2 = 2 3 = 3 Etc.

Letter preceded by * = the same letter (word initial Capital)

Letter preceded by ^ = the same letter (sentence initial Capital)

NB - both ^ and * are used when a sentence initial capital coincides with word initial capital.

*A = A Word initial only

^B = B Sentence initial only

^*J = J Word and sentence initial at the same time e.g. as in: John was blind.
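The capitalization rule can be illustrated with a small decoder. This is a minimal sketch under stated assumptions: the sentence-initial marker is taken to be ^, as it appears in the sample print-out, and the function name `decode_capitals` is hypothetical. Compound codes that also begin with * (typeshifts, apostrophes, headings and so on) are deliberately not handled here.

```python
def decode_capitals(coded, sentence_mark="^", word_mark="*"):
    """Restore upper and lower case from the capital markers of the coding key.

    Letters are lower case unless preceded by a capital marker. Only the two
    capital markers are interpreted; every other character is copied through.
    """
    out = []
    capitalise = False
    for ch in coded:
        if ch in (sentence_mark, word_mark):
            capitalise = True            # applies to the next letter met
        elif ch.isalpha():
            out.append(ch.upper() if capitalise else ch.lower())
            capitalise = False
        else:
            out.append(ch)               # digits, spaces, punctuation stand for themselves
    return "".join(out)


print(decode_capitals("^*JOHN WAS BLIND."))
print(decode_capitals("THE *HINDU, *MADRAS"))
```

A full decoder would first have to separate the compound codes from the plain capital markers, since both use the asterisk as a prefix.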

Other Characters:

* is reserved as a prefix for a compound coding symbol. When not preceded by *, all other characters represent themselves except for £, $, ^, ¬

that is to say:

1 . = . Full stop
2 : = : Colon
3 ; = ; Semi-colon
4 , = , Comma
5 " = " Double quotes: begin and end quotes are distinguished by a space before and after respectively
6 ‘ = ‘ Single quotes (but not apostrophe) begin and end quotes are distinguished by a space before and after respectively
7 ? = ? Question-mark
8 ! = ! Mark of exclamation
9 - = - Minus when separated by spaces on either side, hyphen when not so separated.
10 -- = -- Dash
11 % = % Per cent
12 & = & (and)
13 ( = ( Left brace
14 ) = ) Right brace
15 + = + Plus
16 / = / Slash, oblique
17 [ = [ Left bracket
18 ] = ] Right bracket
19 @ = @ At ( the rate of)
20 = = = Equals
21 Space = space
22 > = >  
23 < = <  
24 x = x into (represents ‘multiplied by’ when separated by spaces)
25 ¬ = grammatically marked (always follows the word)
26 ^ = sentence initial capital
27 * = Word initial capital. (N.B. both ^* occur when word initial capital coincides with sentence initial capital)
28 £ = begin non English word
29 $ = new paragraph or new line

Compound Coding

*' = apostrophe
**< = begin major heading
**> = end major heading
*< = begin minor heading
*> = end minor heading
*@ = IBM ASCII 248 (degree symbol)
*+ = £ (pound)
*- = $ (dollar)
*/ = * (asterisk)
*# = end of corpus text
**# = end of corpus
*? = uncoded character (see below)
**[ = begin comment tag
**] = end comment tag
*= = upper case Roman numeral
**= = lower case Roman numeral
*; = begin subscript
**; = end subscript
*: = begin superscript
**: = end superscript
*( = begin hybrid word/expression
*) = end hybrid expression
\0 = abbreviations. Sequences of abbreviations or initials are enclosed in *(0 *)
*¬ = included sentence
*% = ‰ (per thousand)

Type-shift

*0 = end typeshift
*1 = begin typeshift for citation
*2 = begin capitalization
*3 = begin typeshift for highlighting emphasis (including italics)
*4 = begin Indian Language word
*5 = begin Indian Language expression or passage
*6 = end Indian Language expression or passage
*7 = begin foreign word
*8 = begin foreign expression
*9 = end foreign expression
*?0 = . (dot) under the preceding character (to indicate retroflex)
*?00 = . (dot) on the preceding character (to indicate retroflex)
*?1 = - (macron) on the preceding character (to indicate vowel length)
*?2 = ´ (acute accent) on the preceding character (to indicate vowel length)
*?3 = ` (grave accent) on the preceding character (to indicate vowel length)
*?4 = ~ (tilde) on the preceding character (to indicate vowel length)

(NB: The code *[ applies to languages that use the Roman alphabet. The following apply to all foreign languages irrespective of what script they use.)

Foreign language material in non-Roman alphabet transcribed in Roman script

*[1 = begin Assamese material
*[2 = begin Bengali material
*[3 = begin Gujarati material
*[4 = begin Hindi material
*[5 = begin Kannada material
*[6 = begin Kashmiri material
*[7 = begin Malayalam material
*[8 = begin Marathi material
*[9 = begin Oriya material
*[10 = begin Punjabi material
*[11 = begin Sanskrit material
*[12 = begin Sindhi material
*[13 = begin Tamil material
*[14 = begin Telugu material
*[15 = begin Urdu material

Interpretive Codes:

I Apostrophe *'

*' = apostrophe for possessive e.g. John's book = *JOHN*'S BOOK; Students' Union = STUDENTS*' UNION
*'1 = contracted form of is e.g. John's coming = *JOHN*'1S COMING
*'2 = contracted form of has e.g. John's been ill = *JOHN*'2S BEEN ILL
*' = apostrophe for contracted form of had e.g. He'd done well = HE*'D DONE WELL
*'1 = contracted form of would e.g. He'd stand for hours = HE*'1D STAND FOR HOURS.
*'3 = other contractions e.g. d'Estaing = D*'3*ESTAING; 'Tis a pity = *'3TIS A PITY
*'4 = contraction of not e.g. don't = DON*'4T
*'5 = abbreviation for minutes (degree & mins) e.g. 4° 30' = 4*@ 30*'5
*'6 = abbreviation for foot/feet e.g. 3' = 3*'6
*'7 = notation for glottal fricative
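The interpretive apostrophe codes lend themselves to a simple lookup. The following is a rough sketch only: it assumes the apostrophe is stored as the plain ASCII character ', the names `APOSTROPHE_CODE`, `MEANING` and `read_apostrophes` are illustrative, and no attempt is made to tell apart readings that share one code (for instance is and would under *'1).

```python
import re

# "*'" optionally followed by one interpretive digit, as listed in the coding key.
APOSTROPHE_CODE = re.compile(r"\*'([1-7]?)")

MEANING = {
    "": "possessive, or contracted 'had'",
    "1": "contracted 'is' or 'would'",
    "2": "contracted 'has'",
    "3": "other contraction",
    "4": "contracted 'not'",
    "5": "minutes",
    "6": "foot/feet",
    "7": "glottal fricative",
}


def read_apostrophes(coded):
    """Return (text with plain apostrophes, interpretations in order of occurrence)."""
    tags = [MEANING[m.group(1)] for m in APOSTROPHE_CODE.finditer(coded)]
    return APOSTROPHE_CODE.sub("'", coded), tags


print(read_apostrophes("DON*'4T"))
print(read_apostrophes("HE*'1D STAND FOR HOURS."))
```

A possessive apostrophe immediately followed by a digit from 1 to 7 in running text would be misread by this pattern, so a production decoder would need more context than a single regular expression.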

II Left arrow (¬ )

(1) To as preposition is unmarked.

To as infinitive marker is marked with a left arrow e.g. to go = TO¬ GO

(2) That as a subordinating conjunction is unmarked; that as any other is marked e.g. that day = THAT DAY;

The man that you spoke of = THE MAN THAT¬ YOU SPOKE OF

Greater than that of the other = GREATER THAN THAT¬ OF THE OTHER

III Double Quotes ( " )

*" = inches e.g. 6' 3" = 6*'6 3*"

Greek letters are marked by a preceding **Y for lower case and **Z for upper case characters and are represented by Roman letters as follows:

Other notations

Mathematical symbols

**MN = mathematical notation e.g. S1 , T1

**MS = mathematical symbol e.g. =, ±, ≠, →

**MF = mathematical conventional figure 106, 8-5

**ME = mathematical equation

The material and its organization.


1. The text of a sample starts with the first sentence or the first section on the first page of the sampled text in the case of books, and at the beginning of the article in the case of samples drawn from periodicals, journals, newspapers etc., and ends with the sentence containing the 2,000th word.

2. Each corpus text is headed by a line in which the text number is indicated enclosed in comment tags (e.g. **[TXT.A01**] ) and is followed by a line again enclosed in comment tags indicating the number of words in that text (e.g. **[ No. of words = 02008**] ).

3. Headings are coded and included in the texts. The title of the book is often included in the texts; but there are some inconsistencies in this regard, as pre-editing was handled by various persons.

4. Sentences used as "tantalisers" etc. are also included, with blanket comment tags **[BEGIN LEADER COMMENT **] and **[END LEADER COMMENT **]. In this also there are some inconsistencies.

5. Extra-textual material such as maps, charts, diagrams, tables etc. is excluded and represented by descriptive tags.

6. As a rule footnotes are excluded and are represented by a descriptive comment tag **[FOOT NOTE **]. But in the case of texts in which the footnotes were even longer than the body of the text, they have been included.

7. Long foreign quotations and poetry quotations are excluded and represented by descriptive comment tags.

8. Mathematical equations, long formulas etc. are excluded and represented by mathematical symbol codes (see coding key).

9. The text categories are included in the order listed in the tables above.

10. The texts are preserved in card image form (80-character). The first 72 characters of each line contain the text of the sample and the last 7 characters indicate the unique location number. The 73rd character is always a blank.

11. One or more blank spaces (sometimes a whole blank line) separate two words. A word is orthographically defined as a character or sequence of characters surrounded by blank spaces. As in the case of the 1st version of the Brown Corpus, words are often broken at the end of a line (i.e. at the 72nd character). If a word is thus broken, it is continued starting with the first column of the next line. If a word ends at the 72nd column, the first column of the next line is left blank.
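The card-image layout and the word-breaking convention in items 10 and 11 can be illustrated with a minimal Python sketch. The function names `split_card` and `join_cards` are hypothetical and are not part of any software distributed with the corpus; the sketch assumes each input line is a complete 80-column card image.

```python
TEXT_WIDTH = 72  # columns 1-72 hold the corpus text


def split_card(line):
    """Split one 80-column card image into (text, location number)."""
    padded = line.rstrip("\n").ljust(80)
    # Column 73 is always blank; columns 74-80 hold the location number.
    return padded[:TEXT_WIDTH], padded[73:80].strip()


def join_cards(lines):
    """Rebuild the running text from the 72-column fields and split it
    into orthographic words.

    A word broken at column 72 continues in column 1 of the next line,
    so plain concatenation restores it. A word ending exactly at
    column 72 is followed by a blank in column 1 of the next line, so
    it stays separate from the word that follows.
    """
    stream = "".join(split_card(line)[0] for line in lines)
    return stream.split()
```

For example, if one card ends in `... W` at column 72 and the next begins `ORLD IS`, `join_cards` yields the words `WORLD` and `IS`.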

A sample portion of a text as it appears in the print-out is reproduced on the following page.

Sample of print-out

**[TXT. J52**]
**<*3*NATURE,  *MAN ; AND *GOD IN THE *4*VEDAS*O**>$*<31. *2 THE PROBLEM 0010J52
OF CAUSATION*O*> $+3*2^ MAN IS+0 MOST CONCERNED WITH HIS ENVIRONMENT     0020J52
; THE WORLD IN SPACE AND TIME. ^ HENCE, IT IS NATURAL THAT WHEN HE BE    0030J52
COMES REFLECTIVE, HE WANTS TO_ UNDERSTAND THE NATURE OF THIS WORLD. ^ T  0040J52
HE PHYSICAL WORLD SEEMS TO HIM THE PART AND PARCEL OF HIS LIFE. ^ WHE    0050J52
N HE TRIES TO_ UNDERSTAND THE NATURE OF THE PHYSICAL WORLD, THE QUE      0060J52
STIONS THAT COME UP ARE _ _WHO HAS CREATED THIS WORLD; WHAT ARE THE      0070J52
 CONSTITUENT ELEMENTS OUT OF WHICH IT IS CREATED AND HOW IT IS CREAT     0080J52
ED? ^ IN OTHER WORDS, WE WANT TO_ KNOW ITS EFFICIENT CAUSE, THE MATE     0090J52
RIAL CAUSE, AND THE PROCESS OF CREATION. $^ THUS THE PROBLEM OF CAUSA    0100J52
TION IS THE PRIMARY QUESTION IN THE UNDERSTANDING OF THE PHYSICAL WO     0110J52
RLD_ _ OR WHAT WE CALL *NATURE. ^ THE *4*VEDAS, AS IS KNOWN, ARE MORE    0120J52
 POETIC IN THEIR CONTENT THAN LOGICAL. ^ STILL ONE CAN TRACE CERTAIN I   0130J52
MPORTANT IDEAS REGARDING CAUSATION BEHIND THE POETIC IMAGINATIONS. $     0140J52
^ THE PRINCIPLE OF CAUTION IN THE *4*VEDAS, THE EARLIEST LITERATURE      0150J52
 OF THE *HINDUS, SEEMS TO_ APPEAR IN THE CONCEPT OF *4*RTA. *4^ *RTA REP 0160J52
RESENTS THE LAW, UNITY OR RIGHTNESS, UNDERLYING THE ORDERLINESS WE O     0170J52
BSERVE IN THE WORLD.*4^ *RTA , LITERALLY MEANS THE “COURSE OF THINGS“    0180J52
 ^ THIS CONCEPTION SEEMS TO_ HAVE BEEN ORIGINALLY DERIVED FROM THE       0190J52
 REGULARITY OF THE MOVEMENTS OF THE HEAVENLY BODIES LIKE THE SUN, THE    0200J52
E MOON, AND THE STARS, THE ALTERNATIONS OF DAY AND NIGHT AND OF THE      0210J52
 SEASONS. $^ IN THE *4*VEDAS, THERE ARE NO HYMNS ADDRESSED SPECIFICAL    0220J52
LY TO *4*RTA, BUT BRIEF REFERENCES TO THE IMPORTANT CONCEPTS ARE FOU     0230J52
ND REPEATEDLY IN THE HYMNS TO *4*VARUNA (WHO MANTAINS THE PHYSICAL O     0240J52
RDER), *4*AGNI, *4*VISVEDEVAS \OETC. ^ THE FOLLOWING HYMN WILL ILLUST    0250J52
RATE THE POINT: **[VERSE**] $^ GRADUALLY THE CONCEPT OF *4*RTA TAKES     0260J52
A NEW MEANING FROM EXTERNAL PHYSICAL ORDER OR UNIFORMITY OF NATURE_      0270J52
_ IT ACQUIRES THE SIGNIFICANCE OF A MORAL ORDER. ^ THE WHOLE WORLD WA    0280J52


Basic Technical Information.

The corpus is available at cost to bona fide researchers in India from the Department of English, Shivaji University, Kolhapur, and is being made available to bona fide researchers outside India through the International Computer Archive of Modern English (ICAME), at the Norwegian Computing Centre for the Humanities, Bergen, Norway. The material is available on magnetic tape in the following format:

1. The tape has no label.

2. There are 24 files on the tape containing the entire material as shown below:

Sr. No. of   Texts          Sr. No. of   Texts
the file     contained      the file     contained
 1           A01 - A22      13           H01 - H20
 2           A23 - A44      14           H21 - H37
 3           B01 - B27      15           J01 - J27
 4           C01 - C17      16           J28 - J54
 5           D01 - D17      17           J55 - J80
 6           E01 - E20      18           K01 - K29
 7           E21 - E38      19           K30 - K58
 8           F01 - F22      20           L01 - L24
 9           F23 - F44      21           M01 - M02
10           G01 - G25      22           N01 - N15
11           G26 - G50      23           P01 - P18
12           G51 - G70      24           R01 - R09
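
The file table above can be treated as a simple lookup structure. The following Python sketch is illustrative only; `FILE_RANGES` and `file_for_text` are hypothetical names, and the ranges are copied directly from the table.

```python
# File number -> (category letter, first text, last text), from the table.
FILE_RANGES = {
    1: ("A", 1, 22),   2: ("A", 23, 44),  3: ("B", 1, 27),
    4: ("C", 1, 17),   5: ("D", 1, 17),   6: ("E", 1, 20),
    7: ("E", 21, 38),  8: ("F", 1, 22),   9: ("F", 23, 44),
    10: ("G", 1, 25),  11: ("G", 26, 50), 12: ("G", 51, 70),
    13: ("H", 1, 20),  14: ("H", 21, 37), 15: ("J", 1, 27),
    16: ("J", 28, 54), 17: ("J", 55, 80), 18: ("K", 1, 29),
    19: ("K", 30, 58), 20: ("L", 1, 24),  21: ("M", 1, 2),
    22: ("N", 1, 15),  23: ("P", 1, 18),  24: ("R", 1, 9),
}


def file_for_text(text_id):
    """Return the tape file number holding a text such as 'J52'."""
    category, number = text_id[0].upper(), int(text_id[1:])
    for file_no, (cat, first, last) in FILE_RANGES.items():
        if cat == category and first <= number <= last:
            return file_no
    raise KeyError("no file contains text %r" % text_id)
```

With this table, text J52 (the sample reproduced earlier) is found in file 16.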

3. Each text is separated from the next by a blank record in the file.

4. The text is divided into 80-character lines, as follows:

a) Text: 72 characters

b) Location number: 8 characters (a blank followed by the 7-character location number)

5. Each tape record (block) contains 10 lines, except that the last block in each file may contain fewer than 10.

6. The character code used is ASCII and the material is recorded in 9-track, 1600 fpi density.
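
As a rough illustration of the blocking described in items 3 to 5, the Python sketch below splits tape blocks into 80-character lines and groups the lines into texts. It rests on two assumptions not stated in the manual: that the blocks have already been read into memory as ASCII byte strings, and that the "blank record" separating texts is an 80-character line consisting entirely of blanks. The function names are hypothetical.

```python
LINE_LENGTH = 80  # characters per line in the original version


def lines_from_blocks(blocks):
    """Yield 80-character lines from an iterable of tape blocks (bytes).

    A full block holds 10 lines (800 characters); the last block of a
    file may be shorter.
    """
    for block in blocks:
        text = block.decode("ascii")
        for start in range(0, len(text), LINE_LENGTH):
            yield text[start:start + LINE_LENGTH]


def split_texts(lines):
    """Group lines into texts, using an all-blank line as separator.

    Assumption: a blank line inside a text still carries its location
    number in columns 74-80, so only a true separator is wholly blank.
    """
    texts, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            texts.append(current)
            current = []
    if current:
        texts.append(current)
    return texts
```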

Note on copyright.

Permission to use materials under copyright was sought through a form letter sent under certificate of posting. We are glad to say that most copyright holders responded promptly. Reminders were sent to those who had not responded for over three months, in which they were given the option of not answering the letter if they had no objection to our using the materials.

Individual acknowledgements are made in the notes on text extracts. In the case of those who have not responded so far no such acknowledgement appears.

References

Bansal, R. K. 1969. The intelligibility of Indian English. Monograph No. 4, CIEFL, Hyderabad.

Desai, S. K. 1974. Experimentation with language in Indian Writing in English (Fiction). Monograph of the Dept. of English, Shivaji University, Kolhapur.

Francis, W. N. and Henry Kucera. 1964. Rev. 1979. Manual of information to accompany a standard corpus of present-day edited American English. Dept. of Linguistics, Brown University, Providence, R. I.

Johansson, Stig, G. N. Leech and Helen Goodluck. 1978. Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English. Dept. of English, University of Oslo, Oslo.

Kachru, B. B. 1961. An analysis of some features of Indian English: A study in linguistic method. Unpublished Edinburgh thesis.

----------- 1965. The Indianness in Indian English. Word 21: 391-410.

----------- 1975. Lexical innovations in South-East Asia. International Journal of the Sociology of Language. Vol. 4. Mouton, The Hague.

----------- 1979. The new Englishes and old models. Newsletter, January 1979, CIEFL, Hyderabad.

----------- 1981. The pragmatics of non-native varieties of English. In Larry Smith (ed.), English for cross-cultural communication. Macmillan, 15-39.

Nihalani, Paroo, R. K. Tongue and Priya Hosali. 1979. Indian and British English: A handbook of usage and pronunciation. O. U. P., New Delhi.

Shastri, S. V. 1978. English word-meanings and their American and Indian variants: a study of six lexical items. Unpublished Lancaster University thesis.

ADDENDA

1. Coding key:

Some of the details of the coding key described in this manual are irrelevant for users of the corpus in the form in which it is now being made available. The original version was in a 64-character set, i.e. the text was in all capitals; but the present version is in a 96-character set, i.e. the text is in upper and lower case characters. Hence the asterisk ( * ) used as a code to indicate word-initial capitals everywhere in the text is irrelevant. Similarly the code for all capitals (*2….0) is also irrelevant, as capitals appear as capitals and lower case letters as lower case letters. However, the code for sentence-initial capital ( ^ ) has been retained though it is redundant. The asterisk as a word-initial code has also been retained when a word with an initial capital occurs at the beginning of a sentence. This has been done to facilitate machine processing of the corpus texts.

2. The material and its organisation:

The record length in the present version of the corpus is 100 characters and not 80 as in the original version. The unique location number of seven characters has been shifted to the beginning of the line and the text begins at the 9th character of each line. Only one blank space separates two words, except that any number of blank spaces may occur at the end of a line.
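
A line in the present layout can be taken apart as in the minimal Python sketch below. The function names are hypothetical. The division of the location number into a four-digit line number and a text identifier is an assumption inferred from the printed samples (e.g. 0181A03) and is not stated in the manual.

```python
LOCATION_WIDTH = 7  # the location number occupies characters 1-7
TEXT_START = 8      # zero-based index of the 9th character


def parse_record(record):
    """Split one record of the present version into (location, text)."""
    record = record.rstrip("\n")
    location = record[:LOCATION_WIDTH]
    # Any number of blanks may follow the text, so strip them.
    text = record[TEXT_START:].rstrip()
    return location, text


def split_location(location):
    """Assumed layout: four-digit line number, then the text identifier."""
    return int(location[:4]), location[4:]


location, text = parse_record(
    "0040A03 address to the Royal Institute of International Affairs in London"
)
words = text.split()  # words are separated by a single blank
```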

3. Basic technical information:

The file organization remains unchanged, but each line contains 100 characters and each block contains 80 lines, except that the last block in any file may contain fewer than 80 lines. The rest of the technical details remain unchanged.

The coding for Greek letters stands modified as follows:

*Y to mark lower case letters and *Z to mark upper case letters of the Greek alphabet.

The coding for mathematical symbols also stands modified as follows:

*Mn = mathematical notation

*Ms = mathematical symbol

*Mf = conventional mathematical figure

*Me = mathematical equation. 
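
The correspondence between the original and the modified codes for Greek letters and mathematical material can be written out as a small table. The Python sketch below is illustrative only: `OLD_TO_NEW` and `update_codes` are hypothetical names, and the sketch assumes that the six old codes never occur as part of any other code sequence.

```python
import re

# Original code -> modified code, as listed in this manual.
OLD_TO_NEW = {
    "**Y": "*Y", "**Z": "*Z",
    "**MN": "*Mn", "**MS": "*Ms", "**MF": "*Mf", "**ME": "*Me",
}

# Matches only the six codes above; comment tags such as **[ ... **]
# and heading tags such as **< ... **> are left untouched.
_OLD_CODE = re.compile(r"\*\*(?:MN|MS|MF|ME|Y|Z)")


def update_codes(text):
    """Rewrite original-style Greek and mathematical codes in a string."""
    return _OLD_CODE.sub(lambda match: OLD_TO_NEW[match.group(0)], text)
```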

  **[ text. a03** ]
0010A03 **<*3 Creeping Detente In Africa**> $"^ *DETENTE, " said \0 Dr Bruno
0020A03 Kreiski, Chancellor of Austria (which alongwith Switzerland and
0030A03 Sweden is one of the three official neutral States in Europe ) in his
0040A03 address to the Royal Institute of International Affairs in London
0050A03 on July 4, "is not the consequence of sublime human insight but simply
0060A03 a result of a state of military balance". ^ This realistic definition
0070A03 of a state of relationship between the Soviet bloc and Western Europe,
0080A03 \0US and Canada, which has been widely criticised in the West
0090A03 as tattered by developments in Africa ( and Afghanistan and South Yemen )
0100A03 explains the about-turn in Western policy on Angola that_ appears
0110A03 to_ be taken place now quietly and even secretively. $^ It was just a
0120A03 month ago, following the incursion of Katangan exiles to the mineral rich
0130A03 Shaba province of Zaire and the massacre of whites in Kolwezi,
0140A03 that Western Europe, backed by the \0US, were planning the establishment
0150A03 of a pan-African force ( armed and founded by the West ) to_ protect
0160A03 states threatened by "Soviet-Cuban" ventures. ^ President d*’3Estaing
0170A03 of France, after his French Legionnaires repelled the Katangans and
0180A03 rescued the surviving whites in Kolwezi was hailed as "the \Gendarme
0181A03 of
0190A03 Africa." The \0US later supplied transport planes to_ ferry the units
0200A03 formed from Morocco, Senghor and some other former French colonies
0210A03 to the Shaba Province. ^ Meetings were held in Brussels at which the western
0220A03 countries considered how to _ strengthen the economy and the security
0230A03 forces of President Mobuto. It appeared that the detente was to_ give
0240A03 place to an east-west confrontation in Africa; that the Western hawks
0250A03 were prevailing over the doves among them the British Prime Minister
0260A03 Callaghan and some of his OEEC colleagues notably Holland and
0270A03 Denmark. $*<&*3Grave Concern*>$^ The developing situation today projects
0280A03 a completely different picture and is generating grave concern to

N.B. £ symbol appears as \ (backslash) in this printout.

FOOTNOTES

1 Kachru's samples are drawn almost entirely from "creative writings", although he has his own reasons for doing so, while Nihalani et al. is based on "available" samples.

2 The idea of compiling a parallel corpus of Indian English suggested itself when the present author was doing a comparative study of some lexical items in American, British and Indian English in Lancaster in 1977. He used the Brown and the LOB Corpora for the American and British English meanings, but had to make do with available samples for the Indian English meanings (Shastri, 1978).

3 Personal Communication.

4 We are thankful to the Director, National Library, Calcutta and particularly to Dr. M. N. Nagraja, Dy. Librarian, Miss Anima Das of the Processing Section, M/s V. Kotnala and A. B. Roy of the Reprography Division for their assistance in carrying out this job.

5 We are thankful to Miss Chitra Mallik for assistance in carrying out this job.

6 We are thankful to the Librarian and particularly to Mr. Vaitee for agreeing to hand over these issues to us at a cost.

LIST OF TEXT EXTRACTS

A B C D E F G H J K L M N P R


Updated 5.2.1998  Jørn Thunestvedt