
A Guide to Corpus Linguistics

Imprint: Department of English Language and Linguistics
Chair: Prof. Dr. Joybrato Mukherjee
Viktoria Künstler, Patrick Maiwald, Sven Saage

1) An introduction to corpora
   1.1) What is a corpus?
   1.2) Types of corpora
   1.3) What can we do with corpora?
   1.4) What do we need corpora for?
   1.5) Working with a corpus
   1.6) Why use corpora in the classroom?
   1.7) Five steps to achieve corpus literacy
   1.8) Integrating corpora into the classroom

2) Using AntConc (Version 3.2.1w) step by step
   2.1) Upload the relevant texts
   2.2) Concordance
   2.3) Word List
   2.4) Key Words
   2.5) Create your own corpus
   2.6) Useful definitions
   2.7) Other Concordancers
      2.7.1) The Compleat Lexical Tutor
      2.7.2) The BNCweb

3) The practical application of corpora in the classroom
   3.1) Corpora in linguistics
      3.1.1) Examining football language
      3.1.2) Examining Hip Hop language
      3.1.3) Observing gender differences in language use
      3.1.4) Analysing a variety of a language
      3.1.5) Checking the prosody of a word
      3.1.6) Surveying the language of companies’ mission statements
      3.1.7) Comparing phrasal constructions in Indian English, Hong Kong English and British English
      3.1.8) Observing political texts in British newspapers
      3.1.9) Consulting the Compleat Lexical Tutor
      3.1.10) Creating specific vocabulary lists
      3.1.11) Checking your own writing style
   3.2) Corpora in literary studies
      3.2.1) The use of concordance plots
      3.2.2) An analysis of Shakespeare’s King Lear
      3.2.3) An analysis of Wilde’s The Picture of Dorian Gray
      3.2.4) An analysis of Sterne’s A Sentimental Journey through France and Italy
      3.2.5) Further suggestions of what can be done with corpora

4) Bibliography

1) An introduction to corpora

1.1) What is a corpus?

A corpus is a collection of texts, written or spoken, stored on a computer. Corpora are compiled for a particular purpose. If you want to know more about your own writing style in personal letters to friends, you can take all the letters you have ever written to your friends, digitalise them, put them into a file on your computer, and what you get is your personal corpus – a letter corpus. Or, if you wanted to find out more about J. K. Rowling's style of writing, you could take all seven Harry Potter novels, digitalise them and store them in a file on your PC. What you would get is a very specialised corpus of the language used by Rowling in Harry Potter. In contrast to these rather simple corpora, much more thought has been devoted to the process of compiling linguistic corpora. In fact, most linguistic corpora are huge collections of texts, either written text or transcribed speech (spoken language that was recorded and then written down), and they can cover a wide range of

• varieties of a given language: e.g. a corpus including texts written or spoken by English native speakers, speakers of English as a Second Language (e.g. English in India) or learners of a language (e.g. German learners of English),

• periods: there are many corpora covering Present-Day English (the English language as it is spoken today) and the Modern English period, but only a handful of corpora exist that cover the Old English, Middle English and Early Modern English periods,

• sizes: there are huge corpora such as the British National Corpus (BNC) with its 100 million words or the Collins Birmingham University International Language Database (COBUILD) corpus with its 450 million words, and there are comparatively tiny ones such as the London-Lund Corpus (LLC), which consists of only 500,000 words.
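Assembling a simple corpus of the letter-corpus kind takes only a few lines of code. The following Python sketch is merely an illustration: the folder name "my_letters" is a placeholder for wherever your digitised texts are stored.

    from pathlib import Path

    # Read every plain-text file in a folder into one dictionary of
    # documents, keyed by file name. "my_letters" is a placeholder.
    def load_corpus(folder):
        corpus = {}
        for path in sorted(Path(folder).glob("*.txt")):
            corpus[path.name] = path.read_text(encoding="utf-8")
        return corpus

    letters = load_corpus("my_letters")
    total_words = sum(len(text.split()) for text in letters.values())
    print(f"{len(letters)} texts, about {total_words} running words")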

However, the most crucial decision in the course of corpus compilation is the choice of the types of texts to be included. In fact, a random collection of texts does not constitute a corpus. A corpus must consist of texts chosen according to certain principles, and its usefulness will be judged on how well these principles are represented. For instance, a corpus could be compiled to reflect the language of a typical middle-class speaker of British English. A collection of texts from linguistic magazine reviews would not be representative of this typical middle-class speaker of British English, because certain linguistic technical terms would occur frequently, while other words which appear in everyday language would presumably not occur at all. As a result, in this case a good corpus should cover many different genres, including informal and formal texts, and should ideally include both spoken and written texts. This corpus would then reflect the language which is actually used by this typical middle-class speaker of British English. It could subsequently be used as a standard reference when analysing other texts, such as written essays from learners of English at the University of Giessen. The International Corpus of English – Great Britain (ICE-GB) is such a reference corpus. It consists of written and spoken language with many different subgenres, some of which are shown in the figure below.

[Figure: text categories in ICE-GB. Spoken: direct conversations, broadcast interviews, parliamentary debates, ...; Written: business letters, press news reports, novels/stories, ...]

With its various types of written and spoken texts it is representative of average speakers of British English from the early 1990s. The next figure displays some of the most important linguistic corpora. The first seven corpora in the list are synchronic, which means that they cover only texts from the year that is indicated. COBUILD is a monitor corpus. Monitor corpora are usually compiled in the same way as reference corpora, but in contrast to them, they are updated regularly to represent the current language in use. This means that older texts are removed from the corpus, while newer texts are added to it. The last four corpora are diachronic, meaning that they contain texts from the periods of time indicated; they help to observe how the English language has changed over time. The term "variety" in the table indicates which variety of English the respective corpus consists of: American English/British English (that is, English as a Native Language) and English as a Second Language varieties. The columns "Spoken" and "Written" indicate whether the corpus consists of spoken or written language. In some cases a corpus contains both spoken and written texts – for example, the BNC consists of 10% spoken and 90% written texts. In the case of the ICE Corpus Collection, "ongoing" means that some of the ICE corpora have already been compiled (such as ICE-GB), while others are still being compiled and have not been released yet (such as ICE-USA).

1.2) Types of corpora

The corpora shown above are reference corpora which were all carefully compiled by linguists and are very useful for linguistic purposes. Nonetheless, there are also other types of corpora, some of which will be commented on in the following. Special corpora do not contain standard language as a whole, but rather very specific language. The emphasis is not placed on analysing standard language, but on examining very specific phenomena. Special corpora can be compiled either according to a very specific genre, like the Wolverhampton Business English (WBE) corpus, which consists only of written business texts, or according to a very specific subgroup of a language community, such as the Bergen Corpus of London Teenage Language (COLT), which consists entirely of transcribed spoken language of London teenagers from 1993. Special corpora are usually rather small in comparison to reference corpora. This is not a major drawback, since the texts included are so specific that there is no need for many instances of a search query in order to prove or disprove

research questions. In this case, the right interpretation of the results is much more important than the quantity of hits. With the enormous boost of computers and the Internet, a new type of corpus has emerged within the last few years: the Internet corpus. The World Wide Web serves as a language database from which texts are obtained and assembled to form a corpus. The good news is that everyone can easily compile his or her own corpus, since the Internet is freely accessible and texts can be copied and pasted. The bad news is that, although the number of texts is incredibly huge and the texts can easily be copied, the Internet needs to be used with caution. Issues such as unknown authorship, the cloning of web pages and their constant addition and deletion are massive drawbacks which make working with Internet texts a complicated matter. Nonetheless, the Internet also has its benefits if used as a corpus – for instance, when looking at very rare or newly coined words which might not occur in a standard corpus. However, it is important that anyone who chooses to use the Internet as a corpus pays attention to these disadvantages when compiling their own corpus. Only then will an Internet corpus be beneficial for its users. Parallel corpora are combinations of two or more corpora in different languages. They contain texts which have been translated from one language into another. For example, a French newspaper article is translated into English and included in the English corpus, while an English article is translated into French and incorporated into the French corpus. With parallel corpora it is possible to compare direct translations. This type of corpus is particularly interesting for translators and for researchers in comparative linguistics. Historical (or diachronic) corpora consist of texts from different periods of time. With their help the development of a language can be traced, and grammatical and lexical changes can be observed. Learner corpora are a subtype of special corpora and usually consist of texts written by non-native learners of a foreign language. For example, a learner corpus can consist of term papers written in English by German first-semester university students. The comparison of a learner language corpus with a native speaker corpus can reveal the major problems learners have with the acquisition of the foreign language. Thus, it can be helpful for learners of a language to work with a learner corpus. This is known as the concept of negative evidence. If learners are confronted with concordances presenting the typical mistakes of learner language, they may

become aware of the patterns in their mistakes. This will then initiate a process of knowledge restructuring, with the effect that the learners avoid repeating the mistakes in the future. There are many more types of corpora, the discussion of which would go beyond the scope of this introduction. For more information on types of corpora, see Hunston (2002) and O'Keeffe, McCarthy & Carter (2007).

1.3) What can we do with corpora?

Strictly speaking, with a corpus itself we can do nothing at all, since it is nothing other than a store of texts. For this reason, we need corpus access software, which helps us rearrange that store so that observations of various kinds can be made. A corpus is just the source, the text you wish to analyse. In order to work with a corpus, you need a specific tool that looks for the words or particular patterns you are interested in. This tool is a so-called concordancer. A concordancer works much the same way as an Internet search engine. In fact, an Internet search engine IS a kind of concordancer that uses the World Wide Web as a giant corpus. You type in the word/pattern you are looking for, and the concordancer searches the corpus for this word/pattern and lists all the instances in the middle of your computer screen with some context before and after. Here is an example of what the result looks like – the first screenshot is a concordance line in Google, while the second is a concordance line displayed in a linguistic concordancer. It can already be seen that, in contrast to Google, a linguistic concordancer is more orderly: it creates a neat list of all the instances of the phrase “example of” and hence gives a better overview of all the hits.


Concordance line in Google

Concordance line in a linguistic concordancer

What you can see here is what the concordancer gives you for the search word “example”. The different lines (here: 58-70) are called “concordance lines”. The search word is also called the “NODE” or “N”; the words to the left and to the right of it are called “N-1”, “N-2”, etc. and “N+1”, “N+2”, etc., respectively.
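A minimal concordancer of this kind can be sketched in a few lines of Python. The snippet below is a toy illustration of the KWIC idea, not the way AntConc is actually implemented; the sample text is invented.

    import re

    # A minimal KWIC (KeyWord In Context) sketch: list every occurrence
    # of a node word with a few words of context on either side.
    def kwic(text, node, window=4):
        tokens = re.findall(r"\w+|[^\w\s]", text.lower())
        lines = []
        for i, tok in enumerate(tokens):
            if tok == node:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>35}  {node.upper()}  {right}")
        return lines

    text = "This is an example of a corpus. Another example of usage follows."
    for line in kwic(text, "example"):
        print(line)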


The concordancer can do more than just list the examples randomly. The programs WordSmith Tools or AntConc, for example, can re-sort the context (either to the left or to the right of the node) so that it is possible to look for patterns and regularities in the use of a certain word.

A sorted concordance line
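The 1R sort can likewise be imitated in code. The sketch below is a hypothetical illustration: each hit is stored as left context, node and right context, and the hits are ordered by the token at position N+1.

    # Sorting concordance hits by the first word to the right of the node
    # (the "1R" sort in AntConc). Each hit is a (left, node, right) tuple
    # of token lists, so any N+1/N-1 position can drive the sort.
    hits = [
        (["a", "good"],   "example", ["of", "this"]),
        (["another"],     "example", ["is", "given"]),
        (["the", "best"], "example", ["among", "many"]),
    ]

    # key: token at position N+1 (empty string if the node ends the text)
    for left, node, right in sorted(hits, key=lambda h: h[2][0] if h[2] else ""):
        print(" ".join(left), node.upper(), " ".join(right))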

1.4) What do we need corpora for?

Let me start with a quote from Francis (the man who compiled the famous Brown Corpus):

In 1962, when I was in the early stages of collecting the Brown Standard Corpus of American English, I met Professor Robert Lee at a linguistic conference. In response to his query about my current interests, I said that I had a grant from the U.S. Office of Education to compile a million-word corpus of present-day American English for computer use. He looked at me in amazement and asked, ‘Why in the world are you doing that?’ I said something about finding the true facts about English grammar. I have never forgotten his reply: ‘That is a complete waste of your time and the government’s money. You are a native speaker of English; in ten minutes you can produce more illustrations of any point in English grammar than you will find in many million words of random text.’ (Francis 1992: 17f.)

The problems Francis had in defending the merits and value of corpora for describing real language use are well known to any corpus linguist. In addition, non-corpus linguists might be on Lee’s side as well. However, since the first corpus was used to find out about how people really use language, we have seen that native-speaker intuition is not really a good source for describing how people really use

language. The observations linguists have made with the use of corpora have repeatedly shown that there are routines in language use that no one was aware of until then. Two examples:

1.) A lot of grammar books tell you that you should never use would in if-clauses. What the ICE-GB corpus brought to light, though, is that would IS in fact used in if-clauses, namely in combination with I would be grateful if:

a) I would be grateful if you would contribute in the maintaining of the Squadron’s morale.
b) I would be grateful if you would let me know how much it would cost […]
c) I would be grateful if you would supply me with […]
d) I would be grateful if you would let me have […]

This example does not tell us that all grammar books are wrong. But it tells us that in this very combination with “X would be grateful if”, the use of would in the subordinate clause is idiomatic.
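A pattern like this can also be hunted for directly with a regular expression. The following sketch is merely illustrative – the regex is a rough approximation of the pattern, and the sample sentences are invented.

    import re

    # A rough regex check of the pattern observed in ICE-GB: "X would be
    # grateful if ... would". In practice you would run this over a
    # corpus file rather than a handful of invented sentences.
    pattern = re.compile(r"\bwould be grateful if\b[^.?!]*\bwould\b",
                         re.IGNORECASE)

    samples = [
        "I would be grateful if you would let me know the cost.",
        "If it rains, the match will be cancelled.",
        "We would be grateful if the committee would reconsider.",
    ]
    for s in samples:
        if pattern.search(s):
            print("match:", s)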

2.) Some non-corpus-based dictionaries say that the two adverbs utterly and absolutely are synonyms. What corpus linguistics found out is the following (extract from a concordance by Louw 1993: 160):

 1  Farmers were      utterly against the union
 2  the union and     utterly against the Wages Board
 3  never seen,       utterly blackened now
 4  it gets           utterly confused
 5  are not           utterly convincing. Miguel
 6  hopes appeared    utterly demolished in 1956
 7  the view was      utterly different. The filmy enchant
 8  would be an       utterly different kind of programme
 9  not               utterly disconfirming the tale
10  which is          utterly meaningless to
11  think how         utterly obsessed by
12  avil or           utterly out of line
13  found it          utterly ridiculous
14  are               utterly stupid
15  I’d be            utterly terrified to go up
16  it is             utterly unreasonable to suppose
17  had been          utterly unsympathetic

→ After observing the collocates to the right, it can be noted that utterly shows a negative prosody (see the definition of semantic prosody below).

These two examples show that corpora provide a way of looking at language that goes beyond intuition. Native-speaker intuition alone would never have come to a conclusion like that. It is only with corpora that you come to find routines in language use like these. Corpora give a very good insight into colligations, collocations, lexicogrammatical patterns and the semantic prosody of a word:

• Def. Collocation: the statistically significant tendency of two or more words to co-occur, such as example + of.

• Def. Colligation: collocation patterns based on syntactic groups rather than individual words, e.g. I don’t know + WH-word, as in I don’t know WHere/WHy/WHat/WHen/WHo, ...

• Def. Lexicogrammatical Pattern: collocational strings, e.g. BE + difficult + for + NOUN GROUP + TO-INF.

• Def. Semantic Prosody: certain seemingly neutral words can come to carry positive or negative associations through frequently occurring with particular collocates. If you use the word provide, the result is one of a POSITIVE character (i.e. provide medical care, provide ideas, provide a service, provide an answer), whereas utterly carries a NEGATIVE one (utterly against, utterly destroying, utterly ridiculous, utterly unsympathetic, utterly stupid, utterly unreasonable, ...).
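Raw collocate counts of the kind underlying these definitions can be computed with a few lines of Python. The sketch below is a toy illustration: it only counts co-occurrences in a ±3-word window and omits the significance measures (e.g. mutual information or log-likelihood) that real collocation studies would add; the sample text is invented.

    from collections import Counter
    import re

    # Count collocates of a node word within a +/- 3-word window.
    def collocates(text, node, span=3):
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == node:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
                counts.update(window)
        return counts

    text = ("It is utterly ridiculous and utterly unreasonable to suppose "
            "that the plan was not utterly confused from the start.")
    for word, freq in collocates(text, "utterly").most_common(5):
        print(word, freq)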

→ Corpora show us the “preferred way of putting things” and thus provide an insight into native-like selection. However, you can also “just”:

• look for the meaning of a word (by browsing through the concordance lines),

• compare two similar words (see whether or not these seemingly similar words have different semantic prosodies, collocational or colligational preferences, etc.),

• compare how a word is used in different kinds of texts:
  o How do different authors use a particular word?
  o How did the meaning of a word change over time?
  o What makes Fachsprache (specialised language) Fachsprache?
  o How do learners of a language use this language, and where do we see typical mistakes, overuse, underuse, over-representation or under-representation of certain forms/words? (The analysis of learner corpora is a relatively new area in corpus linguistics, but it has already given some useful insights into language acquisition and language learning mechanisms, and into universal and language-specific mistakes.)

• create word lists and see how often certain words are used,

• create keyword lists.

1.5) Working with a corpus

Before you start working with a corpus, you have to decide which corpus will best serve your purposes. Although corpora can be a very valuable source for the improvement of language skills, it is not sufficient to pick any corpus at hand or to compile a corpus at random. The selection of the right corpus for the right purpose is vital in order to get meaningful results. So first of all, you should ask yourself for what purposes the corpus will be needed. Do you want to create your own corpus, or do you want to work with corpora that have already been compiled? There is quite a variety of corpora to choose from, and every single one of them is useful for a specific purpose. It all depends on what you want to analyse. Below is a checklist of important questions to ask yourself before you start working with a corpus.

Important questions:
1. Do I want to create my own corpus or use a corpus which has already been compiled? If I choose to compile my own corpus, are there any copyright issues which I have to consider?
2. Do I need any of the existing corpora as a reference corpus?
3. Which kinds of genres do I want to include in my analysis?
4. Do I want to make a synchronic or a diachronic analysis?
5. Do I want to look at American, British, or other varieties of English?
6. Do I want to look at spoken or written language, or both?
7. How big does the corpus need to be?

Once these questions are clarified, it should be much easier to limit the choice of corpora that are really helpful for the respective analysis. As a next step, you should examine whether the corpora you need for your analysis are available, and if not, whether it is possible to gain access to them.

1.6) Why use corpora in the classroom?

The inclusion of corpora in the English as a Foreign Language (EFL) classroom is a concept which has been promoted by numerous linguists for many years. As early as the beginning of the 1990s, Johns & King developed corpus-linguistic exercises for students. They called this form of learner-centred, autonomous, software- and computer-assisted approach to language learning Data-Driven Learning. They defined this concept of DDL as follows:

[Data-driven learning is] the use in the classroom of computer-generated concordances to get students to explore regularities of patterning in the target language, and the development of activities and exercises based on concordance output. (Johns & King 1991)

The intention behind DDL is that teachers guide their students to discover important features of a foreign language on their own. This contrasts with the traditional teacher-centred concept of rule-driven learning, in which emphasis is placed on grammatical rules and students cannot discover grammatical features on their own; in rule-driven learning, examples are given only to support grammatical rules. Leech (1997: 10) describes how students should ideally study language, and stresses the importance of the notion of the student as researcher.

The critical and argumentative type of essay assignment [...] should be balanced with the type of assignment [...] which invites the student to obtain, organise, and study real-language data according to individual choice. This latter type of task gives the student the realistic expectation of breaking new ground as a “researcher”, doing something which is a unique and individual contribution, rather than reworking and evaluation of the research of others.

The students become researchers who actively contribute to the class session with their findings. In DDL, teachers are no longer solely responsible for the class sessions; rather, they should be considered coordinators who assist the students in doing research. In the ideal case, students will systematically learn how to observe and interpret patterns in the foreign language. This can best be achieved by the exploitation of corpora, since corpora reflect language which is used by real users of a language. Through the observation and analysis of massive data output from corpora, the students will then be able to develop rules for general language use.

1.7) Five steps to achieve corpus literacy

The acquisition of some kind of ‘corpus literacy’ is the central prerequisite for a successful implementation of DDL activities in the English language classroom. But what does corpus literacy actually refer to? Before learners can successfully use corpora for their own purposes, they have to be familiarised with the various tools for corpus analysis. Thus, teachers first of all have to show their students all significant operations and, as a next step, have them perform these operations in practice. The more time learners spend on corpus work, the better their corpus skills will be. In the ideal case, learners will no longer need the guidance of their teachers and will continue to work on corpus projects on their own. This ideal has been called corpus literacy by Mukherjee (2002: 179). For him, this term denotes the learners’ competence to work autonomously with a corpus. It should be a desirable goal for every teacher to teach their students corpus literacy. Weskamp (2001: 82) focuses on the improvement of learner autonomy in general and proposes five steps in order to reach it. These five steps or phases can also be applied to the concept of corpus literacy:

1. Raising awareness
2. Involving the learner
3. Intervention of the learner
4. Autonomy of the learner
5. Usage outside of the classroom

These five steps will now be explained in more detail, with concrete examples of how they can be applied in order to gradually achieve corpus literacy.

1. Raising awareness

As a first step, the learners must be familiarised with fundamental concepts in corpus linguistics, such as authenticity and representativeness. Furthermore, the most important corpora and tools which help to exploit them should be introduced. Finally, the teachers should highlight that language routines, such as lexicogrammatical patterns, are very important for learners of a language if their aim is to use the foreign language more idiomatically.

2. Involving the learner

Once the learners have gained basic insights into corpus linguistics, it is time for them to work with corpora on their own. In this phase, however, teachers still have to guide them and be prepared to help the learners in case they encounter any problems. It makes perfect sense that in this first step of practical work the teachers instruct the learners as to what to examine and how to proceed. Thus, they can ensure that they will get some results and help them interpret their findings correctly.

3. Intervention of the learner

Once the learners feel safe enough to perform the most important corpus operations, they should be enabled to participate in the choice of linguistic phenomena to be looked at and of the methodology with which to analyse them. In addition, the learners should have the freedom to perform further analyses based on their findings.

4. Autonomy of the learner

In this phase the teachers’ guidance should be reduced to a minimum. The learners themselves should determine their research questions and the methods they want to apply. Subsequently, they should work with corpora autonomously and present their findings. In the end, they should be able to draw the right conclusions from their findings.

5. Usage outside of the classroom

If the learners have reached the stage in which they can autonomously work with corpora, this is already a great achievement for the teachers. However, the ultimate goal should be that the learners use corpora outside of the classroom setting as well. If the learners comprehend how they can exploit corpora in other areas, the

teachers have been entirely successful. If it goes without saying that learners use corpora as a supporting tool in addition to dictionaries or if they use them to check their own writing style or to improve their communicative competence, then the full potential of corpora has been exploited.

1.8) Integrating corpora into the classroom

Now that we have hinted at how corpora can potentially be exploited, it is time to examine how this theoretical framework may actually be applied in the EFL classroom from the perspective of a language teacher. For this purpose, we will introduce four phases which will help teachers to successfully implement corpora in the EFL classroom:

1. Planning phase
2. Selection phase
3. Application phase
4. Evaluation phase

Four phases of corpus integration

These phases are highly interdependent and will be introduced in detail in the following:

1. Planning phase

As a first step, teachers have to contemplate which goals they want to reach by using corpora in the classroom. Do they intend to let their students actually work with a corpus, or will they rather let them work with material they have created, based on corpus findings? If they are interested in letting their students analyse their own mistakes, they may, for example, create a corpus consisting entirely of the students’ English essays and compare this corpus with a reference corpus of native-speaker English, such as the ICE-GB corpus. If they want to make their students aware of a specific type of language, like the language of political debates, they can compile a corpus consisting only of political debates and let the students analyse them. If they are interested in enhancing the students’ business vocabulary, they may compile a corpus consisting solely of business texts and create a keyword list so that mostly specific business terms would be listed.

After they have decided on their goals, they have to consider how much time they want to spend on corpus work. In many cases teachers are restricted by their curriculum because a certain number of topics has to be covered in the course. This can sometimes pose a problem, since corpus work requires time. Students have to become acquainted with corpora and need to practise working with them if their aim is to reach corpus literacy. This cannot possibly be managed within the time frame of a single session of 90 minutes. Time is always a critical factor, and teachers will have to decide whether or not they are willing to spend several sessions on corpus work. Sometimes there might not even be the technical possibility of letting students work with corpora, e.g. if the university’s hardware capacities are not sufficient. However, computers are indispensable for corpus work; a computer lab is therefore needed if students are supposed to work with a corpus on their own. In addition, teachers must have access to the kinds of corpora relevant for their courses, which can sometimes cause problems because the majority of reference corpora have to be purchased. Facing some of these problems, teachers might decide that it is simply impossible for them to let their students work directly with corpora. Thus, it can sometimes make perfect sense to use corpora without letting the students browse them; instead, teachers can prepare certain corpus-related tasks at home. This is only a “small” solution, but it is certainly better to work at least with real-life examples than to take all tasks from grammar books, in which examples originate from the writers’ intuition. In the ideal case, however, teachers will have at least two or three sessions at their disposal, so that their students have plenty of time to become acquainted with the most important functions of linguistic concordancers and practise compiling and analysing their own corpora. Finally, they can work autonomously on their own corpus project. Once teachers have taken all these issues into consideration and outlined a basic plan, they have to select the corpora and the tools they want to use.

2. Selection phase

As has been explained in chapter 1.2, there are many different types of corpora teachers can choose from. They all place a different emphasis on certain criteria depending on their individual purposes. Hence, once the teachers have resolved what goals they want to reach with the corpora and sketched out the time schedule, they have to examine closely what benefits the individual corpora offer and which drawbacks they involve. In this case a checklist with some important questions might prove useful, because it can make the choice of the right corpora easier for the teachers. Some basic questions have already been presented in chapter 1.5. With these questions in mind, it should pose no great problem for teachers to find the corpora best suited to their goals. Depending on the tasks they want to conduct, they then have to choose the tools which serve best to reach their goals.

3. Application phase

Once the teachers have prepared the basic settings for their courses, they have to design the tasks they want to carry out in class. The decision as to which kinds of tasks to set has to be made carefully, since many factors such as time and equipment play a role. Do the teachers want to prepare corpora for their students? Or will they let them compile their own corpora? And, if so, where will the students obtain their texts from – from storage media on which the teacher has placed the texts, or from the Internet? In addition, teachers have to decide to what degree they want to introduce corpus work to their students. Will they present all functions of the concordancers, or will they concentrate only on a few functions which match the purpose of the course? In general, they will have to decide on the setup of their corpus sessions. How much time will they devote to the theoretical and how much to the practical parts? Do they want to hand out task sheets to help their students carry out the practical tasks? Lastly, they have to decide on the methods of corpus work. Do they want to conduct individual tasks or rather group tasks? Are their students supposed to work autonomously on a corpus project and present their findings? The teachers’ decisions should be strongly based on the educational level of their students: the more advanced the students are, the more freedom they should have to browse corpora and work on corpus projects on their own.

4. Evaluation phase

Once the corpus sessions are over, it is crucial for teachers to review whether the application of corpora in the classroom has been successful. For this purpose, the students should be able to state their opinions about whether they regarded the use of corpora as helpful. Furthermore, teachers should observe whether the corpus work has had any impact on their students’ motivation to participate actively in their courses. This feedback will give them a more critical insight into their own performance and into potential mistakes which were made in the course of their class. Consequently, they may reconsider some of their approaches in order to avoid such problems in future courses. Ultimately, the evaluation phase always precedes the next planning phase, which results in a circular sequence of actions.

Now that we know for which purposes corpora are needed, it is time to explain how these corpora can be exploited. In the next chapter, a short introduction to the linguistic concordancer AntConc and some important corpus-linguistic definitions will be provided.


2) Using AntConc (Version 3.2.1w) step by step

AntConc is a linguistic concordancer which is freely available on the Internet and hence a very interesting tool for teachers and students alike. It can be downloaded from the homepage http://www.antlab.sci.waseda.ac.jp/. Although there are many operations that AntConc is able to perform, only the most important of these will be explained here, so that basic corpus research can be conducted. For a more detailed description of AntConc’s features, you may consult its manual. As a first step, the program has to be downloaded from the homepage and installed. Then it can be started and is ready to be used.

2.1) Upload the relevant texts

In order to analyse one or more texts, these texts first have to be loaded into AntConc. Click on →File, →Open File(s), choose the text(s) you want to work with and open them.

[Screenshot: 1. Choose the text; 2. Open it]

2.2) Concordance

Now that you have loaded the texts you want to work with, you can choose from a variety of functions. The first of them is the Concordance function. With the help of this feature, specific concordances can be analysed.


The main point of a concordance is to be able to see lots of instances of a word or a phrase, in their contexts. You get a much better idea of the use of a word by seeing examples of it, and it is by seeing or hearing new words in context several times that you come to grasp the meaning of most of the words in your native language. It is also by seeing the contexts that you get a better idea about how to use the new word yourself. A dictionary can tell you a word's meaning(s) but it is not always very good at showing you how to use the word.

After having chosen the relevant texts, you can now type in a search word, which will be searched for in all the text files you have chosen. AntConc will then present a concordance display and give you access to information about collocates of the search word, dispersion plots showing where the search word occurs in each file, cluster analyses showing repeated clusters of words (phrases), etc. Type your word into the 'search term' field and click on →Start.

[Screenshot: 1. Type in a search word; 2. Click on Start]

You will instantly get an unsorted list with the KWIC (keyword in context) – the word you searched for – in the middle of the page. In order to examine certain patterns more easily, the concordances can be sorted with the Kwic Sort function, which sorts your hits according to certain criteria. For example, if you click on →1R under Level 1, all hits will be sorted alphabetically by the first word to the right of the search word you entered. You can additionally re-sort them if you also →activate Level 2 and Level 3: you can then re-sort by 2R, the second word to the right, etc. Click on →Sort and the hits will be re-sorted.

[Screenshots: an unsorted concordance, and the same concordance sorted alphabetically after choosing a sort criterion (1R), activating Level 2 and Level 3, and clicking on Sort]

Using Wildcards

Sometimes you may not want to analyse only a specific word, but also its different forms. Let us say, for example, that you want to see all possible forms of “start”, like “start”, “starts”, “starting”, etc. In this case, you can use the wildcard symbol “*”.


Thus, if you type in →start* as a search word, all the different forms of “start” will be listed.
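For readers who work with scripts rather than AntConc, the “*” wildcard corresponds roughly to the regular-expression pattern \w*. The following Python sketch illustrates this correspondence; the sample sentence is invented.

    import re

    # AntConc's "*" wildcard corresponds roughly to the regex "\w*":
    # "start*" matches start, starts, started, starting, and so on.
    def wildcard_to_regex(query):
        return re.compile(r"\b" + re.escape(query).replace(r"\*", r"\w*") + r"\b")

    rx = wildcard_to_regex("start*")
    text = "We start early; she starts later; they started yesterday; a restart."
    print(rx.findall(text))   # ['start', 'starts', 'started'] (not 'restart')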

File View

File View allows you to look at the whole context of the hit you are interested in. →Click on one of the instances in your list and you will automatically be taken to the File View function.

Click on an instance

24

In the File View function you can see the entire text. If you want to get back to the concordance screen, click on →Concordance.

Click on Concordance to get back to the main screen

Clusters

The Clusters function tells you what words your search word co-occurs with. In the 'Clusters' screen all the clusters containing the search word will be shown, ranked according to their frequency.

Collocates

The Collocates function tells you with which kinds of words your search word co-occurs. It will give you a list of collocates sorted according to their absolute frequency (Freq) and show you whether they are found to the left (Freq L) or to the right (Freq R) of the word.

The collocates are ranked according to their frequency

→ “for example”: 219 instances; → “example of”: 71 instances
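The Clusters function can be approximated in code by counting n-grams that contain the node word. The sketch below is a toy illustration with an invented sample text, not AntConc's actual algorithm.

    from collections import Counter
    import re

    # Count the most frequent word clusters (n-grams) that contain a
    # given node word, similar in spirit to AntConc's Clusters tab.
    def clusters(text, node, n=2):
        tokens = re.findall(r"[a-z']+", text.lower())
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return Counter(g for g in grams if node in g.split())

    text = ("for example this works; for example that works; "
            "an example of patterns; another example of usage")
    print(clusters(text, "example").most_common(3))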

Concordance Plots

Sometimes you might want to see where exactly in the text your search word occurs. Concordance plots can show you this. Just →type in the search word and click on →Concordance Plot. You will see a new screen with a horizontal bar. This bar represents the whole text: the left-hand side is the beginning of the text, the right-hand side is the end. Every time the search word occurs, there will be a black bar; if the word occurs very frequently in a certain place, there will be chunks of black bars. For example, you can look at where the name of the character Laertes occurs in Shakespeare's drama Hamlet.

The black bars indicate where the word is used

Chunks of black bars indicate frequent usage of the word

As a result, we see that the search word "Laertes" is used rarely at the beginning of the text, not at all in the middle part, and very frequently towards the end.

Clear the screen and choose other texts

If you want to clear your results, you have to click on →File, →Clear Tool. The page will then be blank again. If you want to work with other texts, click on →File, →Close File. Then you can choose and load another text.

Tagged files

Sometimes the files you are working with in AntConc are grammatically tagged. These tags will also occur in your findings unless you specify that you do not want them to be shown. Tags are usually numbers, symbols or single letters. Under →Global Settings, →Tag Settings, you can →click on Hide Tags in order to make these tags disappear.

2.3) Word List

Making a word list or performing a keyword analysis can be useful for various linguistic activities, e.g. language teaching, stylistics, content analysis, forensic linguistics, and information retrieval. This function provides frequency information about the vocabulary of the text as a whole – unlike concordances, which focus on individual words and phrases at the level of the sentence or paragraph. While some operations like concordancing can also be performed to a certain extent by search engines like Google, the features of linguistic concordancers go much further. In contrast to Internet search engines, linguistic concordancers can generate word lists and keyword lists. Word lists are listings of all the words in a text according to their frequency. In his manual for WordSmith, Mike Scott, the creator of the linguistic concordancer WordSmith Tools, explains the main reasons for the creation of word lists. Word lists can be used:

1. simply in order to study the type of vocabulary used;
2. to identify common word clusters;
3. to compare the frequency of a word in different text files or across genres;
4. to compare the frequencies of cognate words or translation equivalents between different languages;
5. to get a concordance of one or more of the words in a list. (Scott 2004)

A typical word list is presented in the figure below. The words in the list are ranked according to their frequency. In general, the first hits in every word list are always function words, like articles, prepositions, conjunctions or pronouns. Words with lexical meanings usually follow further down the list.

Figure 2. A typical word list
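Such a frequency list is easy to reproduce in code. The following Python sketch is illustrative only; "my_corpus.txt" is a placeholder for a plain text file of your own.

    from collections import Counter
    from pathlib import Path
    import re

    # A frequency word list like Figure 2: every word in the text ranked
    # by how often it occurs. Function words predictably dominate the top.
    text = Path("my_corpus.txt").read_text(encoding="utf-8")
    tokens = re.findall(r"[a-z']+", text.lower())
    for rank, (word, freq) in enumerate(Counter(tokens).most_common(10), 1):
        print(f"{rank:>3}  {freq:>6}  {word}")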


Before you can make a word list, the relevant texts have to be loaded, as described in 2.1). Then you can click on →Word List. Click on →Start and a word list will appear.

1. Click on Word List

2. Click on Start

The most frequent words are at the top of the list; the less frequently a word occurs, the further down the list it will be. The frequency tells you how often a word occurs in the text. The word list can be saved via →File, →Save Output to Text File.

Words and their frequencies

2.4) Key Words

Another operation a user can perform with word lists is to compare two lists for stylistic purposes and create a keyword list. One of them is assumed to be a large word list which will act as a reference corpus. The other one is the word list to be analysed, which is automatically assumed to be the one created from the smaller of the two corpora chosen. The intention is to find out which words characterise the corpus the user wants to analyse. The larger word list provides background information for the comparison. The concordancer analyses all the words in the corpus and compares their frequencies with the reference corpus. The result is a so-called keyword list. But what does the term 'keyword' actually mean? In his manual, Scott (2004) states: “[k]eywords are those whose frequency is unusually high in comparison with some norm”. Thus, the aim of the keyword function is to find words that are comparatively more or less frequent in one corpus than in the other (reference) corpus. A keyword list can fulfil various useful purposes. Keywords enable us to reveal patterns of 'aboutness' and stylistic features of texts. Keyword lists can also serve to create vocabulary lists. For this purpose, very specific texts can be sampled for the compilation of such a special corpus. For example, one could compile a corpus consisting exclusively of business texts. If this corpus were then compared with a reference corpus, a list with important “key” terms for business language would appear. For the figure below, ten business texts were downloaded from the homepage of The Economist and put together to form a small business corpus. A keyword list was then created, with the ICE-GB corpus as the reference corpus. Not surprisingly, business terms and the names of companies and their chief executives rank highest in the keyword list.

Figure 3. A keyword list of a business text corpus


This list can then easily be copied into MS Word, where vocabulary lists can be created with the help of the 'table' function. In order to produce a keyword list, you first have to load the reference corpus. For this purpose, click on →Tool Preferences, →Keyword List, →Choose Files and choose the reference corpus you want to use. Click on →Apply. Now the reference corpus is loaded.

[Screenshot: 1. Click on Keyword List; 2. Click on Choose Files; 3. Choose your reference corpus and open it; 4. Apply]

As soon as you have loaded the corpus you want to analyse, you can click on →Keyword List and →Start, and a keyword list will be created.

[Screenshot: 1. Load the corpus; 2. click on Keyword List; 3. click on Start]

The result will be a keyword list in which the frequency and the 'keyness' of the words are presented. The higher the keyness of a word, the more often it is used in the corpus you are looking at relative to the reference corpus. If a word occurs very frequently in the first corpus, but hardly or not at all in the reference corpus, the keyness of the word is very high. A positive keyness means that the word occurs more often than would be expected in comparison with the reference corpus; a negative keyness means that it occurs less often than would be expected. In general, it can be said that the higher the keyness value of a word, the more unusual its frequency is. The keyword list can be saved via →File, →Save Output to Text File.

Frequency and Keyness

Ranking of Keywords

In this screenshot many business-related words occur. The high keyness of these words indicates that words like “company”, “executive” or “CEO” occur more often than would statistically be expected. Of course, this has to do with the fact that this particular corpus includes only business texts, while the reference corpus covers all kinds of genres.
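The keyness statistic itself can be computed by hand. The sketch below implements the log-likelihood measure (Dunning 1993), one of the statistics commonly offered by keyword tools; whether a given tool uses exactly this formula should be checked in its manual, and the figures in the example are made up.

    import math

    # Log-likelihood keyness. freq_study/freq_ref are the word's counts;
    # n_study/n_ref are the corpus sizes in running words (tokens).
    def keyness(freq_study, n_study, freq_ref, n_ref):
        expected_study = n_study * (freq_study + freq_ref) / (n_study + n_ref)
        expected_ref   = n_ref   * (freq_study + freq_ref) / (n_study + n_ref)
        ll = 0.0
        if freq_study:
            ll += freq_study * math.log(freq_study / expected_study)
        if freq_ref:
            ll += freq_ref * math.log(freq_ref / expected_ref)
        return 2 * ll

    # "company": 120 hits in a 12,000-word business corpus vs. 80 hits
    # in a 1,000,000-word reference corpus (invented numbers)
    print(round(keyness(120, 12_000, 80, 1_000_000), 1))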

2.5) Create your own corpus

Converting a Word document into a text file

WordSmith and most other corpus processing tools are designed to work on plain text files (also known as ASCII files).¹ MS Word documents have formatting information encoded in the text, so in order to use Word documents for text processing in WordSmith and other corpus software, we have to convert them into plain text files. Thus, if you want to analyse an MS Word text, you need to →open it and →save it, choosing the file type →plain text (“.txt”); afterwards you will be able to load it into WordSmith and work with it. Of course, you can also analyse your own written texts: just use MS Word, write your text and save it as “.txt”. The MS Word file cannot be used in WordSmith, but the new plain text file can.

¹ AntConc can also work with other formats. However, for simplicity's sake the files can be converted to plain text files here as well.

Scroll down and save it as “Nur Text” (“Plain Text” in the German version of MS Word)

Of course, you can also copy and paste a web page. As long as it is saved as “.txt”, WordSmith will be able to analyse it. In the example below, the website www.huxley.net was visited and the text of Brave New World was copied into MS Word.


Again, as a second step the file has to be converted into a plain text file. For this you have to click on →Speichern unter (Save As), →Dateityp (File Type), →Nur Text (Plain Text). The text is now ready to be loaded into a linguistic concordance program.

Due to the conversion to plain text, all pictures that might have been in the Word file are deleted and only the plain text is left. The result can be seen below.
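The same conversion can also be scripted. The sketch below assumes the third-party Python package python-docx is installed (pip install python-docx); the file names are placeholders.

    # A programmatic alternative to "Save As -> Plain Text": strip a Word
    # document down to its paragraph text in one step.
    from docx import Document

    doc = Document("brave_new_world.docx")          # placeholder file name
    plain = "\n".join(paragraph.text for paragraph in doc.paragraphs)

    with open("brave_new_world.txt", "w", encoding="utf-8") as out:
        out.write(plain)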


2.6) Useful definitions

Annotation: (linguistic) information, such as POS tags or syntactic parsing, that is added to a text/corpus. →To annotate: to provide a text with annotations

Colligation: collocation patterns based on syntactic groups rather than individual words. (Ex.: “I don’t know WHAT/WHERE/WHEN/WHY” = I don’t know + wh-word)

Collocation: a pair or group of words which tend to occur together. →To collocate: to appear together. →Collocates: words that appear together. (In the collocations 'apple tree', 'apple pie', and 'Adam's apple', 'apple' collocates with 'tree', 'pie', and 'Adam's'. They are collocates.)

To compile: to collect and put together (for example, texts for a corpus)

Concordance: a list of words (called keywords or node words), taken from a piece of authentic language (a corpus), displayed in the centre of the page and shown with parts of the contexts in which they occur. Usually printed as a KWIC display

Concordancer: a program that searches a corpus for a selected word or phrase and presents every instance of that word or phrase in the centre of the computer screen, with the words that come before and after it to the left and to the right

Context: here, usually, the words surrounding a hit

Corpus (pl. corpora or corpuses): a collection of texts, now usually in machine-readable form, compiled to be representative of a particular kind of language and provided with some kind of annotation

Encoding: annotation

Frame: sequences of (usually three) words in which the first and the last are fixed but the middle word is not. (Ex.: “a ... of”)


Hit: When your search string is found in the corpus, it is referred to as a hit or match

Key words: the words in a text which are unusually frequent

KWIC (KeyWord In Context): a form of concordance in which the hit is displayed with a certain amount of context, often presented with the hit in the centre of the page

Lemma: the set of different forms of a word, such as the inflected forms of a verb. Ex. 'sing', 'sang', 'sung' are one lemma, 'boy', 'boys' another

Lemmatisation: the process or result of grouping the word forms of a text under their lemmas

Mark-up: codes used to provide information about a text, such as POS tags, SGML codes, etc.

Match: when your search string is found in the corpus, it is referred to as a match or hit

Natural language: term used for human language, as opposed to artificial languages used in computer programming and formal logic, for example

Parsing: the process or result of making a syntactic analysis. →Parser: a tool (often an automatic or semi-automatic computer program) used for parsing

Parsed corpus: a corpus that has been syntactically analysed and provided with annotation representing the analysis

Part-of-speech (POS): word class, such as verb, noun, adjective. →Part-of-speech tagging: assigning part-of-speech tags to a text

Probe: searching for sets of words or expressions that cannot easily be called to mind otherwise. (Ex.: “something + adj + about + him/her”)


SGML: (Standard Generalized Mark-up Language) mark-up system used for electronic text

String: combination of letters/characters

Tag: a label associated with a word (or other unit) providing information about the word, or the process of assigning tags. See annotation. (Ex.: 'run' can be tagged as a noun (run_N) or verb (run_V).) →Tagging: the process or result of assigning tags

Thin: to remove certain hits, either automatically or manually. →Thinning: the process or result of removing certain hits, either by selecting the desired ones, selecting the ones to discard, or by selecting/discarding a set amount of hits

Token: individual word. Compare type

Type: word form. “I see a cat and a dog” contains seven tokens but only six types, since the type 'a' occurs twice (see the sketch after this list)

Word Form: a specific grammatical form of a word in the context of a sentence. (Ex.: “go” → “go”, “going”, “went”)

Word List: a word list is simply a list of all the words in a text, usually sorted by frequency
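The token/type distinction from the two entries above can be checked with a few lines of Python:

    import re

    # Tokens vs. types: the glossary's example sentence has seven tokens
    # but six types, because "a" occurs twice.
    sentence = "I see a cat and a dog"
    tokens = re.findall(r"\w+", sentence.lower())
    types = set(tokens)
    print(len(tokens), "tokens,", len(types), "types")
    print("type/token ratio:", round(len(types) / len(tokens), 2))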

2.7) Other Concordancers

Since corpus linguistics has been a growing field in recent years, more and more software for corpus analysis has been developed. WordSmith and MonoConc are just two examples of alternative concordancers that can be used. However, there are also other kinds of concordancers, e.g. online concordancers. These are concordancers which are accessed via the Internet and which work with prefabricated corpora. The Compleat Lexical Tutor (CLT) and the BNCweb are two that can easily be accessed. For the BNCweb a password is required, which can be requested from the English linguistics department.


2.7.1) The Compleat Lexical Tutor

In contrast to AntConc, the CLT is an online concordancer, which means that it can only be accessed via the Internet and does not need to be downloaded. It offers students many tasks which can help to improve their grammatical and lexical language skills. For our purposes, however, we will focus only on the concordance feature of the website. The screenshot below shows the concordance interface of the CLT. As can be seen, the user can type in a search word and look for concordances of the query. The CLT offers only a concordance function; word lists and keyword lists cannot be created. Furthermore, the user can only work with a limited choice of corpora which have been uploaded to the CLT website, such as the Brown corpus or parts of the BNC. Other standard corpora or corpora which have been compiled by the user cannot be analysed. This is why the CLT is primarily helpful as a reference corpus in order to check, for instance, a phrasal unit. If all English corpora on the website are consulted at the same time, they add up to four million words, which is sufficient for finding typical lexicogrammatical patterns.

Screenshot of the CLT concordance interface

2.7.2) The BNCweb

The BNCweb is an online concordancer which uses the BNC as its database. In order to work with the BNCweb, a licence fee has to be paid. At the university, however, access can be granted via the English language department. The screenshot below displays the interface of the BNCweb.


Screenshot of the BNCweb interface

In contrast to the CLT, the BNCweb interface allows for very specific searches. The user can opt to search either the entire corpus or only its written or spoken part. What is more, even more precise specifications can be made, such as limiting the age of the respondents in the spoken part of the corpus. The BNC has been compiled very carefully, and much information is provided in its annotation. As a result, the choice of texts to be analysed can be restricted according to certain criteria; in the spoken part, these include the age, origin, sex and social class of the respondents. Once the KWIC (key word in context) display is shown, the results can additionally be thinned or re-sorted. Due to this filter function, the BNCweb makes it possible to observe very specific subgroups of language users; for example, the language spoken by women aged 15 to 24 can be examined. At the same time, its huge size allows it to be used as a reference corpus as well.


3) The practical application of corpora in the classroom

In the following chapters I will give some practical examples of how corpora can be exploited in various fields of English courses and for different purposes.

3.1) Corpora in linguistics

In general, the discipline of English studies which springs to mind first when thinking about corpus work is linguistics. Corpora have been used first and foremost by linguists for research purposes, and they have often proved useful for their analyses. It therefore makes perfect sense that students in linguistics classes are familiarised with them, especially if they are required to conduct a linguistic analysis for a term paper or for their final thesis. Depending on the individual linguistics class, the value of corpora can vary to a certain extent. Still, there are some disciplines in linguistics which can benefit greatly from corpus findings, one of them being sociolinguistics. In sociolinguistics, corpora can effectively help to trace differences between individual speaker groups, for instance men and women, young and elderly people, or speakers from different social backgrounds. Concordancers like AntConc can be used to find these differences. The BNCweb may be used in addition, since it allows for the examination of the speech of very specific subgroups. I will now present several basic examples of corpus linguistic studies.

3.1.1) Examining football language

In the first example I compiled a corpus consisting of texts gathered from an online fan blog of the English football club Liverpool FC. The corpus had an overall length of roughly 12,000 words. The ICE-GB corpus was chosen as a reference corpus. As a first step, I created a keyword list of this special corpus. The results can be seen in Figure 1.


Figure 1. Keywords of the Liverpool blog corpus

The keyword list of the Liverpool blog corpus (cf. Figure 1) reveals certain interesting aspects. A closer examination of the first words in the list shows that they are not of any particular help. The words ranking highest in the list, "pm" (no. 1; the number indicates the rank of the word in the respective keyword list), "Says" (no. 2), "admin" (no. 3) and "Sep" (no. 4), all belong to the category of "meta-information", which has nothing to do with the content of the blog. For instance, every time someone leaves a comment in the Liverpool blog, the time of the comment with the attachment "pm" is displayed. This already yields an important insight for researchers: with certain corpora they have to check individual hits in order to ensure that the keywords are really of any use. The example demonstrates that special caution is recommended with corpora which have been single-handedly compiled. The next hits in the list give at least a basic notion of the most important concepts in the Liverpool blog corpus. It is not really surprising that words which occur very frequently in the context of football, like "club" (no. 6), "game" (no. 8), "players" (no. 9), "football" (no. 12) or "team" (no. 13), rank high in the list. Neither is it unexpected to find names like "Rafa" (no. 21), "Gerrard" (no. 22) or "Benitez" (no. 23) ranking high, these being the names of players and of the coach of the football team. However, there are also special terms which occur exclusively in the Liverpool blog corpus and not at all in the reference corpus, such as "kopite" (no. 42), "Mancs" (no. 92), "yank" (no. 102) or "gobshites" (no. 154). The question is: what do these words mean? Since standard corpora apparently cannot help us deduce the meaning of these words, we have to turn to other means, such as consulting a dictionary. However, even the consultation of a standard monolingual dictionary like the Oxford English Dictionary (OED) does not help us, because entries for these words cannot be found. The advantage of linguistic concordancers is that they allow whole concordance lines to be displayed, which might help users grasp the meaning of words they do not know. In this case, the context of these words showed that they must all be rather colloquial or vulgar words. In addition, the corpus also contains words which bear a special reference to the context of football, such as "rotation" (no. 211), which can be found in the lexical unit "rotation policy". The keyword list was intended to serve as a starting point for further investigations; the aim was to examine whether the language used in football blogs differs to a great extent from the language used by a typical native speaker of English. A glance at the keyword list can already sensitise us to differences in the language use of speakers from such a specific social subgroup. After the creation of the keyword list, it can make perfect sense to browse the corpus for any unusual patterns and check whether these patterns are recurring and hence a typical element of this special type of language.
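As a technical aside, the keyness ranking behind such a list can be sketched in a few lines. The sketch below uses the log-likelihood measure, which is widely used for keyword extraction; the file names and the simple tokenisation are illustrative assumptions, not AntConc's actual implementation:

    # Rank the words of a study corpus by log-likelihood keyness
    # against a reference corpus.
    import math
    import re
    from collections import Counter

    def freqs(path):
        with open(path, encoding="utf-8") as f:
            return Counter(re.findall(r"[a-z']+", f.read().lower()))

    study = freqs("liverpool_blog.txt")   # the self-compiled corpus
    ref = freqs("reference.txt")          # e.g. a standard corpus
    c, d = sum(study.values()), sum(ref.values())

    def keyness(word):
        # Compare observed frequencies with those expected if the word
        # were spread evenly over both corpora.
        a, b = study[word], ref[word]
        e1, e2 = c * (a + b) / (c + d), d * (a + b) / (c + d)
        ll = (a * math.log(a / e1) if a else 0) + (b * math.log(b / e2) if b else 0)
        return 2 * ll

    # Keywords: words relatively more frequent in the study corpus.
    keywords = [w for w in study if study[w] / c > ref[w] / d]
    for word in sorted(keywords, key=keyness, reverse=True)[:20]:
        print(word, round(keyness(word), 2))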

3.1.2) Examining Hip Hop language

I created a second corpus which differed considerably from the Liverpool blog corpus. For this corpus, the Internet was searched for English interviews with musicians from the genre of gangsta rap, and a "rapper" corpus was compiled. This corpus consisted of approximately 5,000 words. Again, I wanted to discover any unusual lexicogrammatical patterns that might be typical of this kind of language. The ICE-GB corpus was again chosen as the reference corpus, and a keyword list was created as a starting point for the analysis. As the keyword list indicates (cf. Figure 2), many high-ranking words had to do with the interviewed rappers' music; this is why "album" (no. 1), "rap" (no. 3) or "albums" (no. 8) occur so many times. In addition, similar to the Liverpool blog corpus, the names of the most prominent figures in these texts also rank very high, such as "Kurupt" (no. 9), "Snoop" (no. 10) or "Loco" (no. 12), which were either the interviewees' names or the names of famous rappers the interviewees talked about. However, words like "alot" (no. 2), "wit" (no. 11) or "tha" (no. 18) also rank high in the list. A closer examination of "alot", "wit" and "tha" shows that these are simply slang forms (slang here meaning the use of informal words and expressions) of the words "a lot", "with" and "the".

Figure 2. Keyword list of the Rapper text corpus

Words like "homies" (no. 33), "style" (no. 37) and "gangsta" (no. 55) also rank high and appear to be important words in the context of rap music. Another evident feature was the frequent use of swear words; "shit" (no. 2), "fuck" (no. 45) or "nigga" (no. 60) are just a few examples of this phenomenon. As with the Liverpool blog corpus, the keyword list of the rapper corpus sensitised me to the main differences between "rap language" and standard language. With its help I was able to survey specific words in order to find out whether their usage deviated from standard language, and indeed some of the words in the rapper corpus were used differently. For instance, the lexeme "cats" (no. 19) did not denote a small domesticated carnivorous mammal, but was rather used as a synonym for "guys", with a positive connotation. After a closer look at the concordances, I also realised that the grammatical structures in the rapper texts were at some points completely different from standard language and were not what learners of English would consider "grammatically correct" English. A sentence like the following, uttered by one of the interviewed rappers, reveals some major grammatical differences: "He's like everybody else need to step they game up and do some deals with these companies they dealing with and drop some good albums". The fact that "everybody" is used with a plural verb form ("need" instead of "needs") is very typical of the rapper corpus in general; singular word forms are used with plural markers and vice versa. Moreover, "they" replaces the standard possessive pronoun form "their". Another grammatical difference is that the present progressive form "are dealing" is incomplete, because "are" is simply omitted. This example shows that as little as one sentence can suffice to illustrate that the language of a specific social group can deviate greatly from standard language. An examination of the whole rapper corpus yields many more lexicogrammatical patterns and unusual words which support these observations. These two examples illustrate that everyone can easily compile their own corpus and that such a special corpus makes it possible to analyse a very specific subgroup of society. In the following, I want to show a practical application of the BNCweb concordancer, because its text restriction function in particular makes it very helpful for sociolinguistic purposes.

3.1.3) Observing gender differences in language use

In the next example I compared the frequencies of certain words in spoken texts uttered by men and by women. I looked at the words "beautiful", "lovely", "attractive", "shit", "bastard" and "hell". The idea behind this analysis was to examine whether the common stereotype is true that men use more swear words while women use more compliments and positive adjectives. Since the BNCweb allows you to distinguish between male and female speakers, each of these words could be checked individually and the results put together in a table (cf. Figure 3). It is important to note that when working with the BNCweb the absolute numbers are not what matters; because the lengths of the individual subcorpora can differ greatly, the results are displayed in "instances per million words".


Word         Female Speakers   Male Speakers
Beautiful    115.89            87.83
Lovely       476.15            388.88
Attractive   11.39             10.98
Shit         150.86            165.84
Bastard      34.16             60.67
Hell         234.21            236.91

Figure 3. BNCweb: The use of certain words by female and male speakers (instances per million words)

As the numbers indicate, women do indeed use the adjectives "beautiful", "lovely" and "attractive" more often than men do, while men use the swear words more frequently than women do. However, the differences are not as great as was intuitively expected prior to the investigation. Sociolinguistics is one of the most fruitful areas for corpus work: thanks to the Internet, the language of very specific social groups can be analysed. Researchers only have to find the relevant homepages and download texts as they see fit.
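The normalisation itself is a one-line formula, and a small example with invented figures shows why absolute counts are misleading: a larger absolute count can still amount to a lower relative frequency if the subcorpus is bigger.

    # Normalise absolute counts to instances per million words.
    def per_million(count, corpus_size):
        return count / corpus_size * 1_000_000

    # Invented figures: 250 hits in a 2.1-million-word subcorpus vs.
    # 300 hits in a 3.5-million-word subcorpus.
    print(per_million(250, 2_100_000))   # about 119 per million
    print(per_million(300, 3_500_000))   # about 86 per million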

3.1.4) Analysing a variety of a language

The fourth example is interesting for more advanced students of English. Its focus is on English phraseology; in general, lexicogrammatical patterns are crucial for learners who want to achieve idiomaticity in a language. For this example, I compiled a corpus of Indian English consisting of texts from the Indian newspaper Punjab Express (cf. www.punjabexpress.com). In order to obtain as many idiomatic local expressions as possible, I only included texts about local news in the corpus. I then compared this Indian newspaper corpus with a standard corpus (the ICE-GB corpus) and looked at lexicogrammatical differences between Indian English and British English. First of all, a keyword list was created in order to get some initial ideas. A glance at the keyword list revealed high-ranking words which do not exist in British English, such as "crore" (no. 30), "kanals" (no. 32) or "mediapersons" (no. 67). These words do not occur in British or American English because they are exclusively Indian English words. For instance, crore is a unit in the Indian number system signifying "ten million"; the Indian number system is apparently structured differently from Western ones, and since no separate term for "ten million" existed in British English, crore was taken over into Indian English from Sanskrit. However, I also wanted to find significant differences between British and Indian English with regard to lexical patterns and grammatical structures. Indeed, some observations could be made, as can be seen in Figure 4. For example, I found that a construction such as "Son of a city-based developer and film distributor Subash Nanda, Gagan..." (cf. Figure 4) was perfectly acceptable in Indian English, while it would not be acceptable in British English. This finding made me realise that British and Indian English deviate not only with regard to their lexis, but also concerning their grammatical structures.

Figure 4. Concordance line display of the Indian newspaper corpus

3.1.5) Checking the prosody of a word

With the next example I want to turn the focus to another important operation which can be performed with concordancers. Since phraseology is very important for language learners, it seems indispensable to demonstrate how to look at clusters and collocates. In this example, I want to check the semantic prosody of the verb "commit". For this purpose, I used the Freiburg-Brown Corpus of American English (FROWN), which consists of one million words of American English from the early 1990s. Using the 'cluster' feature in AntConc, I looked at the word "commit" with the intention of finding out whether it has positive or negative prosody. The result can be viewed in Figure 5. The cluster function offers a listing of all clusters which occur with the search query, ranked according to their frequency.


As can be seen from the screenshot, the word "commit" evidently has a very negative semantic prosody: it is used in contexts such as "commit suicide" (no. 2), "commit adultery" (no. 11), "commit crimes" (no. 15) or "commit terrorism" (no. 23). The findings show that native speakers use "commit" almost exclusively in negative contexts. Hence, learners should avoid using "commit" in positive contexts if they want to use the word idiomatically.

Figure 5. Clusters containing “commit”
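Conceptually, such a cluster search is an n-gram count anchored on the search word. The following sketch reproduces the idea for two-word clusters; the file name and the tokenisation are illustrative assumptions, not AntConc's actual implementation:

    # Count the two-word sequences beginning with a node word.
    import re
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())

    node = "commit"
    clusters = Counter(
        w1 + " " + w2 for w1, w2 in zip(tokens, tokens[1:]) if w1 == node
    )
    for cluster, freq in clusters.most_common(10):
        print(cluster, freq)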

3.1.6) Surveying the language of companies' mission statements

In this example, I had a look at the mission statements of American companies. I wanted to find out whether this particular type of language shows any special characteristics which distinguish it from other types of texts. For this purpose, I created a corpus of 28 mission statements copied from the Internet homepages of the respective firms; in total, the corpus contained 1,515 words. The FROWN corpus was taken as a reference corpus and a keyword list was created. The hypothesis was that companies select the words and phrases in their mission statements very carefully and consciously. The results showed that words such as "integrity", "responsibility", "employee", "team" or "committed" occurred frequently in the mission statement corpus. As a next step, the semantic prosody of the words "commit" and "integrity" was scrutinised in both corpora. In the vast majority of cases in the FROWN corpus, "commit" had negative prosody, while it had only positive prosody in the mission statement corpus. Similarly, "integrity" was used differently in the two corpora: while in FROWN it was used in a rather mechanical sense, as something which can easily be destroyed, in the mission statement corpus it was always a positively connoted quality of the company. In conclusion, the hypothesis proved to be correct; companies used words with ambiguous prosody only in positive contexts. This example illustrates very well the opportunities corpus-based work can offer: in this case, it is possible to attribute certain patterns to the language of the mission statement corpus which distinguish it from standard language.

3.1.7) Comparing phrasal constructions in Indian English, Hong Kong English and British English

In another example, I compared the use of the constructions "discuss (about)", "pay attention (on)", "pick (up)" and "return (back)" in the ICE-GB, the International Corpus of English - India (ICE-India) and the International Corpus of English - Hong Kong (ICE-HK). The hypothesis was that some phrasal constructions are used in Indian or Hong Kong English that would not be acceptable in British English. For example, I found that the construction "discuss about", although it occurred only a few times, was used in Indian and in Hong Kong English but not at all in British English. The same observation was made for "return back". The hypothesis thus proved to be correct. This project is a good example of a small-scale study on the differences between varieties of English.
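Methodologically, such a comparison boils down to counting a phrase in each corpus and normalising by corpus size. A sketch of the procedure, under the simplifying assumption that each corpus is available as a single plain text file (which the ICE corpora, as distributed, are not):

    # Compare the normalised frequency of a phrase across corpora.
    import re

    def per_million(phrase, path):
        with open(path, encoding="utf-8") as f:
            text = f.read().lower()
        size = len(re.findall(r"[a-z']+", text))
        hits = len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
        return hits / size * 1_000_000

    for corpus in ("ice_gb.txt", "ice_india.txt", "ice_hk.txt"):
        print(corpus, round(per_million("discuss about", corpus), 2))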

The diverse examples given so far illustrate that corpora can be used for a variety of purposes. Contrary to the belief of some critics, their application is not restricted to linguistics courses; they can also be applied in other areas such as didactics, ESL or literary studies, as will be explained in the following chapters.

3.1.8) Observing political texts in British newspapers

I compiled a "Northern Ireland corpus" consisting of political texts downloaded from various British Internet newspapers. The intention behind the analysis was to create a keyword list in order to identify the most important political topics in Northern Ireland. The results are displayed in Figure 6.

A glance at the keyword list reveals that violence and religious faith are the dominant political topics in Northern Ireland. Words such as "Protestant" (no. 6), "Catholic" (no. 7), "violence" (no. 12) or "paramilitary" (no. 19) indicate this, and a closer look at the concordance lines further reveals the strong interrelations between the two topics. Evidently, the names of certain people and political groupings, such as "IRA" (no. 3), "Ulster" (no. 9), "Trimble" (no. 13) or "Paisley" (no. 52), also rank very high, which suggests their importance in current Northern Irish politics. All in all, it can be concluded that even many years after the IRA stopped its activities, terrorism in connection with religious faith is still an important political issue in Northern Ireland.

Figure 6. Keyword list of the Northern Ireland corpus

3.1.9) Consulting the Compleat Lexical Tutor

Next I would like to show how the online concordancer CLT can be used to look for idiomatic constructions. As a few examples, I thought of typical EFL learner mistakes and checked them with the concordance function of the CLT. The following constructions were checked:

1. In spite / despite: which one needs the preposition "of"? → Only "in spite of"
2. Suffer: which preposition is correct, "of" or "from"? → Only "from"
3. Discuss: does it need the preposition "about"? → No

A typical concordance line display of the CLT can be seen in Figure 7.


Figure 7. Concordancing with the CLT: “suffer from”

While the programme provides 18 instances for the query "suffer from", no instances at all were found for "suffer of", as can be seen in Figure 8. This is a clear indication that "suffer from" is the correct construction, while "suffer of" cannot be used.

Figure 8. Concordancing with the CLT: “suffer of”

This example shows that the CLT can serve as a good supporting tool which can be consulted quickly. It is a valuable alternative to concordancers like AntConc or WordSmith.

3.1.10) Creating specific vocabulary lists

Corpora can not only be used to analyse a language, but also to enhance learners' vocabulary in a relatively easy way. In this example I will show how students can enhance their business vocabulary. First of all, I visited the homepage of the magazine The Economist (cf. www.economist.com) and downloaded the first five texts in the subsection "United States". Next, I converted them into a plain text file and created a keyword list, taking the ICE-GB as a reference corpus. After this was done, I looked at some rare words in the list; the file view function of AntConc allowed me to grasp their meaning. Finally, I created a vocabulary list: I saved the keyword list as a text file and used tables in MS Word to turn it into a vocabulary list in a very convenient way. Of course, the same kind of procedure is possible with other text types. For example, you may download texts from biology homepages to enlarge your biology vocabulary.
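This last step can also be scripted. The sketch below turns a saved keyword list into a tab-separated table that word processors can import; it assumes, purely for illustration, that each line of the saved list ends with the keyword itself, so check the actual export format of your concordancer:

    # Turn a saved keyword list into a two-column vocabulary table.
    with open("keywords.txt", encoding="utf-8") as f, \
         open("vocabulary.tsv", "w", encoding="utf-8") as out:
        out.write("Word\tMeaning / Notes\n")
        for line in f:
            if line.strip():
                # Assumption: the keyword is the last column of each line.
                out.write(line.split()[-1] + "\t\n")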

3.1.11) Checking your own writing style

In this example, I want to demonstrate how corpora can be employed for checking and improving your own writing style. For students it is of the utmost importance to be able to reflect on their own writing style and on the typical mistakes they make; otherwise they will not be able to improve their language skills. Linguistic concordancers can help here: you create a corpus which consists entirely of your own texts (essays, term papers, etc.) and compare it to a native speaker corpus with the help of the keyword list function. The keyword list will then expose the words you typically overuse and which you should use more sparingly in order to sound more fluent and idiomatic.
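The keyness computation sketched in section 3.1.1 works here unchanged. An even simpler alternative, shown below, flags words whose relative frequency in your own texts far exceeds that in a native-speaker corpus; the file names and the threshold factor of five are illustrative assumptions:

    # Flag potentially overused words by comparing relative frequencies.
    import re
    from collections import Counter

    def rel_freq(path):
        with open(path, encoding="utf-8") as f:
            counts = Counter(re.findall(r"[a-z']+", f.read().lower()))
        total = sum(counts.values())
        return {word: n / total for word, n in counts.items()}

    own = rel_freq("my_essays.txt")   # corpus of your own texts
    native = rel_freq("native.txt")   # native-speaker reference corpus

    overused = {w: own[w] / native[w] for w in own
                if w in native and own[w] > 5 * native[w]}
    for word, ratio in sorted(overused.items(), key=lambda kv: -kv[1])[:20]:
        print(word, "used", round(ratio, 1), "times as often as in the reference")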

3.2) Corpora in literary studies

Corpora still occupy only a marginal role in literary studies and are often treated with scepticism as to what degree they can be helpful for the interpretation of literary texts. With the next examples I will therefore demonstrate the benefits that can be derived from using corpora in literary studies. At the same time it must be pointed out that the use of corpora is a substitute neither for reading the book to be analysed nor for interpreting it. I will concentrate on two distinct ways of making use of AntConc: on the one hand, it may be used to examine an author's style; on the other hand, it can serve as a supporting tool for the analysis of literary texts. For instance, AntConc can be employed to survey concordance plots; a typical concordance plot can be seen in Figure 9.


3.2.1) The use of concordance plots

The principle of concordance plots is very simple: every time a certain word occurs in the text, a black bar is displayed at the corresponding point in the plot. While in principle any word can be plotted, the feature is particularly helpful for tracking characters' names in literary texts. In Figure 9, the upper screen depicts the occurrences of the name "Hamlet" in Shakespeare's drama Hamlet.

Figure 9. Concordance Plot of “Hamlet” and “Laertes”

The lower screen depicts the character "Laertes". As can be observed, "Hamlet" occurs considerably more often than "Laertes" does. This is already an indication that Hamlet is a major character of the drama, while Laertes plays a minor role. Interestingly, the name "Laertes" occurs almost exclusively in those parts of the text from which Hamlet is absent. If you want to make sense of this phenomenon, it is essential that you have read the text beforehand, since only then will such an analysis be of any value. Anyone who has read Hamlet will know that Hamlet is absent because at that point of the drama he is sent away to England with Rosencrantz and Guildenstern, while at the same time Laertes returns to Denmark from France. A comparison of the concordance plots shows that during Hamlet's absence Laertes becomes the main protagonist of the story. Only when Hamlet returns is the situation reversed. The concordance plot also sheds light on another interesting aspect: Hamlet and Laertes do not meet throughout most of the drama. They meet for the first time almost at the end, at the funeral of Ophelia, and only in the last scene do they appear together for a longer time, for the final duel. As this example has shown, concordance plots can be a valuable tool for the analysis of literary texts. Simple as it may seem, they display at which point(s) in a text a name appears; starting from that, the significance of characters and their development can be analysed. This can prove especially useful with works of a larger scope, in which it would be very time-consuming to look for every single appearance of a character.
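The principle is simple enough to reproduce: record the relative position of every hit and place a mark there on a fixed-width bar. A text-only sketch, with the file name and tokenisation as illustrative assumptions:

    # A minimal text-based concordance plot: one '|' per occurrence,
    # placed proportionally along the length of the text.
    import re

    with open("hamlet.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())

    def plot(word, width=60):
        bar = [" "] * width
        for i, token in enumerate(tokens):
            if token == word:
                bar[min(int(i / len(tokens) * width), width - 1)] = "|"
        print(word.rjust(10), "[" + "".join(bar) + "]")

    plot("hamlet")
    plot("laertes")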

3.2.2) An analysis of Shakespeare's King Lear

In the following example I examined William Shakespeare's drama King Lear. A keyword list was created as a starting point for further analysis. Nonetheless, I had to take a different approach from the other examples: I could not use a standard reference corpus, since none of the corpora at my disposal could serve for comparison with King Lear, which was written in the period of Early Modern English. In addition, standard corpora such as the BNC or the FLOB consist largely of non-fictional texts. Since the number of plays by other authors from Shakespeare's time at my disposal was very small, I chose to compare King Lear with the rest of Shakespeare's plays. Once I had created a keyword list, I searched for any unusual patterns. In this case, however, that was a rather difficult task, since I did not know in advance which patterns to search for. Thus, I thought about the key issues of the drama. I remembered that the drama uses many expressions from the semantic field of the five senses, such as "see", "hear" or "feel", as well as many words from the semantic field of nature. At this point, I tried to find some of these words with the concordance function, and indeed, the high frequency of words such as "storm" and "nature" was evident. Nevertheless, despite the great number of hits, I found it difficult to interpret these findings. This is presumably closely connected with the fact that Shakespeare's language is very different from contemporary language and thus difficult to analyse.

3.2.3) An analysis of Wilde's The Picture of Dorian Gray

The next example deals with Oscar Wilde's The Picture of Dorian Gray. The basic methods I employed in the analysis of this text were identical to those I used for King Lear: Oscar Wilde's entire works were taken as a reference corpus for the creation of the keyword list of The Picture of Dorian Gray. The results are given in Figure 10.

Figure 10. Keyword list of Oscar Wilde’s “The Picture of Dorian Gray”

A first glance at the list reveals that, as with King Lear, many words ranking high in the list are taken from the semantic field of the senses, such as "felt" (no. 25), "senses" (no. 31), "looked" (no. 38) or "glanced" (no. 42). It can also be seen that morality plays an important role in the work: I observed that "sins" (no. 66) ranked high in the list and conducted a cluster analysis of this query. Figure 11 shows these clusters.

Figure 11. Cluster analysis of “sins” in Oscar Wilde’s “The Picture of Dorian Gray”


The consultation of standard corpora such as the ICE-GB and the LOB discloses that the words "sin" and "sins" usually carry a very negative connotation and typically occur in combination with words like "deadly" or "cardinal". When looking at the cluster analysis, however, we can instantly see that "sins" is mostly positively connoted in the novel: constructions such as "beautiful sins" (no. 12), "great sins" (no. 14), "splendid sins" (no. 29) or "wilder sins" (no. 34) are instances of this phenomenon. Since standard corpora have clarified that "sins" normally has no positive connotation, the word's use in positive contexts in the Dorian Gray corpus clearly does not appear by chance; on the contrary, it is employed on purpose in The Picture of Dorian Gray and is therefore a stylistic device. Another finding which supports this hypothesis is that Oscar Wilde does not apply this device in his other works; it is exclusively used in The Picture of Dorian Gray.

3.2.4) An analysis of Sterne's A Sentimental Journey through France and Italy

In this example I looked at a piece of British travel literature from the 18th century. One of the most prominent texts of the genre is A Sentimental Journey through France and Italy by Laurence Sterne, which was chosen for the corpus analysis. Again facing the problem that no other literary texts from that period were available, I took Sterne's other works as a reference corpus. Sterne's novel was very successful after its release in 1768; it set off an avalanche insofar as it made travel writing the most popular genre of its time. However, this success had not been anticipated, since travel literature had already existed for a long time, with only moderate success. With the help of a corpus analysis I tried to find some reasons for this success: did it have something to do with the style of the author, or was it just the right book appearing at the right time? In order to find possible explanations, I created a keyword list and began to check some of the high-ranking words at random. In this process, the word "poor" (no. 35) caught my eye, and I checked the concordances for it. The results can be viewed in Figure 12.


Figure 12. Concordance line for “poor” in Laurence Sterne’s A Sentimental Journey through France and Italy

The concordance lines reveal that in Sterne's work "poor" was not used in the most common sense of having little money, as one might expect, but exclusively as an expression of sympathy for characters in the book. Once I realised this, I took a closer look at the descriptions of many characters in the work and found that in the course of the novel many words occur which have to do with feelings and the display of feelings. Thus, I noticed that the title of the book is apparently closely interrelated with its style. As a result, I am confident that at least part of the book's great appeal for its readers must have had to do with the fact that, unlike earlier travel literature, which had tried to be objective and untouched by feelings, it made abundant use of words from the semantic field of emotions. The journey was no longer described as objectively as possible, but rather as a subjective experience laced with the narrator's personal feelings. Thus, Sterne created a new genre, mingling the characteristics of travel writing and of romantic fiction. As this chapter has shown, it is possible to apply corpus work to literary studies as well. Even without the concrete formulation of hypotheses on my part, I was able to reflect on possible directions for my research and to draw conclusions from the findings. The examples enumerated here also show that copyright issues are not necessarily a major obstacle to using corpora in literary studies: all of the texts I worked with were freely available.


They could be downloaded very easily from the homepage of Project Gutenberg (cf. http://www.gutenberg.org/wiki/Main_Page), the first producer of free electronic books worldwide.

3.2.5) Further suggestions of what can be done with corpora

The use of corpora is not limited to certain areas of linguistic research or to certain hypotheses. You may just as well compile a corpus of song lyrics and compare it to a reference corpus; you may download texts from blogs and forums and analyse the subgenres of a language community; you may trace diachronic changes in language use by taking texts from different periods of time, and so on. Nonetheless, you should bear one important point in mind: if you want to conduct a case study for a term paper or even for your master's thesis, you need to formulate a hypothesis before you start working with corpora. In addition, it should be evident which tools and operations you want to use for your case study, and why you are using them. Your methods must be transparent, otherwise your results will not be convincing. It is only with the right methods that you will be able to effectively confirm or reject your hypothesis.

