Corpus Linguistics: Method, Analysis, Interpretation

What is corpus linguistics? Which software packages are available, and what can they do?
The corpus approach harnesses the power of computers, allowing analysts to produce machine-aided analyses of large bodies of language data - so-called corpora. Computers allow us to do this on a scale and with a depth that would typically defy analysis by hand and eye alone. In doing so, we gain unprecedented insights into the use and manipulation of language in society.
What is Corpus Linguistics? Corpus linguistics, broadly, is a collection of methods for studying language. It begins with collecting a large set of language data – a corpus - which is made usable by computers. Corpora (the plural of corpus) are usually so large that it would be impossible to analyze them by hand, so software packages (often called concordancers) are used in order to study them. It is also important that a corpus is built using data well matched to the research question it is built to investigate. To investigate language use in an academic context, for example, it would be appropriate to collect data from academic contexts such as academic journals or lectures. Collecting data from the sports pages of a tabloid newspaper would make much less sense.
Software: A number of software packages are available, with varying functionalities and price tags. Some pieces of software can be downloaded and used for free; others cost money or are available only online but have built-in reference corpora. This table gives an idea of the variety of software currently available:
Glossary Use this glossary as a handy reference when you come across any terminology on the course that you do not understand. Produced by: The ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, UK
Annotation Codes used within a corpus that add information about features such as grammatical category. Also refers to the process of adding such information to a corpus.
Balance A property of a corpus (or, more precisely, a sampling frame). A corpus is said to be balanced if the relative sizes of each of its subsections have been chosen with the aim of adequately representing the range of language that exists in the population of texts being sampled (see also: sample).
Colligation Co-occurrence between grammatical categories (e.g. verbs colligate with adverbs); the term can also mean a co-occurrence relationship between a word and a grammatical category.
Collocation A co-occurrence relationship between words or phrases; Words are said to collocate with one another if one is more likely to occur in the presence of the other than elsewhere.
Comparability Two corpora or sub-corpora are said to be comparable if their sampling frames are similar or identical.
Concordance A display of every instance of a specified word or other search term in a corpus, together with a given amount of preceding and following context for each result or ‘hit’
Concordancer A computer program that can produce a concordance from a specified text or corpus; Modern concordance software can also facilitate more advanced analyses
Corpus From the Latin for ‘body’ (plural corpora), a corpus is a body of language representative of a particular variety of language or genre which is collected and stored in electronic form for analysis using concordance software.
Corpus construction The process of designing a corpus, collecting texts, encoding the corpus, assembling and storing the metadata, marking up (see markup) the texts where necessary and possibly adding linguistic annotation.
Corpus-based An approach in which corpora are used to test pre-formed hypotheses or to exemplify existing linguistic theories. Can mean either: (a) any approach to language that uses corpus data and methods; or (b) an approach to linguistics that uses corpus methods but does not subscribe to corpus-driven principles.
Corpus-driven An inductive process where corpora are investigated from the bottom up and patterns found therein are used to explain linguistic regularities and exceptions of the language variety/genre exemplified by those corpora.
Diachronic Diachronic corpora sample (see sampling frame) texts across a span of time or from different periods in time in order to study the changes in the use of language over time. Compare: synchronic.
Encoding The process of representing the structure of a text using markup language and annotation
Frequency list A list of all the items of a given type in a corpus (e.g. all words, all nouns, all four-word sequences) together with a count of how often each occurs
Key word in context (KWIC) A way of displaying a node word or search term in relation to its context within a text; this usually means the node is displayed centrally in a table, with co-text displayed in columns to its left and right. Here, ‘key word’ simply means ‘search term’ and should be distinguished from a keyword in the statistical sense.
Keyword A word that is more frequent in a text or corpus under study than it is in some (larger) reference corpus. For a word to count as a keyword, the difference in its frequency between the corpora must be statistically significant (see: statistical significance).
Lemma A group of words related to the same base word differing only by inflection. For example, walked, walking, and walks are all part of the verb lemma WALK.
Lemmatisation A form of annotation where every token is labelled to indicate its lemma
Lexis The words and other meaningful units (such as idioms) in a language; the lexis or vocabulary of a language is usually viewed as being stored in a kind of mental dictionary, the lexicon.
Markup Codes inserted into a corpus file to indicate features of the original text other than the actual words of the text. In a written text, for example, markup might include paragraph breaks, omitted pictures, and other aspects of layout.
Markup language A system or standard for incorporating markup (and, sometimes, annotation and metadata) into a file of machine-readable text; the standard markup language today is XML.
Metadata The texts that make up a corpus are the data. Metadata is data about that data - it gives information about things such as the author, publication date, and title of a written text.
Monitor corpus A corpus that grows continually, with new texts being added over time so that the dataset continues to represent the most recent state of the language as well as earlier periods
Node In the study of collocation - and when looking at a key word in context (KWIC) - the node word is the word whose co-occurrence patterns are being studied.
Reference corpus A corpus which, rather than being representative of a particular language variety, attempts to represent the general nature of a language by using a sampling frame emphasising representativeness.
Representativeness A representative corpus is one sampled (see, sample) in such a way that it contains all the types of text, in the correct proportions, that are needed to make the contents of the corpus an accurate reflection of the whole of the language or variety of language that it samples (also see: balance).
Sample A single text, or extract of a text, collected for the purpose of adding it to a corpus. The word sample may also be used in its statistical sense by corpus linguists. In this latter sense, it means a group of cases taken from a population that will, hopefully, represent that population such that findings from the sample can be generalised to the population. In this sense, a corpus is itself a sample of language.
Sample corpus A corpus that aims for balance and representativeness within a specified sampling frame
Sampling frame A definition, or set of instructions, for the samples (see: sample) to be included in a corpus. A sampling frame specifies how samples are to be chosen from the population of text, what types of texts are to be chosen, the time they come from and other such features. The number and length of the samples may also be specified.
Significance test A mathematical procedure to determine the statistical significance of a result
Statistical significance A quantitative result is considered statistically significant if there is a low probability (usually lower than 5%) that the figures extracted from the data are simply the result of chance. A variety of statistical procedures can be used to test statistical significance.
Synchronic Relating to the study of language or languages as they exist at a particular moment in time, without reference to how they might change over time (compare: diachronic). A synchronic corpus contains texts drawn from a single period - typically the present or very recent past.
Tagging An informal term for annotation, especially forms of annotation that assign an analysis to every word in a corpus (such as part-of-speech or semantic tagging).
Text As a count noun: a text is any artefact containing language usage - typically a written document or a recorded and/or transcribed spoken text. As a non-count noun: collected discourse, on any scale.
Token Any single, particular instance of an individual word in a text or corpus. Compare: lemma, type.
Type (a) A single particular wordform. Any difference of form (e.g. spelling) makes a word a different type. All tokens consisting of the same characters are considered to be examples of the same type. (b) Can also be used when discussing text types.
Type-token ratio A measure of vocabulary diversity in a corpus, equal to the total number of types divided by the total number of tokens; the closer the ratio is to 1 (or 100%), the more varied the vocabulary is. This statistic is not comparable between corpora of different sizes.
XML A markup language which is the contemporary standard for use in corpora as well as for a range of data-transmission purposes on the Internet. In XML, tags are enclosed in angle brackets, as in <tag> and </tag>.
Part One: An Introduction to Corpus Linguistics
Introduction to this part's activities
Warm up activity
Part 1: why use a corpus?
Part 2: annotation and mark-up
Part 3: types of corpora
Part 4: Frequency Data, Concordances and Collocation
Part 5: Corpora and Language Teaching
Test your Knowledge (Quiz)
Why do I need special software?
Brown and LOB
Downloads
Introduction to AntConc
AntConc - concordancing
AntConc - using advanced search to explore the Brown corpus
AntConc - creating and using a wordlist
Practical activity - a question
Further Reading
Discussion question for Part 1
Introduction to this part’s activities In this part, we begin by looking at the background to corpus linguistics – the types of things you can do using a corpus and some of the technical details of how corpora are built. In the ‘how to’ section of this part, we introduce you to the concordance package available free with this course – AntConc, authored by Laurence Anthony of Waseda University. Take notes as you go and use the ‘pop quiz’ to test your comprehension. Undertake the readings for the part and contribute to the discussion.
Warm up activity A quick activity to get started Think of something you would like to find out about language. As you attend the lecture, reflect back on your own interests – what types of corpora might help you and what type of design issues would you have to consider if you were to put together your own corpus to investigate language as you would wish?
Part 1: why use a corpus? The lecturer gives a brief review of why you might want to use a corpus and decisions to make when building a corpus.
Please see: Week 1 Lectures (part 1) Week 1 Slides (Part 1) Week 1 Videos (Part 1)
Part 2: annotation and mark-up The Lecturer gives a brief overview of how corpus texts may be enriched with additional information to ease analysis. Note that this type of additional information may be called ‘mark up’, ‘annotation’, or ‘tagging’. All three terms are near synonyms. Annotation usually refers to linguistic information encoded in a corpus - however, the encoding is achieved using a mark-up language. Similarly, the annotation itself is usually undertaken by putting so-called tags - short codes indicating some linguistic feature - into a text. Hence, while the terms can be separated, they can also be used interchangeably! One final note - the slash in an XML closing tag is a forward slash, not a backslash.
Please see: Week 1 Lectures (part 2) Week 1 Slides (Part 2) Week 1 Videos (Part 2)
Part 3: Types of Corpora The Lecturer looks at a range of different types of corpora.
Please see: Week 1 Lectures (part 3) Week 1 Slides (Part 3) Week 1 Videos (Part 3)
Part 4: Frequency Data, Concordances and Collocation The Lecturer explores the value of frequency data in corpus linguistics and takes a first look at a key concept in corpus linguistics - collocation. This lecture mentions the idea of normalised frequencies per million. What are these? Imagine you have two corpora, one of two million words and another of three million words. You look in each for the word ‘dalek’ and find 20 examples in the first and 30 examples in the second. That does not mean that the word is more frequent in the second corpus - remember it is bigger. One way of making this issue apparent, and making the numbers more comparable, is to normalise the frequencies. To normalise per million, you are in essence asking the question ‘if my corpus were only one million words, how many examples would I expect to find?’. Our first corpus is two million words - so to normalise the frequency of ‘dalek’ to one million words, we would divide by two, giving us 20/2=10. The second corpus is three times as large as one million, so to normalise per million we would divide the results from the second corpus by three, giving 30/3=10. This shows clearly that we have no reason to claim that the word ‘dalek’ is more frequent in one of the corpora than the other.
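The arithmetic above generalises to a one-line formula: divide the raw count by the corpus size and multiply by one million. A minimal sketch in Python (the function name is our own, for illustration):

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count / corpus_size * 1_000_000

# The 'dalek' example from the text: 20 hits in 2 million words,
# 30 hits in 3 million words - the same normalised frequency.
print(per_million(20, 2_000_000))  # 10.0
print(per_million(30, 3_000_000))  # 10.0
```

Dividing by two or three, as in the worked example, is just this formula with the multiplication by one million folded in.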
Please see: Week 1 Lectures (part 4) Week 1 Slides (Part 4) Week 1 Videos (Part 4)
Part 5: Corpora and Language Teaching The Lecturer takes a brief look at a major application area for corpus linguistics - language teaching. The video concludes by considering some of the limitations of corpus linguistics. After the video, don’t forget to update your journal! Keep a record of what you are learning. You will find it really helps as the course proceeds if you keep clear, structured notes of what you have learnt.
Test your Knowledge (Quiz)

What is a corpus?
- A theory of language
- A collection of texts stored on a computer
- An electronic database similar to a dictionary
- Any large collection of words such as a collection of books, newspapers or magazines
What is the main reason for using corpora?
- Other methods of language analysis are not reliable
- Computers can confirm our intuitions about language
- Computers can help us discover interesting patterns in language which would be difficult to spot otherwise
- With corpora we can answer all research questions about language

What is corpus annotation?
- Adding an extra layer of information to the text to allow for more sophisticated searches
- Separating text into sentences
- Manual coding of text for parts of speech
- Adding critical comments to a text
What is a specialised corpus?
- A corpus that is used for historical language investigations
- A corpus that is composed of a large variety of genres
- A corpus that is used by language specialists
- A corpus that focuses on e.g. one type of genre, one period, one place etc.
Which of these is NOT a type of corpus?
- Multilingual corpus
- Learner corpus
- Diachronic corpus
- Observer corpus
What is the BNC?
- A large general corpus of British English
- A corpus of different genres of English writing
- A large spoken corpus of British English
- A specialised corpus representing the language of newspapers
Which of these statements is NOT true about a monitor corpus?
- It is frequently updated
- The Bank of English is an example of a monitor corpus
- The BNC is an example of a monitor corpus
- It is used to monitor rapid change in language

What is a concordance?
- Information about word frequencies normalised per million words
- Listing of examples of a word searched in a corpus with some context on the right and some context on the left
- An alphabetical list of words that appear in a text
- A list of words and their frequencies that can be used for identifying important words in a text
What is collocation?
- The tendency of speakers to talk over each other
- The tendency of words to co-occur with one another
- The tendency of words to appear in unique, different contexts each time
- The tendency of sentences to create meaning

What is a frequency distribution in a corpus?
- Information about how frequent a word is in a corpus
- Information about the frequency of use of a term across a number of different texts, corpus sections, speakers etc.
- Information about how frequent a word is per million words
- Sociolinguistic information about the gender of the speakers that are represented in a corpus
Why do I need special software?
Some of the things you can do with a program like AntConc will be familiar to you from word processing. For example, you can search for a word in a word processor and see the context around each use of that word. So why bother with corpus browsing software? As you will discover, software like AntConc allows you to do so much more than a word processor does. Even for something as simple as searching for a word, it presents the results in a format that is more suitable for those interested in studying language; the standard concordance view of one example per line with left and right context allows you to rapidly browse data looking for patterns of usage. Yet beyond this the software allows you to do a number of things that no word processor does, such as undertaking keyword analyses and looking for collocations. By the time you have finished learning to use AntConc, you will have developed a full appreciation of the need to use such software to study language in use.
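To see why the one-example-per-line concordance view is so much handier than a word processor's search box, here is a minimal sketch of a KWIC display in Python. This is illustrative only - the function, its parameters, and its formatting are our own invention, not part of AntConc:

```python
import re

def kwic(text, node, width=30):
    """Return one line per hit: left context, the node word, right context."""
    lines = []
    # \b ensures we match whole words only; the search ignores case.
    for m in re.finditer(r'\b%s\b' % re.escape(node), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f'{left:>{width}}  {m.group():^10}  {right:<{width}}')
    return lines

sample = "I came, I saw, I concordanced."
for line in kwic(sample, "I", width=12):
    print(line)
```

Even this toy version shows the key idea: aligning every hit in a central column lets the eye scan the left and right co-text for recurring patterns.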
Brown and LOB These corpora are sometimes referred to as ‘snapshot’ corpora - their design is such that they try to represent a broad range of genres of published, professionally authored, English. Their goal is to capture the language at one moment in time, hence the term ‘snapshot’. Of course, as with any snapshot there are things you see and things you do not see. So, in this case, we are looking at professionally authored written English - not speech and not writing of a more informal variety. We are also only looking at certain genres. As with any snapshot, it was taken at a certain point of time in a certain place - Brown is America in the early 1960s, LOB is the UK in the early 1960s. Such corpora are often used to compare and contrast varieties of a language - in this case two varieties of English. They can also be looked at on their own to explore either variety of English in its own right. The Brown corpus is so named because it was developed at Brown University in the US. LOB is an acronym, standing for Lancaster-Oslo-Bergen, the three universities that collaborated to build that corpus. Back to the snapshot metaphor! The two corpora can be compared because they are composed in the same way - the subject is the same, if you like. They look at broadly the same genres. Those genres are represented by similar numbers of similarly sized chunks of data. Also, of course, the data was gathered in roughly the same time period. The genres covered in the two corpora are outlined below. Note the letter code for each genre - that is important, as it shows you which genre is associated with which file in the corpus. Following the letter code is a description of the type of data in the category, followed by two numbers in parentheses - the first is the number of chunks of data in that category in Brown, the second is the number of chunks of data in that category in LOB. There are five hundred chunks of data in each corpus.
Each chunk is approximately 2,000 words in size, giving a rough overall corpus size of 1,000,000 words each.
A Press: reportage (44, 44)
B Press: editorial (27, 27)
C Press: reviews (17, 17)
D Religion (17, 17)
E Skills, trades and hobbies (36, 38)
F Popular lore (48, 44)
G Belles lettres, biography, essays (75, 77)
H Miscellaneous (documents, reports, etc.) (30, 30)
J Learned and scientific writings (80, 80)
K General fiction (29, 29)
L Mystery and detective fiction (24, 24)
M Science fiction (6, 6)
N Adventure and western fiction (29, 29)
P Romance and love story (29, 29)
R Humour (9, 9)
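As a quick sanity check, the genre counts listed above can be summed to confirm that each corpus does indeed contain 500 chunks. A small Python sketch:

```python
# Genre chunk counts (Brown, LOB) as listed above, keyed by letter code.
genres = {
    "A": (44, 44), "B": (27, 27), "C": (17, 17), "D": (17, 17),
    "E": (36, 38), "F": (48, 44), "G": (75, 77), "H": (30, 30),
    "J": (80, 80), "K": (29, 29), "L": (24, 24), "M": (6, 6),
    "N": (29, 29), "P": (29, 29), "R": (9, 9),
}
brown_total = sum(b for b, _ in genres.values())
lob_total = sum(l for _, l in genres.values())
print(brown_total, lob_total)  # 500 500
```

Note that the two corpora differ slightly in three categories (E, F and G) while still totalling 500 chunks each - balance is about matching the overall sampling frame, not identical cell counts.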
Downloads
(The instructor will provide students with the different software packages and corpora) Instructions on how to download AntConc and the Brown and LOB corpora for analysis How to download AntConc The latest versions (3.4.3w, 3.4.3m, 3.4.3u) of AntConc are available for download from Laurence Anthony’s website at: http://www.antlab.sci.waseda.ac.jp/software.html Choose the version you want to run (i.e. for Windows, Mac or Linux) and click the link for version 3.4.3 If you are using a Windows computer, you will download a single executable (.exe) file. Put this on your desktop or in some other area that is easy for you to access. Double click to start. If you are using a Linux computer, you will download a tar.gz folder that you need to decompress first. Inside the folder, you will find the AntConc executable file, an icon, and a simple setup guide. Set the permissions of the executable file and double click to start. If you are using a Macintosh computer, you will download a zip file that you need to unzip first. Put the unzipped AntConc application on your desktop or in some other area which is easy for you to access. Double click to start. (At this point, you may get one or two security warnings. AntConc is completely virus free, so you can ignore these warnings or, if necessary, disable them via the System Preferences.)
How to download Brown and LOB corpora Important Note
The Brown and LOB corpora are only made available to learners of the FutureLearn course Corpus Linguistics: Method, Analysis and Interpretation. They should not be re-distributed or re-published. The LOB corpus is made available to you by ICAME. Click this link to download a zip file containing the two corpora. To use the corpora, first, unzip the file (see below), and then drag the two folders inside (“brown_corpus_untagged” and “lob_corpus_untagged”) to a convenient place on your computer. We suggest you place them in a new folder called “corpora”. You can then delete the original zip file if you want. If you are using a Windows computer, you can unzip the file by right clicking on the file name and selecting “Extract All”. The unzipped file will open in a new window where you can see the two corpora. If you are using a Macintosh computer, you can unzip the file by simply double-clicking on it. You can then open the unzipped file and see the two corpora inside. If you are using a Linux computer, unzip the file using your preferred zip program. On most systems you can simply double click the file and then move the two corpora inside to a convenient place. If you are experiencing problems downloading or have other technical issues please post a question on this page. If anyone has resolved issues, please feel free to post your solutions.
Introduction to AntConc Part one of an introduction to the AntConc program. In this video Laurence Anthony tells you how to download and install AntConc, how to load corpus files into the program and introduces some of the first steps you can take in analysing corpus data. This includes showing you how to build a wordlist from a corpus. As part of this, you will hear the terms type and token. A token is any single running word in the corpus. A type is a unique word form; the number of types is the number of distinct word forms present in a corpus. Imagine your corpus is the sentence “I came, I saw, I concordanced”. This sentence contains six running words - hence there are six tokens in the sentence. However, there are only four unique word forms in the corpus - the word ‘I’ occurs three times. So the types in the corpus are ‘I’, ‘came’, ‘saw’ and ‘concordanced’. Thus the sentence has six tokens and four types. Note that we can, of course, quibble about the definition of a word! Consider the word ‘gonna’ - some may argue this is two words, others one.
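The type/token counting described above can be reproduced in a few lines of Python. This is a simplified sketch using a naive tokeniser, not AntConc's actual tokenisation rules:

```python
from collections import Counter
import re

sentence = "I came, I saw, I concordanced"
# Tokenise naively on runs of word characters, folding case so that
# 'I' and 'i' count as the same type. (A real tokeniser would have to
# decide about cases like 'gonna', as the text notes.)
tokens = re.findall(r"\w+", sentence.lower())

print(len(tokens))       # 6 tokens
print(len(set(tokens)))  # 4 types
print(Counter(tokens))   # shows that 'i' occurs three times
```

The token count is the length of the list; the type count is the size of the set of distinct forms - exactly the six-tokens, four-types result worked through above.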
Please see: AntConc Videos (1) AntConc Transcript (1)
AntConc - concordancing
Laurence Anthony looks at some of the basic features of the AntConc concordance tool. Topics covered include how to load a corpus, how to search for words in a corpus, how to order the results of a search and how to search for parts of words.
Please see: AntConc Videos (2) AntConc Transcript (2)
AntConc - using advanced search to explore the Brown corpus Laurence Anthony looks at some of the advanced features of the AntConc program. Laurence works with a subset of the Brown corpus, demonstrating the functions of the concordance window in AntConc, including the use of the advanced search box.
Please see: AntConc Videos (3) AntConc Transcript (3)
AntConc - creating and using a wordlist Laurence Anthony shows you how to build a frequency wordlist from a corpus. In addition, he covers some related issues such as sorting the list and searching it. Download the lemma list (Ask the instructor)
Please see: AntConc Videos (4) AntConc Transcript (4)
Practical activity - a question Take the LOB corpus and build a word list. Look at the top thirty words. How would you characterise these words? Do the same with the Brown corpus. Is it similar? Are there any differences between LOB and Brown? Feel free to concordance the words to inform your analysis. If you have the time, do the same with the subsections of LOB and Brown. Might wordlists help to determine genre?
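If you are comfortable with a little programming, the wordlist step of this activity can also be sketched in Python. This assumes the corpora are stored as plain .txt files in a folder, as in the downloads above; the function name and the folder path in the comment are illustrative, and AntConc remains the recommended tool for the activity itself:

```python
from collections import Counter
from pathlib import Path
import re

def wordlist(corpus_dir):
    """Build a word frequency list from every .txt file in a folder,
    much as AntConc's Word List tool does (with a naive tokeniser)."""
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Lower-case words, optionally with an internal apostrophe.
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return counts

# Hypothetical path - point this at wherever you unzipped the corpora:
# for word, freq in wordlist("corpora/lob_corpus_untagged").most_common(30):
#     print(f"{freq:7d}  {word}")
```

Whichever tool you use, you should find the top of the list dominated by high-frequency grammatical words, which is part of what the activity asks you to reflect on.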
Further Reading: Our readings this week come to us courtesy of Edinburgh University Press and Routledge
Our first reading is taken from: (Week 1 PDF 1) McEnery, T. and Wilson, A. (2001) Corpus Linguistics, Edinburgh University Press, Edinburgh. It is chapter one of this book. It will help you broaden your understanding of the background to corpus linguistics and will place in historical context the move away from, and return to, corpus data in linguistics. The second reading is chapter one of: (Week 1 PDF 2) Garside, R., Leech, G. and McEnery, T. (1997) Corpus Annotation, Longman, Harlow. This book will be of great assistance to you throughout this course. Each time you hear or see a type of annotation discussed, you should be able to use this book as a useful reference guide to find out what that type of annotation is and how it is undertaken. While published in 1997, this book is still a good reference guide. For this week, read chapter 1 of the book - Leech’s outline of the principles of corpus annotation is as relevant today as it was when it was written.
Discussion question for Part 1 When you have completed the lecture and the associated readings, consider and discuss the following statement: “Noam Chomsky is one of the most influential figures in corpus linguistics. His ideas have shaped corpus linguistics while also, paradoxically, seeking to deny its value”. Given what you have read, discuss this deliberately provocative statement!
Reflect back on the warm up activity and your readings this week. Think about what you would like to use corpora for and consider the types of corpora you would need to use. Discuss the design aspects of your proposed work. For example, what type of corpus would you have to use? How large do you think it would have to be? Would annotation help you and if so what sort? Discuss these and any other questions related to your proposed use of corpus data.