Tuesday, 27 December 2011

Corpus Linguistics Theory

Corpus linguistics has been variously defined. According to McEnery, Xiao and Tono (2005: 7-8), “Corpus linguistics is a whole system of methods and principles of how to apply corpora in language studies and teaching/learning, it certainly has a theoretical status. Yet theoretical status is not theory in itself”. They therefore claimed that corpus linguistics is a methodology. Corpus linguistics is also defined as a methodology in McEnery and Wilson (1996) and Meyer (2002), and as “an approach or a methodology for studying language use” in Bowker and Pearson (2002: 9). Teubert (2005: 4), by contrast, asserted that “Corpus linguistics is not in itself a method: many different methods are used in processing and analysing corpus data. It is rather an insistence on working only with real language data taken from the discourse in a principled way and compiled into a corpus”.
      Biber, Conrad and Reppen (1998) said that the study of language can be divided into two main areas: studies of structure and studies of language use. From a language-use perspective, researchers investigate how speakers and writers exploit the resources of their language, rather than looking at what is theoretically possible in a language. In other words, researchers study the actual language used in naturally occurring texts. The goal of corpus-based investigations is not simply to report quantitative findings, but to explore the importance of these findings for learning about the patterns of language use. That is, it is essential to include qualitative, functional interpretations of quantitative patterns.
      To conclude, Sinclair (2005: 10) offered a short and reasonable definition of a corpus: “A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research”.
There are three types of theoretical framework in corpus linguistics: top-down, bottom-up and cyclic. Each implies a distinct methodology and determines the status researchers place on both annotation and corpus data (Wallis, 2007).
   
1. Top-down corpus annotation (Wallis, 2007).
2. Bottom-up corpus annotation (Wallis, 2007).

In the top-down view, knowledge is in the scheme: annotation simply applies the overall theoretical principles encoded in the scheme to specific examples in the text. In the bottom-up view, knowledge is in the text: its proponents argue that those who select their facts to fit the theory are ignoring linguistic evidence, so they work upwards from their data. On this view, a 'true' scientific approach is one that expresses authentic linguistic performance. In the cyclic view of corpus annotation, knowledge is in both the text and the scheme.
                                          
       Those who hold the cyclic point of view accept both that new observations generalize hypotheses or focus theory, and that theory is needed to interpret and classify observations. Moreover, building the corpus may involve two cycles in order to achieve a balanced text selection: one for identifying the text to annotate and another for refining the scheme and knowledge base.
According to Wallis and Nelson (2001), there are three stages of corpus linguistics, which together make up the “3A model” perspective: Annotation, Abstraction and Analysis. Annotation consists of the application of a scheme to texts, and may include structural markup, part-of-speech tagging, parsing and other representations. Abstraction, by contrast, is the process of selecting a research topic and mapping operationally defined terms in the scheme onto terms in a theoretically motivated model or dataset. Analysis consists of probing, manipulating, amending, interpreting and generalizing from the statistical data. It might include statistical evaluations, optimization of rule-bases or knowledge discovery methods. Each process is cyclic, and each higher level depends on the lower ones (Wallis, 2007). Abstraction is therefore based on annotation, and analysis on abstraction.







Figure 5. The 3A model of Corpus Linguistics (Wallis, 2007).
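
To make the three stages a little more concrete, here is a minimal Python sketch of how they might chain together. It is only an illustration under stated assumptions, not the method of Wallis and Nelson: the tiny lexicon `TOY_LEXICON`, the function names `annotate`, `abstract` and `analyse`, and the example sentence are all hypothetical, and a real project would use a proper tagger or parser and proper statistical tests.

```python
from collections import Counter

# Hypothetical toy scheme: a tiny word-to-tag lexicon standing in for a real tagger.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN",
    "barks": "VERB", "sleeps": "VERB",
}


def annotate(text):
    """Annotation: apply the scheme to the text, giving (token, tag) pairs."""
    tokens = text.lower().split()
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]


def abstract(annotated, tag):
    """Abstraction: map a term in the scheme onto a research variable.

    The assumed research question here is simply which words carry `tag`.
    """
    return [tok for tok, t in annotated if t == tag]


def analyse(cases):
    """Analysis: summarise the abstracted dataset as relative frequencies."""
    counts = Counter(cases)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Each stage consumes the output of the one below it.
corpus = "the dog barks a cat sleeps the dog sleeps"
annotated = annotate(corpus)
nouns = abstract(annotated, "NOUN")
result = analyse(nouns)
print(result)
```

Because each function takes the previous function's output, changing the scheme (the lexicon) changes what can be abstracted, and that in turn changes what can be analysed. This is the dependency of higher levels on lower ones described above.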


References:

Biber, D., S. Conrad and R. Reppen. 1998. Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.

Bowker, Lynne and Jennifer Pearson. 2002. Working with specialized language: A practical guide to using corpora.  London: Routledge.

McEnery, Tony and Andrew Wilson. 1996. Corpus linguistics. Edinburgh: Edinburgh University Press.

McEnery, Tony, Richard Z. Xiao and Yukio Tono. 2005. Corpus-based language studies: An advanced resource book. London: Routledge.

Meyer, Charles F. 2002. English corpus linguistics: An introduction. Cambridge: Cambridge University Press.

Sinclair, J. 2005. Corpus and text: Basic principles. In M. Wynne (ed.), Developing linguistic corpora: A guide to good practice, 1–16. Oxford: Oxbow Books. http://ahds.ac.uk/linguistic-corpora/ (retrieved 26 September 2011).

Teubert, Wolfgang. 2005. My version of corpus linguistics. International Journal of Corpus Linguistics 10(1): 1–13.

Wallis, S. and G. Nelson. 2001. Knowledge discovery in grammatically analysed corpora. Data Mining and Knowledge Discovery 5: 307–340.

Wallis, S. 2007. Annotation, retrieval and experimentation. In A. Meurman-Solin and A. A. Nurmi (eds.), Annotating variation and change. Helsinki: Varieng, University of Helsinki. e-Published. http://www.helsinki.fi/varieng/journal/volumes/01/wallis (retrieved 16 October 2011).
