Corpus Queries
Development of an effective grammatical query methodology in the context of a parsed corpus
Ref: R 000 22 2598
Institution: University College London
Department: Department of English (Survey of English Usage)
Investigator: Sean Wallis
Period: 1 March 1998 to 31 January 1999 (leave of absence in November 1998)
Original aims and objectives
In recent years, corpus linguistics has developed dramatically, due to increased computing power and improvements in annotation software. This has precipitated a growth in the scale and complexity of corpora, including the new grammatically annotated ICE-GB corpus. Text corpora have been used both to improve software tools, such as grammatical parsers, and to improve our understanding of language.
The aim of the research is to develop a linguistically plausible and transparent method of forming queries for grammatical corpora.
The proposal is to use fragments of grammatical trees as the main representation for queries. These "fuzzy tree fragments" appeal because of the obvious parallel with familiar grammatical structure. The difference is that a query must capture both what is known and what is unknown: some components and relations may be omitted or "fuzzy". Developing this notion of "fuzziness" is a major part of the research.
Complex queries may then be constructed by combining these tree fragments with sociolinguistic variables using a logical language.
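The idea of a partially specified tree query combined with a sociolinguistic condition can be illustrated with a minimal Python sketch. It is not the ICECUP implementation: the names Node, Fragment, matches_at, matches_anywhere and query are hypothetical, as are the node labels and the "speaker_gender" variable. The sketch assumes one simple reading of "fuzziness": an unspecified label matches anything, and the children of a fragment must appear in order among a node's children, with gaps allowed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """A node in a parsed corpus tree (labels are illustrative)."""
    function: str
    category: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Fragment:
    """A query node; None means 'unspecified' and matches any label."""
    function: Optional[str] = None
    category: Optional[str] = None
    children: List["Fragment"] = field(default_factory=list)


def matches_at(frag: Fragment, node: Node) -> bool:
    """True if the fragment matches with its root placed on this node."""
    if frag.function is not None and frag.function != node.function:
        return False
    if frag.category is not None and frag.category != node.category:
        return False
    return _match_children(frag.children, node.children)


def _match_children(frags: List[Fragment], nodes: List[Node]) -> bool:
    """Fragment children must match tree children in order; gaps are allowed."""
    if not frags:
        return True
    for i, node in enumerate(nodes):
        if matches_at(frags[0], node) and _match_children(frags[1:], nodes[i + 1:]):
            return True
    return False


def matches_anywhere(frag: Fragment, node: Node) -> bool:
    """True if the fragment matches at this node or at any descendant."""
    return matches_at(frag, node) or any(
        matches_anywhere(frag, child) for child in node.children
    )


def query(texts: List[dict], frag: Fragment,
          where: Callable[[Dict[str, str]], bool]) -> List[dict]:
    """Logical AND of a tree fragment and a condition on text metadata."""
    return [t for t in texts
            if where(t["meta"]) and matches_anywhere(frag, t["tree"])]


# A clause with a subject NP, a verb phrase and a direct object NP.
clause = Node("PU", "CL", [Node("SU", "NP"), Node("VB", "VP"), Node("OD", "NP")])

# Query: a clause containing a subject (category unknown) followed,
# not necessarily immediately, by a direct object NP.
frag = Fragment(category="CL", children=[
    Fragment(function="SU"),
    Fragment(function="OD", category="NP"),
])

texts = [
    {"meta": {"speaker_gender": "f"}, "tree": clause},
    {"meta": {"speaker_gender": "m"}, "tree": clause},
]
hits = query(texts, frag, lambda meta: meta["speaker_gender"] == "f")
```

Here the fragment leaves the subject's category and the intervening verb phrase unspecified, which is the sense in which part of the query is "unknown". The where argument stands in for the logical combination with sociolinguistic variables.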
This project will run concurrently with the first release of the ICE-GB corpus, and an early prototype of the system will be provided at this point. Feedback from end users will be used to aid further development.
Comment
Although this project was very modest in duration and scope, the results proved to be extremely important and influential. The Corpus Query project permitted the development of Fuzzy Tree Fragments and ICECUP 3.0. The software was indeed published alongside ICE-GB Release 1 in 1998, and has continued to improve ever since.
See also
Research Results
Fuzzy Tree Fragments
ICECUP software
This page last modified 14 May, 2020 by Survey Web Administrator.