Statistics in Corpus Linguistics Research
A New Approach
ISBN 9781138589384
This webpage aggregates resources to be read and used alongside this book.
Each chapter has its own section. Tap or click a chapter title or label to show or hide the resources for that chapter:
Further reading | Suggested further reading.
Data and analysis | Worked examples of calculations especially for this book.
Calculator | General-purpose calculator tools.
Citation
Wallis, Sean (2021). Statistics in Corpus Linguistics Research – A New Approach, New York, London: Routledge. » Publisher's website
Table of contents
Preface
Part 1. Motivations
1. What might corpora tell us about language?
Further reading
- Software: ICECUP, see also Fuzzy Tree Fragments
- Corpora: ICE-GB and DCPSE
- Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished
- Wallis, Sean (2019). Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24(4), 490-521. » corp.ling.stats » ePublished » Data and spreadsheets
Part 2. Designing effective experiments with corpora
2. The idea of corpus experiments
Data and analysis
- Simple analysis of who vs. whom data (Excel spreadsheet)
3. That vexed problem of choice
Further reading
- Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished
- Bowie, Jill, Sean Wallis and Bas Aarts (2013). The perfect in spoken British English. Chapter 13 in Aarts, Close, Leech and Wallis 2013. » ePublished
- Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.
- Wallis, Sean (2020). Grammar and Corpus Methodology. Chapter 4 in Bas Aarts, Gergana Popova and Jill Bowie (eds.) The Oxford Handbook of English Grammar. Oxford University Press.
4. Choice versus meaning
Data and analysis
- Semasiological and onomasiological analysis of very data (Excel spreadsheet)
5. Balanced samples and imagined populations
Further reading
- Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.
Part 3. Confidence intervals and significance tests
6. Introducing inferential statistics
Further reading
- Stahl, S. (2006). The Evolution of the Normal Distribution. Mathematics Magazine, 79(2), 96-113. » ePublished
Calculator
- Binomial demonstrator (Excel spreadsheet)
7. Plotting with confidence
Data and analysis
- Magnus Levin's BE thinking data (Excel spreadsheet)
  - Plotting Wilson intervals, Wilson c.c.
  - Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart
  - Same sample z test (see Chapter 9)
8. From intervals to tests
Further reading
- Brown, Lawrence, Tony Cai and Anirban DasGupta (2001). Interval estimation for a binomial proportion. Statistical Science 16, 101-133.
- Newcombe, Robert (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857-872.
- Newcombe, Robert (1998b). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 873-890.
- Wallis, Sean (2013). Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20(3), 178-208. » corp.ling.stats » ePublished
Data and analysis
- Mood x transitivity analysis (Excel spreadsheet)
Calculator
- 2 x 2, 2 x 1 χ² tests (Excel spreadsheet)
  - 2 x 1 goodness of fit χ², Yates's χ², Wilson, Wilson c.c.
  - 2 x 2 homogeneity (independence) tests including χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c.
- Wilson finite population calculations (Excel spreadsheet)
  - Single proportion example
  - Newcombe-Wilson test adjustment following resampling (see Chapter 16)
9. Comparing frequencies within the same distribution
Data and analysis
- Magnus Levin's BE thinking data (Excel spreadsheet)
  - Plotting Wilson intervals, Wilson c.c.
  - Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart (see Chapter 7)
  - Same sample z test
Calculator
- Single-sample z test (Excel spreadsheet)
10. Reciprocating the Wilson interval
Data and analysis
- Analysis of sentence length data (Excel spreadsheet)
11. Competition between choices over time
Further reading
- Bowie, Jill and Sean Wallis (2016). The to-infinitival perfect: a study of decline. In Valentin Werner, Elena Seoane and Cristina Suárez-Gómez (eds.) Re-assessing the Present Perfect, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.
- Neumerzhitckii, Evgenii (2018). Three-body simulator (website).
- Wallis, Sean (2020). Boundaries in nature, corp.ling.stats, London: Survey of English Usage, UCL.
12. The replication crisis and the New Statistics
Further reading
- Gelman, Andrew (2016). What has happened down here is the winds have changed. Statistical Modelling, Causal Inference and Social Science. » blog post
- Gelman, Andrew and Eric Loken (2013). The garden of forking paths. Columbia University. » ePublished
- Leech, Geoffrey (2011). The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. International Journal of Corpus Linguistics, 16(4), 547-564.
- Millar, Neil (2009). Modal verbs in TIME: frequency changes 1923–2006. International Journal of Corpus Linguistics, 14(2), 191-220.
13. Choosing the right test
Further reading
- Gries, Stefan Th. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora, 10(1), 95-125.
- Oakes, Michael (1998). Statistics for Corpus Linguistics. Edinburgh: EUP.
- Ruxton, Graeme (2006). The unequal variance t-test is an underused alternative to Student’s t-test and the Mann-Whitney U test. Behavioral Ecology, 17, 688-690.
- Sheskin, David (2011). Handbook of Parametric and Nonparametric Statistical Procedures (5th ed.). Boca Raton, FL: CRC Press.
Part 4. From effect sizes to meta-tests
14. The size of an effect
Further reading
- Wallis, Sean (2012). Goodness of fit measures for discrete categorical data, corp.ling.stats, London: Survey of English Usage, UCL.
- Wallis, Sean (2012). Measures of association for contingency tables, corp.ling.stats, London: Survey of English Usage, UCL.
- Wallis, Sean (2019). Confidence intervals on pairwise φ statistics, corp.ling.stats, London: Survey of English Usage, UCL.
15. Meta-tests for comparing tables of results
Calculator
- separability tests (Excel spreadsheet)
  - homogeneity (gradient, point, multi-point)
    - 2 x 2 χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c. and φ comparisons
    - r x 2 χ², Yates's χ²
    - r x c χ²
  - goodness of fit
    - 2 x 1 χ², Wilson, Wilson c.c.
    - r x 1 χ²
  - homogeneity superset/subset (gradient, point, multi-point)
    - 2 x 2 χ², Yates's χ², Wilson, Wilson c.c.
  - goodness of fit superset/subset
    - 2 x 1 Wilson, Wilson c.c.
  - homogeneity (gradient, point, multi-point)
Part 5. Statistical solutions for corpus samples
16. Coping with imperfect data
Further reading
- Wallis, Sean and Seth Mehl (forthcoming). Comparing baselines for corpus analysis: Research into the get-passive in speech and writing. In Ole Schützler and Julia Schlüter (eds.) Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge University Press.
Calculator
- Wilson finite population calculations (Excel spreadsheet)
  - Single proportion example (see Chapter 8)
  - Newcombe-Wilson test adjustment following resampling
17. Adjusting intervals for random-text samples
Data and analysis
- Large sample analysis of p(inter) and p(CL | word) data (Excel spreadsheet)
- Analysis of p(inter) data for each ICE-GB text category (Excel spreadsheet)
- Partitioned and pooled analysis of shall / will data in DCPSE (Excel spreadsheet)
Part 6. Concluding remarks
18. Plotting the Wilson distribution
Calculator
- Wilson distribution (Excel spreadsheet)
  - Wilson, Wilson c.c., Logit Wilson
  - Clopper-Pearson (up to n = 10)
19. In conclusion
Appendices
Appendix A. The interval equality principle
Appendix B. Pseudo-code for computational procedures
Glossary
References
Blog
corp.ling.stats
Publisher’s website
- Statistics in Corpus Linguistics Research (Routledge)
Please send all comments and questions to s.wallis@ucl.ac.uk
This page last modified 20 January, 2021 by Survey Web Administrator.