
Statistics in Corpus Linguistics Research

A New Approach

ISBN 9781138589384

This webpage aggregates resources to be read and used alongside this book.

Each chapter has its own section. Tap or click on the chapter title or one of its labels to show or hide the resources for that chapter:

Further reading: Suggested further reading.
Data and analysis: Worked examples of calculations especially for this book.
Calculator: General-purpose calculator tools.

Citation

Wallis, Sean (2021). Statistics in Corpus Linguistics Research – A New Approach, New York, London: Routledge. » Publisher's website

Contents

Preface

Part 1. Motivations

1. What might corpora tell us about language?
Further reading
  • Software: ICECUP, see also Fuzzy Tree Fragments
  • Corpora: ICE-GB and DCPSE
  • Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished
  • Wallis, Sean (2019). Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24(4), 490-521. » corp.ling.stats » ePublished » Data and spreadsheets

Part 2. Designing effective experiments with corpora

2. The idea of corpus experiments
Data and analysis
3. That vexed problem of choice
Further reading
  • Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished
  • Bowie, Jill, Sean Wallis and Bas Aarts (2013). The perfect in spoken British English. Chapter 13 in Aarts, Close, Leech and Wallis 2013. » ePublished
  • Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.
  • Wallis, Sean (2020). Grammar and Corpus Methodology. Chapter 4 in Bas Aarts, Gergana Popova and Jill Bowie (eds.) The Oxford Handbook of English Grammar. Oxford University Press.
4. Choice versus meaning
Data and analysis
5. Balanced samples and imagined populations
Further reading
  • Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.

Part 3. Confidence intervals and significance tests

6. Introducing inferential statistics
Calculator
Further reading
  • Stahl, S. (2006). The Evolution of the Normal Distribution. Mathematics Magazine, 79(2), 96-113. » ePublished
7. Plotting with confidence
Data and analysis
  • Magnus Levin's BE thinking data (Excel spreadsheet)
    • Plotting Wilson intervals, Wilson c.c.
    • Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart
    • Same sample z test — see Chapter 9
8. From intervals to tests
Calculator
Data and analysis
Further reading
  • Brown, Lawrence, Tony Cai and Anirban DasGupta (2001). Interval estimation for a binomial proportion. Statistical Science 16, 101-133.
  • Newcombe, Robert (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857-872.
  • Newcombe, Robert (1998b). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 873-890.
  • Wallis, Sean (2013). Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20(3), 178-208. » corp.ling.stats » ePublished
  • 2 x 2, 2 x 1 χ² tests (Excel spreadsheet)
    • 2 x 1 goodness of fit χ², Yates's χ², Wilson, Wilson c.c.
    • 2 x 2 homogeneity (independence) tests including χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c.
  • Wilson finite population calculations (Excel spreadsheet)
    • Single proportion example
    • Newcombe-Wilson test adjustment following resampling — see Chapter 16
9. Comparing frequencies within the same distribution
Calculator
Data and analysis
  • Magnus Levin's BE thinking data (Excel spreadsheet)
    • Plotting Wilson intervals, Wilson c.c.
    • Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart — see Chapter 7
    • Same sample z test
10. Reciprocating the Wilson interval
Data and analysis
11. Competition between choices over time
Further reading
  • Bowie, Jill and Sean Wallis (2016). The to-infinitival perfect: a study of decline. In Valentin Werner, Elena Seoane and Cristina Suárez-Gómez (eds.) Re-assessing the Present Perfect, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.
  • Neumerzhitckii, Evgenii (2018). Three-body simulator (website).
  • Wallis, Sean (2020). Boundaries in nature, corp.ling.stats, London: Survey of English Usage, UCL.
12. The replication crisis and the New Statistics
Further reading
  • Gelman, Andrew (2016). What has happened down here is the winds have changed. Statistical Modelling, Causal Inference and Social Science. » blog post
  • Gelman, Andrew and Eric Loken (2013). The garden of forking paths. Columbia University. » ePublished
  • Leech, Geoffrey (2011). The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. International Journal of Corpus Linguistics, 16(4), 547-564.
  • Millar, Neil (2009). Modal verbs in TIME: frequency changes 1923–2006. International Journal of Corpus Linguistics, 14(2), 191-220.
13. Choosing the right test
Further reading
  • Gries, Stefan Th. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora, 10(1), 95-125.
  • Oakes, Michael (1998). Statistics for Corpus Linguistics. Edinburgh: EUP.
  • Ruxton, Graeme (2006). The unequal variance t-test is an underused alternative to Student’s t-test and the Mann-Whitney U test. Behavioral Ecology, 17, 688-690.
  • Sheskin, David (2011). Handbook of Parametric and Nonparametric Statistical Procedures (5th ed.). Boca Raton, FL: CRC Press.

Part 4. From effect sizes to meta-tests

14. The size of an effect
Further reading
  • Wallis, Sean (2012). Goodness of fit measures for discrete categorical data, corp.ling.stats, London: Survey of English Usage, UCL.
  • Wallis, Sean (2012). Measures of association for contingency tables, corp.ling.stats, London: Survey of English Usage, UCL.
  • Wallis, Sean (2019). Confidence intervals on pairwise φ statistics, corp.ling.stats, London: Survey of English Usage, UCL.
15. Meta-tests for comparing tables of results
Calculator
  • Separability tests (Excel spreadsheet)
    • Homogeneity (gradient, point, multi-point)
      2 x 2 χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c. and φ comparisons
      r x 2 χ², Yates's χ²
      r x c χ²
    • Goodness of fit
      2 x 1 χ², Wilson, Wilson c.c.
      r x 1 χ²
    • Homogeneity superset/subset (gradient, point, multi-point)
      2 x 2 χ², Yates's χ², Wilson, Wilson c.c.
    • Goodness of fit superset/subset
      2 x 1 Wilson, Wilson c.c.

Part 5. Statistical solutions for corpus samples

16. Coping with imperfect data
Calculator
Further reading
  • Wallis, Sean and Seth Mehl (forthcoming). Comparing baselines for corpus analysis: Research into the get-passive in speech and writing. In: Ole Schützler and Julia Schlüter (eds.) Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge University Press.
17. Adjusting intervals for random-text samples
Data and analysis

Part 6. Concluding remarks

18. Plotting the Wilson distribution
Calculator
  • Wilson distribution (Excel spreadsheet)
    • Wilson, Wilson c.c., Logit Wilson
    • Clopper-Pearson (up to n = 10)
19. In conclusion

Appendices

Appendix A. The interval equality principle
Appendix B. Pseudo-code for computational procedures

Glossary

References

Blog

corp.ling.stats

Publisher’s website

Please send all comments and questions to s.wallis@ucl.ac.uk

This page last modified 20 January, 2021 by Survey Web Administrator.