Survey of English Usage

Statistics in Corpus Linguistics Research

A New Approach

ISBN 9781138589384

This webpage aggregates resources to be read and used alongside this book.

Each chapter has its own section below. The resources for each chapter are grouped under the following labels:

Further reading		Suggested further reading.

Data and analysis		Worked examples of calculations especially for this book.

Calculator		General purpose calculator tools.

Citation

Wallis, Sean (2021). Statistics in Corpus Linguistics Research – A New Approach, New York, London: Routledge. » Publisher's website

Statistics in Corpus Linguistics Research – A New Approach

Preface

Part 1. Motivations

1. What might corpora tell us about language?

Further reading

Software: ICECUP, see also Fuzzy Tree Fragments
Corpora: ICE-GB and DCPSE
Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished
Wallis, Sean (2019). Investigating the additive probability of repeated language production decisions. International Journal of Corpus Linguistics 24(4), 490-521. » corp.ling.stats » ePublished » Data and spreadsheets

Part 2. Designing effective experiments with corpora

2. The idea of corpus experiments

Data and analysis

Simple analysis of who vs. whom data (Excel spreadsheet)

3. That vexed problem of choice

Further reading

Aarts, Bas, Jo Close and Sean Wallis (2013). Choices over time: methodological issues in current change. Chapter 2 in Bas Aarts, Jo Close, Geoffrey Leech and Sean Wallis (eds.) The Verb Phrase in English. Cambridge University Press. » ePublished
Bowie, Jill, Sean Wallis and Bas Aarts (2013). The perfect in spoken British English. Chapter 13 in Aarts, Close, Leech and Wallis 2013. » ePublished
Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.
Wallis, Sean (2020). Grammar and Corpus Methodology. Chapter 4 in Bas Aarts, Gergana Popova and Jill Bowie (eds.) The Oxford Handbook of English Grammar. Oxford University Press.

4. Choice versus meaning

Data and analysis

Semasiological and onomasiological analysis of very data (Excel spreadsheet)

5. Balanced samples and imagined populations

Further reading

Bowie, Jill, Sean Wallis and Bas Aarts (2014). Contemporary change in modal usage in spoken British English: mapping the impact of “genre”. In Juana I. Marín-Arrese, Marta Carretero, Jorge Arús Hita and Johan van der Auwera (eds.) English Modality, Berlin: De Gruyter, 57-94.

Part 3. Confidence intervals and significance tests

6. Introducing inferential statistics

Further reading

Stahl, S. (2006). The Evolution of the Normal Distribution. Mathematics Magazine, 79(2), 96-113. » ePublished

Calculator

Binomial demonstrator (Excel spreadsheet)

7. Plotting with confidence

Data and analysis

Magnus Levin's BE thinking data (Excel spreadsheet)
- Plotting Wilson intervals, Wilson c.c.
- Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart
- Same sample z test — see Chapter 9

8. From intervals to tests

Further reading

Brown, Lawrence, Tony Cai and Anirban DasGupta (2001). Interval estimation for a binomial proportion. Statistical Science 16, 101-133.
Newcombe, Robert (1998a). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17, 857-872.
Newcombe, Robert (1998b). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17, 873-890.
Wallis, Sean (2013). Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods. Journal of Quantitative Linguistics 20(3), 178-208. » corp.ling.stats » ePublished

Data and analysis

Mood x transitivity analysis (Excel spreadsheet)

Calculator

2 x 2, 2 x 1 χ² tests (Excel spreadsheet)
- 2 x 1 goodness of fit χ², Yates's χ², Wilson, Wilson c.c.
- 2 x 2 homogeneity (independence) tests including χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c.
Wilson finite population calculations (Excel spreadsheet)
- Single proportion example
- Newcombe-Wilson test adjustment following resampling — see Chapter 16
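
For readers without Excel, the Wilson interval named in the calculator list for this chapter can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook Wilson score formula without continuity correction. It is not taken from the spreadsheets or from the book's Appendix B, and the function name wilson_interval and its parameters are hypothetical.

```python
import math


def wilson_interval(p, n, z=1.959964):
    """Return (lower, upper) Wilson score bounds for an observed proportion.

    p: observed proportion, between 0 and 1
    n: sample size (number of cases), must be positive
    z: two-tailed critical value of the Normal distribution
       (1.959964 corresponds to a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")

    z2 = z * z
    denom = 1 + z2 / n
    # Centre of the interval, shifted from p towards 0.5.
    centre = (p + z2 / (2 * n)) / denom
    # Half-width of the interval around that centre.
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    lower, upper = wilson_interval(0.5, 10)
    print(f"p = 0.5, n = 10: ({lower:.4f}, {upper:.4f})")
```

For p = 0.5 and n = 10 the sketch gives roughly (0.237, 0.763). The continuity-corrected and Newcombe-Wilson variants listed above need further steps that are not shown here.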

9. Comparing frequencies within the same distribution

Data and analysis

Magnus Levin's BE thinking data (Excel spreadsheet)
- Plotting Wilson intervals, Wilson c.c.
- Plotting Newcombe-Wilson intervals, percentage difference and floating bar chart — see Chapter 7
- Same sample z test

Calculator

Single-sample z test (Excel spreadsheet)

10. Reciprocating the Wilson interval

Data and analysis

Analysis of sentence length data (Excel spreadsheet)

11. Competition between choices over time

Further reading

Bowie, Jill and Sean Wallis (2016). The to-infinitival perfect: a study of decline. In Valentin Werner, Elena Seoane and Cristina Suárez-Gómez (eds.) Re-assessing the Present Perfect, Topics in English Linguistics (TiEL) 91. Berlin: De Gruyter, 43-94.
Neumerzhitckii, Evgenii (2018). Three-body simulator (website).
Wallis, Sean (2020). Boundaries in nature, corp.ling.stats, London: Survey of English Usage, UCL.

12. The replication crisis and the New Statistics

Further reading

Gelman, Andrew (2016). What has happened down here is the winds have changed. Statistical Modelling, Causal Inference and Social Science. » blog post
Gelman, Andrew and Eric Loken (2013). The garden of forking paths. Columbia University. » ePublished
Leech, Geoffrey (2011). The modals ARE declining: reply to Neil Millar’s ‘Modal verbs in TIME: frequency changes 1923–2006’. International Journal of Corpus Linguistics, 16(4), 547-564.
Millar, Neil (2009). Modal verbs in TIME: frequency changes 1923–2006. International Journal of Corpus Linguistics, 14(2), 191-220.

13. Choosing the right test

Further reading

Gries, Stefan Th. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora, 10(1), 95-125.
Oakes, Michael (1998). Statistics for Corpus Linguistics. Edinburgh: EUP.
Ruxton, Graeme (2006). The unequal variance t-test is an underused alternative to Student’s t-test and the Mann-Whitney U test. Behavioral Ecology, 17, 688–690.
Sheskin, David (2011). Handbook of Parametric and Nonparametric Statistical Procedures (5th ed.). Boca Raton, FL: CRC Press.

Part 4. From effect sizes to meta-tests

14. The size of an effect

Further reading

Wallis, Sean (2012). Goodness of fit measures for discrete categorical data, corp.ling.stats, London: Survey of English Usage, UCL.
Wallis, Sean (2012). Measures of association for contingency tables, corp.ling.stats, London: Survey of English Usage, UCL.
Wallis, Sean (2019). Confidence intervals on pairwise φ statistics, corp.ling.stats, London: Survey of English Usage, UCL.

15. Meta-tests for comparing tables of results

Calculator

Separability tests (Excel spreadsheet)
- homogeneity (gradient, point, multi-point)
  2 x 2 χ², Yates's χ², Newcombe-Wilson, Newcombe-Wilson c.c. and φ comparisons
  r x 2 χ², Yates's χ²
  r x c χ²
- goodness of fit
  2 x 1 χ², Wilson, Wilson c.c.
  r x 1 χ²
- homogeneity superset/subset (gradient, point, multi-point)
  2 x 2 χ², Yates's χ², Wilson, Wilson c.c.
- goodness of fit superset/subset
  2 x 1 Wilson, Wilson c.c.

Part 5. Statistical solutions for corpus samples

16. Coping with imperfect data

Further reading

Wallis, Sean and Seth Mehl (forthcoming). Comparing baselines for corpus analysis: Research into the get-passive in speech and writing. In Ole Schützler and Julia Schlüter (eds.) Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge University Press.

Calculator

Wilson finite population calculations (Excel spreadsheet)
- Single proportion example — see Chapter 8
- Newcombe-Wilson test adjustment following resampling

17. Adjusting intervals for random-text samples

Data and analysis

Large sample analysis of p(inter) and p(CL | word) data (Excel spreadsheet)
Analysis of p(inter) data for each ICE-GB text category (Excel spreadsheet)
Partitioned and pooled analysis of shall / will data in DCPSE (Excel spreadsheet)

Part 6. Concluding remarks

18. Plotting the Wilson distribution

Calculator

Wilson distribution (Excel spreadsheet)
- Wilson, Wilson c.c., Logit Wilson
- Clopper-Pearson (up to n = 10)

19. In conclusion

Appendices

Appendix A. The interval equality principle

Appendix B. Pseudo-code for computational procedures

Glossary

References

Blog

corp.ling.stats

Publisher’s website

Statistics in Corpus Linguistics Research (Routledge)

Please send all comments and questions to s.wallis@ucl.ac.uk

This page last modified 20 January, 2021 by Survey Web Administrator.
