- SUBTLEX-PT-BR: Brazilian Portuguese Frequency Corpus using Subtitles. Tang, K. (2013) A 61 Million Word Corpus of Brazilian Portuguese Film Subtitles as a Resource for Linguistic Research. UCL Working Papers in Linguistics 24.
Three versions of the corpus are available below:
- Unigram (The most basic version, with OLD20 (Orthographic Levenshtein Distance 20), a measure of orthographic neighbourhood density)
- Lemmatised and Part-of-Speech Tagged (Useful for finding the frequency of lemmas, their related forms, and their parts of speech)
- Bigram (Useful for collocation frequency and identifying compounds)
[Feel free to email kevin.tang.10@ucl.ac.uk for the details of these different versions]
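For readers unfamiliar with the OLD20 measure included in the Unigram version, the following is a minimal sketch of how such a value can be computed. It assumes the usual definition of OLD20 as the mean Levenshtein distance from a word to its 20 closest neighbours in the lexicon; the function names `levenshtein` and `old_n` and the toy word list are illustrative and are not taken from the corpus or its tooling, which may compute the measure differently.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def old_n(word, lexicon, n=20):
    """Mean edit distance from `word` to its n closest neighbours in `lexicon`.

    The word itself is excluded. If the lexicon holds fewer than n other
    words, all of them are used (an assumption made for this sketch).
    """
    distances = sorted(levenshtein(word, other)
                       for other in lexicon if other != word)
    nearest = distances[:n]
    if not nearest:
        raise ValueError("lexicon contains no other words")
    return sum(nearest) / len(nearest)


# Toy lexicon for illustration only; a real run would use the full word list.
toy_lexicon = ["casa", "caso", "cama", "carro", "gato"]
score = old_n("casa", toy_lexicon, n=2)
```

A lower value means the word sits in a denser orthographic neighbourhood. On a full word list this brute-force comparison is slow, so a real implementation would likely need a faster neighbour search.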
- Wuggy Brazilian Portuguese Module: Pseudo-word generator for Brazilian Portuguese using SUBTLEX-PT-BR word list. [Coming soon]
- FindUniqueCharacters is a very short script that tells you which unique characters appear in a file. (Useful for checking whether anything is missing from your transcription convention, for example.) This .exe requires .NET (Windows computers will already have it; other operating systems can use Mono). [Download it here] [Creator: Elizabeth Eden, elizabeth.eden.11@ucl.ac.uk]
- LatexifyUnicodeIPA takes a file with a list of words in Unicode and outputs a file with them in TIPA (LaTeX format), or vice versa. It can handle diacritics and other multi-character Unicode sequences. It currently cannot handle multiple words on the same line. This .exe requires .NET (Windows computers will already have it; other operating systems can use Mono). [Download it here] [Creator: Elizabeth Eden, elizabeth.eden.11@ucl.ac.uk]
- CheckPinyin takes an input file with a list of words (separated by spaces if there is more than one per line) and checks that each one can be found in a reference file, ignoring case. (As the name suggests, it was written for Pinyin.) It outputs two files: a list of valid lines and a list of invalid lines. [Download it here] [Creator: Elizabeth Eden, elizabeth.eden.11@ucl.ac.uk]
- Cleaning tool for Hayes' Phonotactic Learner [Coming soon]
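For anyone who cannot run the .exe tools listed above, the behaviour described for FindUniqueCharacters and CheckPinyin can be approximated in a few lines of Python. This is only a sketch of the described behaviour, not the original programs: the function names are hypothetical, and treating a line as valid only when every word on it is found in the reference list is an assumption.

```python
def unique_characters(text):
    """Return the distinct characters in `text`, sorted, ignoring line breaks."""
    return sorted(set(text) - {"\n", "\r"})


def check_words(lines, reference_words):
    """Split `lines` into (valid, invalid) against a reference word list.

    Matching ignores case. A line counts as valid only if every
    space-separated word on it is in the reference list (an assumption);
    blank lines are skipped.
    """
    known = {word.casefold() for word in reference_words}
    valid, invalid = [], []
    for line in lines:
        words = line.split()
        if not words:
            continue
        if all(word.casefold() in known for word in words):
            valid.append(line)
        else:
            invalid.append(line)
    return valid, invalid


# Illustrative data only; real input would be read from files.
chars = unique_characters("ni hao\nNi Hao")
valid, invalid = check_words(["ni hao", "Zhong guo", "nii hao"],
                             ["ni", "hao", "zhong", "guo"])
```

To work on files, the text can be read with `pathlib.Path(...).read_text(encoding="utf-8")` and split into lines with `splitlines()`, and the two resulting lists can be written out as separate files.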