
Text corpora

To obtain reliable estimates of word co-occurrences, large text corpora have to be used. Since the associations of the "average subject" are to be simulated, the texts should not be specific to a certain domain, but should reflect the wide range of text and speech types encountered in everyday life.

The following selection of some 33 million words of machine-readable English text used in this study is a modest attempt to achieve this goal:

To compute associations for German, the following corpora, comprising about 21 million words, were used:

For technical reasons, not all words occurring in the corpora were used in the simulation. The vocabulary consists of all words that appear more than ten times in the English or German corpus. It also includes all 100 stimulus words and all responses in the English or German association norms. This yields an English vocabulary of about 72000 words and a German vocabulary of 65000 words. Here, a word is defined as a string of alpha characters delimited by non-alpha characters. Punctuation marks and special characters are also treated as words.
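The word definition and the frequency threshold can be illustrated with a minimal Python sketch. It is not the code used in the study. The names `tokenize`, `build_vocabulary` and `required_words` are hypothetical, and several details are assumptions the text does not settle: case is left unchanged, each punctuation mark, digit or other special character becomes its own token, and whitespace is discarded.

```python
import re
from collections import Counter

# A token is either a maximal run of alphabetic characters (Unicode-aware,
# so German umlauts are letters) or one single non-alphabetic,
# non-whitespace character. Treating each digit and each punctuation mark
# as a separate token is an assumption, not something the text specifies.
TOKEN_RE = re.compile(r"[^\W\d_]+|[^\w\s]|[\d_]")


def tokenize(text):
    """Split text into alphabetic words and single special characters."""
    return TOKEN_RE.findall(text)


def build_vocabulary(tokens, required_words=(), min_count=10):
    """Keep tokens occurring more than min_count times, plus required words.

    required_words stands in for the stimulus words and the responses from
    the association norms, which are included regardless of frequency.
    """
    counts = Counter(tokens)
    vocabulary = {word for word, n in counts.items() if n > min_count}
    vocabulary.update(required_words)
    return vocabulary
```

Under these assumptions, `tokenize("Hello, world!")` returns `['Hello', ',', 'world', '!']`, and a token that occurs exactly ten times is excluded from the vocabulary unless it is listed in `required_words`.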



Reinhard Rapp
Tue Aug 13 18:20:02 MET DST 1996