The text corpora were read in word by word. Whenever one of the 100 stimulus words occurred, it was determined which other words occurred within a distance of twelve words to the left or to the right of the stimulus word, and a counter was updated for every such pair. The frequencies of co-occurrence H(i&j) defined in this way, the frequencies of the single words H(i), and the total number of words in the corpus Q were stored in tables. Using these tables, the probabilities in formula (4) can be replaced by relative frequencies:
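As a hedged reconstruction: assuming formula (4) is the ratio of the joint probability to the product of the single-word probabilities (an assumption, since (4) is defined earlier), and writing H(i&j) for the co-occurrence count, substituting p(i) ≈ H(i)/Q, p(j) ≈ H(j)/Q and p(i&j) ≈ H(i&j)/Q gives a form that agrees with the description of the two terms given next. The symbol \hat{r}_{i,j} for the estimated association strength is likewise an assumed notation.

```latex
\[
\hat{r}_{i,j}
  = \frac{H(i \,\&\, j)/Q}{\bigl(H(i)/Q\bigr)\,\bigl(H(j)/Q\bigr)}
  = \frac{Q}{H(i)} \cdot \frac{H(i \,\&\, j)}{H(j)}
\tag{5}
\]
```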
In formula (5), the first term on the right-hand side does not depend on j and therefore has no effect on the prediction of the associative response. Because H(j) appears in the denominator of the second term, estimation errors have a strong impact on the association strengths of rare words. Therefore, formula (5) had to be modified so that words with low corpus frequencies are weakened.
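The counting procedure and the rare-word problem described above can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details from the text: the function names are hypothetical, the corpus is held as an in-memory token list instead of being streamed, every occurrence of a neighbor inside the window is counted separately, and the unsmoothed estimate is taken to be Q/H(i) · H(i&j)/H(j).

```python
from collections import Counter


def count_cooccurrences(tokens, stimuli, window=12):
    """Count single words and (stimulus, neighbor) pairs within `window` positions."""
    word_freq = Counter(tokens)  # H(i): frequency of each single word
    pair_freq = Counter()        # H(i&j), keyed by (stimulus, neighbor)
    for pos, word in enumerate(tokens):
        if word not in stimuli:
            continue
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for k in range(lo, hi):
            if k != pos:  # skip the stimulus occurrence itself
                pair_freq[(word, tokens[k])] += 1
    return pair_freq, word_freq, len(tokens)  # len(tokens) plays the role of Q


def relative_frequency_strength(i, j, pair_freq, word_freq, total):
    """Assumed unsmoothed estimate: Q/H(i) * H(i&j)/H(j)."""
    return (total / word_freq[i]) * (pair_freq[(i, j)] / word_freq[j])


# Toy corpus with a window of 1 so the counts are easy to check by hand.
tokens = ["bread", "butter", "jam", "bread", "bread", "butter", "quark"]
pair_freq, word_freq, total = count_cooccurrences(tokens, {"butter"}, window=1)

strength_bread = relative_frequency_strength("butter", "bread", pair_freq, word_freq, total)
strength_quark = relative_frequency_strength("butter", "quark", pair_freq, word_freq, total)
print(strength_bread, strength_quark)
```

In this toy corpus "quark" occurs only once, next to "butter", and it outranks "bread", which co-occurs with "butter" twice. A single chance co-occurrence of a rare word is enough to dominate the unsmoothed estimate, which is the effect that motivates weakening low-frequency words.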
According to our model, the word j with the highest associative strength to the stimulus word i should be the associative response. The best results were observed when parameter α was set to 0.66. Parameters β and γ turned out to be relatively uncritical and, to simplify parameter optimization, were both set to the same value of 0.00002.
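The prediction step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the modified formula divides the co-occurrence count by the word frequency raised to the exponent 0.66 (called alpha here) whenever the word frequency exceeds a threshold of 0.00002 · Q (beta), and otherwise assigns the small constant 1/(0.00002 · Q) (gamma). The function names, parameter names, and toy counts are hypothetical.

```python
def association_strength(h_ij, h_j, total, alpha=0.66, beta=0.00002, gamma=0.00002):
    """Assumed thresholded strength; the exact form is an assumption, not taken from the text."""
    if h_j > beta * total:
        # Frequent enough: dampen the influence of H(j) with the exponent alpha.
        return h_ij / h_j ** alpha
    # Rare word: ignore the unreliable co-occurrence count entirely.
    return 1.0 / (gamma * total)


def predict_response(stimulus, pair_freq, word_freq, total, **params):
    """Return the co-occurring word with the highest strength, or None if there is none."""
    candidates = (j for (i, j) in pair_freq if i == stimulus and j != stimulus)
    return max(
        candidates,
        key=lambda j: association_strength(
            pair_freq[(stimulus, j)], word_freq[j], total, **params
        ),
        default=None,
    )


# Invented toy counts for a corpus of one million words (threshold beta * Q = 20).
total = 1_000_000
word_freq = {"butter": 500, "bread": 2000, "quark": 5}
pair_freq = {("butter", "bread"): 120, ("butter", "quark"): 4}

response = predict_response("butter", pair_freq, word_freq, total)
print(response)
```

With these invented counts the plain ratio H(i&j)/H(j) would favor the rare word "quark" (4/5 against 120/2000), whereas the thresholded strength assigns "quark" only the small constant and selects the frequent word "bread".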
Ongoing research shows that formula (6) has a number of weaknesses; for example, it does not discriminate between words with a co-occurrence frequency of zero, as discussed by Gale & Church (1990) in a comparable context. However, since the results reported later are acceptable, it probably gets the two major issues right. One is that subjects usually respond with common, i.e. frequent, words in the free association task. The other is that estimates of co-occurrence frequencies for low-frequency words are too poor to be useful.