Language Independent Methods of Clustering Similar Contexts (with applications)

Ted Pedersen



Presenter: Ted Pedersen
Type: Tutorial
Venue: EACL 2006 in Trento, Italy
Date: April 4, 2006
Recording: Reinhard Rapp
Duration: 135 minutes



pdf presentation slides (4 MB)

flv flash video (714 MB, 320 * 240 pixels)

mpeg2 video part 1  (1300 MB, 720 * 480 pixels)

mpeg2 video part 2  (800 MB, 720 * 480 pixels)



Methods that identify similar (but not identical) units of text have wide potential application. For example, Web search results can be better organized by grouping together pages with related and similar content. Email can be automatically foldered and categorized by finding which messages are similar to each other. Word senses can be discovered by clustering multiple contexts that use a particular ambiguous word.

This tutorial introduces a language-independent methodology for identifying similar contexts based on lexical features. It explores the use of first- and second-order co-occurrence vectors for representing contexts, and introduces dimensionality reduction methods that lower the noise and computational complexity associated with these large feature spaces. A number of different clustering methods are discussed, as are various methods of evaluating the quality of the clustering results. Finally, the tutorial explores methods of automatically generating descriptive labels for clusters.
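The difference between first- and second-order context representations can be illustrated with a minimal Python sketch. It is not taken from the tutorial or from SenseClusters; the toy corpus, the whitespace tokenizer, and all function names (first_order_vector, cooccurrence_matrix, second_order_vector, cosine) are assumptions made for illustration. A first-order vector records the words that occur in a context, while a second-order vector averages the co-occurrence vectors of those words, so two contexts that share no words can still come out as similar.

```python
from collections import Counter, defaultdict
from math import sqrt


def tokenize(text):
    # Assumption: lowercasing and whitespace splitting stand in for real tokenization.
    return text.lower().split()


def first_order_vector(context):
    """First-order representation: counts of the words in the context itself."""
    return Counter(tokenize(context))


def cooccurrence_matrix(corpus):
    """Map each word to a Counter of the other words that share a context with it."""
    matrix = defaultdict(Counter)
    for context in corpus:
        words = set(tokenize(context))
        for word in words:
            for other in words:
                if other != word:
                    matrix[word][other] += 1
    return matrix


def second_order_vector(context, matrix):
    """Second-order representation: average of the co-occurrence vectors
    of the context's words (words unseen in the corpus are skipped)."""
    words = [w for w in tokenize(context) if w in matrix]
    if not words:
        return {}
    total = Counter()
    for word in words:
        total.update(matrix[word])
    return {feature: count / len(words) for feature, count in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus used to collect co-occurrence statistics.
corpus = [
    "the bank approved the loan",
    "the bank holds the deposit",
    "the river flooded the shore",
]
matrix = cooccurrence_matrix(corpus)

ctx_a, ctx_b, ctx_c = "loan approved", "deposit holds", "river shore"

# No words in common, so the first-order similarity is zero.
first_ab = cosine(first_order_vector(ctx_a), first_order_vector(ctx_b))

# Second-order vectors pick up the shared neighbour "bank".
second_ab = cosine(second_order_vector(ctx_a, matrix), second_order_vector(ctx_b, matrix))
second_ac = cosine(second_order_vector(ctx_a, matrix), second_order_vector(ctx_c, matrix))

print(first_ab, second_ab, second_ac)
```

In this toy setting the two banking contexts end up closer to each other than to the river context, even though they have no words in common. The function word "the" still contributes similarity to every pair, which hints at why feature selection and dimensionality reduction matter on realistic data. A clustering step would then group contexts using these vectors and a similarity measure such as the cosine above.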

The original tutorial also included a hands-on option for those with laptop computers (not on video). Attendees were given a bootable Knoppix CD that let them experiment with many of these ideas and applications using the SenseClusters package.

For further reading, publications related to the tutorial content can be found at the following URL: