Language Independent Methods of Clustering Similar Contexts (with applications)
Venue: EACL 2006 in Trento, Italy
Date: April 4, 2006
pdf presentation slides (4 MB)
flv flash video (714 MB, 320 * 240 pixels)
mpeg2 video part 1 (1300 MB, 720 * 480 pixels)
mpeg2 video part 2 (800 MB, 720 * 480 pixels)
Methods that identify similar (but not identical) units of text have wide potential application. For example, Web search results can be better organized by grouping together pages with related and similar content. Email can be automatically foldered and categorized by finding which messages are similar to each other. Word senses can be discovered by clustering multiple contexts that use a particular ambiguous word.
This tutorial introduces a language-independent methodology for identifying similar contexts based on lexical features. It explores the use of first and second order co-occurrence vectors for representing contexts, and introduces dimensionality reduction methods that lower the noise and computational cost associated with these large feature spaces. A number of different clustering methods are discussed, as are various methods of evaluating the quality of the clustering results. Finally, the tutorial explores methods of automatically generating descriptive labels for clusters.
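To make the difference between first and second order representations concrete, here is a minimal sketch in Python. It is not code from the tutorial or from SenseClusters. It assumes whitespace tokenization, counts two words as co-occurring when they appear in the same training context, and builds a second order vector by averaging the co-occurrence vectors of the words in a context, which is one common formulation. The function names and the toy data are hypothetical, and dimensionality reduction and clustering are left out.

```python
from collections import Counter, defaultdict
from math import sqrt


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return text.lower().split()


def cooccurrence(contexts):
    """Word-by-word counts: two words co-occur if they share a context."""
    table = defaultdict(Counter)
    for ctx in contexts:
        words = set(tokenize(ctx))
        for w in words:
            for v in words:
                if v != w:
                    table[w][v] += 1
    return table


def first_order(context):
    """First order vector: counts of the words found in the context itself."""
    return Counter(tokenize(context))


def second_order(context, table):
    """Second order vector: average of the co-occurrence vectors of the
    context's words (words unseen in training are skipped)."""
    words = [w for w in tokenize(context) if w in table]
    total = Counter()
    for w in words:
        total.update(table[w])
    n = len(words) or 1
    return {k: v / n for k, v in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical toy data, for illustration only.
training = [
    "bank loan money",
    "bank deposit money",
    "river bank water",
    "river shore water",
]
table = cooccurrence(training)

ctx_a = "loan approved"
ctx_b = "deposit received"

print("first order :", cosine(first_order(ctx_a), first_order(ctx_b)))
print("second order:", cosine(second_order(ctx_a, table),
                              second_order(ctx_b, table)))
```

The two test contexts share no words, so their first order vectors have nothing in common. Their second order vectors overlap because "loan" and "deposit" both co-occur with "bank" and "money" in the training contexts. Contexts can be similar in this indirect way without sharing any surface words.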
The original tutorial also included a hands-on option for attendees with laptop computers (this portion is not on the video). Participants were given a bootable Knoppix CD that let them experiment with many of these ideas and applications using the SenseClusters package: http://senseclusters.sourceforge.net
For further reading, publications related to the tutorial content can be found at the following URL: