ULM-1: The borders of ambiguity

Subproject ULM-1 will explore the closed world of language as a system of word relations. The goal is to more properly define the problem and find the optimal solution given the vast volumes of textual data that are available.

Project nr: 1
Title: The borders of ambiguity

This project starts from the results obtained in the DutchSemCor project, funded by NWO as a ‘Middelgroot investeringsproject’ from 2009 to 2011. In DutchSemCor (DSC), about 3,000 of the most polysemous and frequent words (nouns, verbs and adjectives) were annotated for their meaning in texts taken from SoNaR, CGN and the Dutch Internet. The project resulted in almost 400K manual annotations and millions of automatic annotations. It generated a wealth of interesting data but also raised many new questions. In this project, we will address these questions.

About 5,000 words in Dutch have between 3 and 18 different meanings and belong to the most frequent words of the language. Examples of these words are band (12: band, bond, tire, etc.), slag (18: blow, manner, stroke, victory, etc.), stuk (12: piece, woman, music, etc.). Between 60% and 80% of the words in most texts (depending on the genre and domain) are occurrences of these words. A text of 1,000 words will thus have about 600 occurrences of these words, which represent approximately 3,600 word-meaning combinations (about six candidate meanings per occurrence on average) if we consider all options.
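The arithmetic behind these figures can be made explicit with a minimal sketch. The 60% share and the total of 3,600 come from the paragraph above; the average of six meanings per occurrence is an assumption derived from 3,600 / 600, and the function name is purely illustrative. The last lines show why considering meanings jointly is harder than listing them word by word: the number of joint readings is a product, not a sum.

```python
from math import prod

def ambiguity_load(text_length, polysemous_share, avg_senses):
    """Return (polysemous occurrences, candidate word-meaning pairs)."""
    occurrences = round(text_length * polysemous_share)
    candidates = occurrences * avg_senses
    return occurrences, candidates

# Lower bound of the 60%-80% range; six senses per occurrence is an assumption.
occurrences, candidates = ambiguity_load(1000, 0.60, 6)
print(occurrences, candidates)  # 600 3600

# Sense counts quoted in the text for three example words.
sense_counts = {"band": 12, "slag": 18, "stuk": 12}

# Word by word, the candidates add up; taken jointly, they multiply.
listed_separately = sum(sense_counts.values())
joint_readings = prod(sense_counts.values())
print(listed_separately, joint_readings)  # 42 2592
```

A hypothetical sentence containing all three example words would thus offer 42 candidate meanings when each word is treated separately, but 2,592 joint readings when the meanings are considered in combination.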

The task of having a computer decide on the meaning of a word is called Word Sense Disambiguation (WSD). Current computational approaches to WSD consider the text word by word, solving each problem in isolation from the other problems. In that way, all combinations are considered, while only the neighbouring words are used and not their meanings. As a result, WSD is computationally expensive and slow.
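The word-by-word strategy described above can be sketched as follows. This is a toy illustration, not the DutchSemCor system: the sense labels, the signal words per sense, the window size and the function names are all invented for the example. Each ambiguous word is resolved on its own, by counting overlap between its neighbouring words and a small set of signal words per sense.

```python
# Toy sense inventory: word -> sense label -> context words that signal it.
# These entries are invented for illustration; they are not DutchSemCor data.
SENSES = {
    "band": {
        "band:tire": {"fiets", "auto", "lek", "wiel"},
        "band:music_group": {"muziek", "concert", "zanger", "podium"},
        "band:bond": {"familie", "vriendschap", "hecht"},
    },
    "stuk": {
        "stuk:piece": {"taart", "brood", "klein"},
        "stuk:music": {"muziek", "piano", "componist"},
    },
}

def disambiguate_word(tokens, index, window=4):
    """Pick the sense whose signal words overlap most with the neighbours."""
    senses = SENSES.get(tokens[index])
    if senses is None:
        return None  # word is not in the toy inventory
    start = max(0, index - window)
    context = set(tokens[start:index] + tokens[index + 1:index + 1 + window])
    # Ties are broken by inventory order, which is arbitrary.
    return max(senses, key=lambda sense: len(senses[sense] & context))

def disambiguate_text(tokens, window=4):
    """Resolve every token independently of all other decisions."""
    return [disambiguate_word(tokens, i, window) for i in range(len(tokens))]

tokens = "de band speelde een stuk muziek op het podium".split()
result = disambiguate_text(tokens)
print(result[1], result[4])  # band:music_group stuk:music
```

Note that the decision for band never uses the meaning chosen for stuk, only the surface words around it. With a narrower window (for example `window=3`), muziek falls outside the context of band and the choice becomes an arbitrary tie-break.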

In addition to the exploding combinatorics of ambiguity, each of these words also represents a different type of problem. The ambiguity of band is different from the ambiguity of stuk and asks for a different solution to determine its meaning in text.

A third aspect to consider is the enormous diversity of our language. The manual annotation in DSC focused on finding an equal number of examples for each sense, i.e. a minimum of 25 occurrences per sense. In the case of slag above, the annotators thus had to find 450 occurrences that fit the 18 different meanings well. This strategy revealed a number of insights:

  • A computer trained and tested on the manually chosen examples performs with more than 80% accuracy, which is close to the human interannotator agreement. However, when we take a random sample from SoNaR outside the annotated part, the performance drops by 20 points!

Despite 400K annotations, we are not coping sufficiently with the diversity of language as represented by SoNaR. In addition to the contexts that fit the different meanings very well, there are many other contexts that are still not represented, and these make the system drop 20 points in accuracy.

  • 28% of the annotations could not be found in SoNaR but had to be retrieved from the Dutch Internet.

Despite its diversity and size (500 million tokens), SoNaR is apparently not big and diverse enough to reflect all the different meanings of these common words. These meanings are accepted by all annotators and their supervisors and have been found on the Internet.

From these observations, we can conclude that the ambiguity of natural language is a problem that is not well understood. We do not even have a clear idea about the size and complexity of the problem. In this project, we are going to investigate the problem systematically by determining the relation between three variables: Word (W) – Meaning (M) – Context (C):

  • Analyse the domain and genre layers in SoNaR and the annotations obtained from the Internet (28%):
    • define clusters of word tokens in terms of their structural properties:
      • part-of-speech sequences, chunks, dependencies
      • bags of words
      • phrases and idiomatic expressions
    • relate these clusters to domain and genre labels
    • derive sense distributions over these clusters
    • Qualify and quantify the variables: how big is the variation and the universe of contexts
    • Link the ‘clusters’ to the sense annotations and the performance of the WSD systems
  • Define polysemy classes for the ambiguous words:
    • sense-groups
    • semantic and syntactic properties
    • Qualify the words according to their polysemy profile
    • Link polysemy profiles to the sense annotations and the performance of the WSD systems
  • Derive distributional vectors from Lassy-groot and map these onto the annotated SoNaR vectors
  • Create proper evaluation data for Dutch and English that is a representative sample of the universe of contexts and usage
  • Unsupervised acquisition of domain-genre texts that match the domain-genre properties of the annotated texts
  • Create a base-concept classifier instead of a sense-classifier
  • Resolve ambiguity as an overall strategy and not as a word-by-word problem. Use the meaning-to-meaning relations that have been extracted from annotation in the same sentence, paragraph and document.
  • BabelNet for Dutch: linking the Dutch Wordnet to Wikipedia and DBpedia to increase the relations between ambiguous word meanings
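
The idea of resolving ambiguity as an overall strategy, using meaning-to-meaning relations, can be illustrated with a minimal sketch. Everything here is hypothetical: the candidate senses, the relation weights (which stand in for relations extracted from annotations in the same sentence, paragraph or document) and the function names. The sketch scores complete sense assignments rather than individual words; it uses exhaustive search, which is only feasible for a handful of words, so a real system would need a smarter search strategy.

```python
from itertools import combinations, product

# Hypothetical candidate senses per ambiguous word (not the DSC inventory).
CANDIDATES = {
    "band": ["band:tire", "band:music_group"],
    "stuk": ["stuk:piece", "stuk:music"],
    "slag": ["slag:blow", "slag:stroke"],
}

# Hypothetical meaning-to-meaning relation weights, e.g. how often two
# annotated senses were observed together. The values are invented.
RELATED = {
    frozenset({"band:music_group", "stuk:music"}): 3.0,
    frozenset({"band:music_group", "slag:stroke"}): 2.0,
    frozenset({"stuk:music", "slag:stroke"}): 2.0,
    frozenset({"band:tire", "slag:blow"}): 0.5,
}

def score(assignment):
    """Sum the relation weights over all pairs of chosen senses."""
    return sum(
        RELATED.get(frozenset(pair), 0.0)
        for pair in combinations(assignment, 2)
    )

def disambiguate_jointly(words):
    """Choose the full sense assignment with the highest total score."""
    options = [CANDIDATES[word] for word in words]
    best = max(product(*options), key=score)
    return dict(zip(words, best))

result = disambiguate_jointly(["band", "stuk", "slag"])
print(result)
```

In this toy setting the three musical senses reinforce each other, so they are selected together even though no single word is decided in isolation.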

The project will be carried out for Dutch and for English. The PhD will focus on theoretical aspects of polysemy and contexts and will derive a lexical representation from the corpus data to formally represent the relation between word, meaning and context. The PostDoc will implement the WSD systems and the unsupervised acquisition software, and will also supervise the PhD.
