Words have many meanings and meanings can be expressed by many words. That is what is being studied in this thesis. People are very good at interpreting language, while computers have more difficulty with this.
To make computers better at interpreting language, researchers have opted to cut the complex problem into pieces. An example of this is the Word Sense Disambiguation task, of which the purpose is to indicate what a word means in a particular sentence, e.g., what means the word horse in the sentence with that move of the horse, the opponent is checkmate. But many more tasks have been devised. Consider, for example, interpreting proper names, time expressions or indicating which expression is about the same entity, e.g., that Jan and he refer to the same person in the sentence Jan goes home, and he goes to eat a cookie.
In this project, we investigate the Word Sense Disambiguation task. The task is one of the oldest in the field, and many approaches have been tried to solve the task. We wonder why this is the case and what we can learn from this.
The following publications are the main publications from the ULM project 1:
Minh Le, Marten Postma, Jacopo Urbani, and Piek Vossen. 2018. A Deep Dive into Word Sense Disambiguation with LSTM. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)
Marten Postma, Ruben Izquierdo, and Piek Vossen. 2016. More is not always better: balancing sense distributions for all-words Word Sense Disambiguation. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016): Technical Papers
Filip Ilievski, Marten Postma, and Piek Vossen. 2016. Semantic overfitting: what ‘world’ do we consider when evaluating disambiguation of text? In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016): Technical Papers.
Marten Postma, Ruben Izquierdo, Eneko Agirre, German Rigau, and Piek Vossen. 2016. Addressing the MFS Bias in WSD Systems. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
In order to get a better understanding of the ambiguity of natural language, several resources have been created within the project:
- WordNetMapper: This repo provides the possibility to map between lexical keys | offsets | ilidefs from one wordnet version to the other [“16″,”17″,”171″,”20″,”21″,”30”]. It makes use of the index.sense files from WordNet (http://wordnet.princeton.edu/) and the automatically generated mappings between WordNet offsets (http://nlp.lsi.upc.edu/tools/download-map.php)
- Semantic_class_manager: python module for accessing to diverse semantic classes: BLC, WordNet Domains and SuperSense
- Sval_systems: output from participating systems on WSD all-words tasks in a common format