Recent incidents show that digital identity can easily be confused, stolen, or abused. Consider, for instance, the case of the former British politician Lord Alistair McAlpine, who was falsely accused of child abuse in November 2012 because of mistaken identity.
People, organizations, and other entities are described in many textual documents, including news articles, encyclopedias, and social media. Hence, establishing identity is increasingly a practical matter, and there is an urgent need for robust computational tools that can correctly determine the identity of entities in text.
Today, it is well understood how to determine the identity of entities with low ambiguity and high popularity/frequency in communication (the head; e.g., Trump), as shown by the high accuracy scores in the standard Natural Language Processing (NLP) task of Entity Linking. It is unclear, however, how to interpret long-tail entities (e.g., Filip): each is different and potentially very ambiguous, occurs with low frequency/popularity, and is covered by scarce knowledge.
This research investigates how the performance of NLP techniques for establishing the identity of long-tail cases can be improved through the use of background knowledge. It focuses on five aspects of this challenge: describing and defining the challenge, analyzing its evaluation, improving its evaluation, enabling access to more knowledge, and building knowledge-intensive systems.
This thesis demonstrates that interpreting long-tail entities in text is an under-addressed and multifaceted challenge. Better evaluation and more extensive use of knowledge are promising directions forward.