Similarity, co-occurrence, functional relation, part-whole relation, subcategorization, what else?

In word sense disambiguation and named-entity disambiguation, an important assumption is that a document consists of related concepts and entities.

There are millions of concepts and entities, so what makes some related but not others? This question is difficult and I don’t have a definitive answer. But listing some classes of relatedness is a good start.

Similarity is often considered a special case of relatedness. The terminology is confusing here, as "topical similarity" and "domain similarity" have been used in place of relatedness (see Hill et al. 2014). I would like to talk about the similarity that holds between bicycle and scooter, president and director, Cristiano Ronaldo and Zinedine Zidane. As you may readily see, the two things or persons in each pair are related by virtue of their many shared characteristics. It is not just the use of bicycles and scooters that makes them similar; they also resemble each other in appearance, structure, material, mechanism… Football fans may disagree on the similarity of CR7 and Zizou, but compare it with the similarity between, for example, Cristiano Ronaldo and Barack Obama, and you will see that the two football players resemble each other more than they resemble most things and people in the world.

Co-occurrence is the favorite kind of relatedness among computational linguists. Here, two things are related because they happen to come together, either in the mind or in the world, although researchers mostly extract it from text. Of course, things don’t just happen to come together; they do so for a reason. But we hardly care: all that matters is numbers and techniques like PPMI, t-test, SVD, cosine similarity, etc. Criticism of statistical techniques in general is usually silenced on the grounds that they give good results. For the same reason, I think, they are here to stay, although they are ugly and give little insight, if any.
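To make the "numbers" a bit more concrete, here is a minimal sketch of how co-occurrence relatedness might be computed with PPMI and cosine similarity. The toy corpus, the sentence-level co-occurrence window, and the function names (`cooccurrence_counts`, `ppmi_vectors`, `cosine`) are all my own assumptions for illustration, not a reference implementation of any particular system.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus, invented purely for illustration (an assumption, not real data).
sentences = [
    ["bicycle", "scooter", "wheel", "ride"],
    ["bicycle", "wheel", "ride", "helmet"],
    ["scooter", "wheel", "ride"],
    ["president", "director", "meeting"],
    ["president", "meeting", "vote"],
    ["director", "meeting", "budget"],
]


def cooccurrence_counts(sentences):
    """Count, per word and per ordered word pair, the sentences they occur in."""
    pair_counts = Counter()
    word_counts = Counter()
    for sent in sentences:
        vocab = sorted(set(sent))  # each word counts once per sentence
        word_counts.update(vocab)
        for a, b in combinations(vocab, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return pair_counts, word_counts


def ppmi_vectors(sentences):
    """Build sparse PPMI vectors: word -> {context word: positive PMI}."""
    pair_counts, word_counts = cooccurrence_counts(sentences)
    n = len(sentences)
    vectors = {w: {} for w in word_counts}
    for (a, b), c in pair_counts.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), probabilities over sentences.
        pmi = math.log2((c / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
        if pmi > 0:  # PPMI keeps only the positive values
            vectors[a][b] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = ppmi_vectors(sentences)
print("bicycle ~ scooter:  ", cosine(vectors["bicycle"], vectors["scooter"]))
print("bicycle ~ president:", cosine(vectors["bicycle"], vectors["president"]))
```

In this toy setting, bicycle and scooter share contexts (wheel, ride) while bicycle and president share none, so the first score comes out higher. Note that the code never asks why the words come together, which is exactly the point made above.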

Functional relation (sit–chair), part-whole relation (car–tire), and subcategorization (gravy–boat) are some of the many relations that can hold between two concepts. One paper lists 15-ish types of relations, many of which can be divided further. If we take into account ad-hoc relations such as authorship (J. K. Rowling–Harry Potter), location (Anne Frank–Amsterdam), marriage (David–Victoria Beckham)… there can possibly be as many types of relation as there are types of human activities, and people keep creating new ones.

While these relations don’t necessarily coincide with co-occurrence, it is very hard to quantify their added value. It might very well be the case that a naive NLP system equipped with only co-occurrence information achieves 90% accuracy and all the other information improves it to 91%. One unfortunate trait of modern NLP is that if the numbers (accuracy, F1, BLEU, …) are not good, the research is not valued.

In short, similarity and co-occurrence are the most important kinds of relatedness in NLP. While there are many others, they will attract little attention until their usefulness is proven.
