2ND SPINOZA WORKSHOP: “LOOKING AT THE LONG TAIL”

*** Slides from the invited talks:

*** Videos of all presentations are available online now!! Check the playlist:

Many natural phenomena show a Zipfian distribution (Newman, 2005), in which a small number of observations are very frequent and there is a very long tail of low-frequency observations. The distribution of symbols in natural language and of their meanings is no exception to Zipf’s law. Within a language community and a period of time, e.g. a generation, a few expressions are extremely frequent and are used in their most frequent meaning, whereas there are many expressions and meanings that we rarely find. This has big consequences for the computer models that are built from these observations: they tend to overfit to the most frequent cases. As long as the tasks on which we test these models show the same distribution, the models perform quite well. However, this favours models that rely on statistically obvious cases and does not require very deep understanding or reasoning. Typically, their performance is high when the test cases match the most frequent cases, and very low when the test cases belong to the long tail. Interestingly enough, people do not suffer from overfitting in the same way as machines do: they handle long-tail phenomena perfectly well.

In this workshop, we want to address the long tail in the semantic processing of text, with a focus on the task of disambiguation. We need to find an incentive for the community to consider the long tail as a first-class citizen, by integrating it into evaluation metrics and/or by representing the long-tail world in the (evaluation) datasets and knowledge bases. This would encourage the development of systems that have a better understanding of natural language and are able to deal with knowledge and data sparseness.
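To make the evaluation argument concrete, the following is a minimal sketch in Python, not one of the workshop scripts. It assumes an idealised Zipf distribution with exponent 1 over 1,000 meanings, and a hypothetical system that is 95% accurate on the 100 most frequent meanings and 30% accurate on the rest; all names and numbers are illustrative assumptions. It compares accuracy averaged over test instances (which follows the frequency distribution) with accuracy averaged over meanings (which weighs the long tail equally).

```python
def zipf_counts(n_types, total):
    """Expected counts under an idealised Zipf law: count(rank r) is proportional to 1/r."""
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def micro_macro_accuracy(counts, per_type_accuracy):
    """Return (instance-weighted accuracy, type-averaged accuracy)."""
    total = sum(counts)
    micro = sum(c * a for c, a in zip(counts, per_type_accuracy)) / total
    macro = sum(per_type_accuracy) / len(per_type_accuracy)
    return micro, macro


# Illustrative assumptions, not measured results.
N_TYPES = 1000   # number of distinct meanings
HEAD = 100       # size of the frequent "head"
counts = zipf_counts(N_TYPES, total=100_000)
accuracy = [0.95 if rank < HEAD else 0.30 for rank in range(N_TYPES)]

head_share = sum(counts[:HEAD]) / sum(counts)
micro, macro = micro_macro_accuracy(counts, accuracy)

print(f"share of test instances in the head: {head_share:.2f}")
print(f"accuracy averaged over instances:    {micro:.2f}")
print(f"accuracy averaged over meanings:     {macro:.2f}")
```

Under these assumptions the head covers most test instances, so the instance-averaged score looks far better than the meaning-averaged one. A metric of the second kind is one possible way to make the long tail visible in evaluation.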

Aim of the Workshop

The goal of the current Long Tail workshop is to discuss, with a selection of experts in the field, the starting points and motivation for a future workshop and task, as well as the design, the data, the evaluation and possible systems. Depending on the outcome of the Spinoza “Long Tail” workshop, we will also consider a special issue journal publication together with the workshop speakers. We plan to build upon the results of this workshop by creating a disambiguation task which has a strong focus on the long tail phenomenon. Such a task requires the design and collection of data that represent the long tail, as well as adequate evaluation methods. We aim to propose this task as a “Long Tail Shared Disambiguation Task” in response to the next call for SemEval-2018 tasks, which is expected in late 2016 or early 2017. In addition, we plan to propose a workshop for ACL 2017, which will be dedicated to interesting the community in the task, discussing the acquisition of the data, and exploring possible systems that would optimize for this task.

Structure of the Workshop

The 2nd Spinoza Workshop “Looking at the Long Tail” will consist of two main sessions:

1. Invited Speakers

The invited speakers come from various fields of expertise:

Natural Language Processing
Information Retrieval
Knowledge Representation and Reasoning
Machine Learning

They will address the phenomena of overfitting and low long-tail performance from their own disciplines.

2. Datathon

The practical session is organized as a datathon and will consist of four tracks:

Datasets and acquisition (Draft programme PDF)
Resources (Draft programme PDF)
Evaluation (Draft programme PDF)
Systems (Draft programme PDF)

As a starting point, we will provide a collection of evaluation methods, datasets, knowledge bases, and system results for the datathon participants to analyse and discuss. We will also provide scripts for data analysis.

Programme

	From	To
Welcome	09:00	09:30

PART I: INVITED SPEAKERS
Introduction by the Organizers	09:30	10:05
Maarten de Rijke: Long tail entities. In document filtering for entities, systems process a time-ordered corpus for documents that are relevant to a set of entities in order to select those documents that contain vital information. State-of-the-art approaches to document filtering for popular entities are entity-dependent: they rely on, and are trained on, the specifics of differentiating features for the entity at hand; moreover, they tend to use so-called extrinsic information such as Wikipedia page views and related entities, which tends to be widely available only for popular entities. Entity-dependent approaches based on extrinsic signals are ill-suited as document filtering methods for long-tail entities, as the information available about such entities is either very sparse or absent altogether. In the talk I discuss a document filtering method for long-tail entities. The method is entity-independent, and thus generalizes to unseen or rarely seen entities, and is based on intrinsic features, i.e., features that are derived from the documents in which the entities are mentioned. We design features aimed at capturing informativeness, entity-saliency, and timeliness. This is joint work with Ridho Reinanda and Edgar Meij.	10:05	10:30
Antal van den Bosch: Natural language statistics, memory, and the analogical proportion. Patiently, a group of theoreticians is awaiting the end of the pointless debate between those who myopically see the language as events needing to be converted to probabilities, and those who bash statistical models of language as scientifically uninteresting. This group is formed by the French proportional analogists (the direct descendants of the Neogrammarians and De Saussure), the Analogical Modeling school of Royal Skousen, and the Dutch-Flemish Memory-based language models school. The data-centricity of their ideas allows for the best of all worlds, which I aim to illustrate in a few case studies.	10:30	10:55
Coffee Break	10:55	11:10
Johan Bos: Meaning Banking and the Long Tail. In this talk I present ongoing work on meaning banking (i.e. building a large corpus of texts annotated with formal meaning representations) and discuss the problem of the long tail. I will do this by showing some well-known phenomena that suffer from the long tail problem and by arguing that distributional contexts might be a way towards a solution.	11:10	11:35
Stefan Schlobach: Knowledge distribution and its Long tails. In my talk, I will present observations about the knowledge distribution and its long tail in data on the web, with respect to factors like language distribution, graph degree, datatypes, meaning of names, etc.	11:35	12:00
Ivan Titov: Exploiting Unlabeled Data in Learning Shallow Semantics. The lack of accurate methods for predicting meaning representations of texts is the key bottleneck for many natural language processing applications such as question answering or text summarization. Although state-of-the-art semantic analyzers work fairly well on closed domains (e.g., interpreting natural language queries to databases), accurately predicting even shallow forms of semantic representations (e.g., underlying predicate-argument structure) for less restricted texts remains a challenge. The reason for the unsatisfactory performance is reliance on supervised learning, with the amounts of annotation required for accurate open-domain parsing exceeding what is practically feasible. Long-tail performance of supervised methods is especially problematic. In my talk, I will discuss approaches which induce semantic representations primarily from unannotated texts. Specifically, I will introduce a new approach, called reconstruction-error minimization (REM) for semantics, and show that it not only leads to state-of-the-art performance in relation discovery and unsupervised semantic role induction but also lets us specialize the semantic representations for (basic forms of) semantic inference.	12:00	12:25
Lunch	12:25	13:25
Keynote by Eduard Hovy: Filling the Long Tail. Most people agree that semantics is what enables people to handle long-tail phenomena of understanding. But what is ‘semantics’ exactly? By its nature, the long tail must consist of many, each quite small, phenomena, activated and used as needed. This implies a quite flexible dynamic architecture and a lot of knowledge of various kinds. In this talk we discuss some of the main types of knowledge required, providing examples of how it can be acquired at large scale.	13:25	14:05

PART II: DATATHON
Introduction to the Datathon tracks	14:05	14:15
Discussion & Brainstorming	14:15	14:45
Hands-on	14:45	16:05
Coffee Break	16:05	16:20
Conclusions	16:20	17:05
Central Presentations	17:05	17:45

WRAP-UP
Towards a SemEval ’18 Shared Task	17:45	18:00
Drinks	18:00	–

Related research

Communication

We encourage you to join and use the following email address for all communication related to the workshop: spinoza-workshop-looking-at-the-long-tail@googlegroups.com

Let us know your impressions, and find out about those of others, on Twitter using the official hashtag of the workshop: #SpinozaLongTail

Directions to the Workshop venue

Room: Forum 2 at Floor 1, Wing D
Main building Vrije Universiteit Amsterdam
De Boelelaan 1105
1081 HV Amsterdam

You might use this link: How to get to Vrije Universiteit Amsterdam

Route: from the entrance of the Main Building of VU University, follow the signs ‘Forum, Wing D’.

Take the stairs (indicated by signpost) to Floor 1.
Forum 2 is the first room on your right.
When you arrive at the Main Building VU and would like assistance, please ask a host in the hall near the main entrance.

Cannot make it?

For people who cannot attend this workshop, we aim to broadcast the event LIVE. Pictures and videos will also be taken during the event and published shortly afterwards.

Acknowledgements

The research for this project was supported by the Netherlands Organisation for Scientific Research (NWO) via the Spinoza fund.

Organizing Committee

Piek Vossen (p.t.j.m.vossen@vu.nl)
Filip Ilievski (f.ilievski@vu.nl)
Marten Postma (m.c.postma@vu.nl)
Selene Kolman (s.j.j.kolman@vu.nl)