Natural Language Processing relies heavily on data for supervised and unsupervised learning. The frequency of forms and relations plays a major role both in the derived models and in the evaluation of these models. But event and entity instances in the world have no frequency; they simply exist for some time. Frequency in data comes from our communication about these instances. Due to our biased interest, the expressions we use to refer to events and entities have a strong frequency profile, following a roughly Zipfian distribution (Zipf, 1949): a small number of very frequent observations and a very long tail of less frequent ones. Since our NLP datasets sample texts but do not sample the world, they are no exception to Zipf’s law. As a result, the salient interpretations dominate our test data, which allows NLP methods to exploit this redundancy by ignoring hard cases and relying only on straightforward ones. The same skew holds for large commonsense knowledge repositories that encode explicit semantics, such as DBpedia, Wikidata, Freebase, BabelNet, and WordNet. Research and practice have shown these knowledge bases to be of immense value in NLP (IBM Watson, Moro et al., 2014, inter alia), but they similarly tend to focus on the most “popular” classes, instances, and relations between them.
This bias towards the head favors models of language that rely on statistically obvious cases, which do not require very deep understanding or reasoning. Typically, their performance is high when the test cases match the most frequent cases, and very low when they belong to the long tail (Postma et al., 2016). Interestingly enough, humans do not suffer from overfitting in the same way as machines do: they handle long tail phenomena without difficulty. Little attention has been devoted to how systems should solve interpretation tasks for local and perhaps unknown event and entity instances, which are described in sparse documents but might not be observed in any training set or knowledge base. Potentially, this would require new representations and deeper processing than the ones that work well on the head, involving reading between the lines, e.g. textual entailment and robust (common sense) reasoning. How can systems gather the necessary knowledge to correctly interpret low-frequency long tail entities and events? How can systems establish identity and reference across text sources that are poorly represented in, or absent from, knowledge resources? How should systems exploit popular data to correctly interpret low-frequency data?
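The gap between head and tail performance described above can be made visible by scoring the two groups separately instead of reporting a single accuracy figure. The following is a minimal sketch, not a method taken from the cited work: the function name `head_tail_accuracy`, the `head_size` cut-off, and the toy sense labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def head_tail_accuracy(train_labels, gold, predicted, head_size=1):
    """Return accuracy separately for head and tail test items.

    Assumption for this sketch: the head is the `head_size` most frequent
    labels in the training data; every other gold label counts as tail.
    """
    head = {label for label, _ in Counter(train_labels).most_common(head_size)}
    hits = {"head": 0, "tail": 0}
    totals = {"head": 0, "tail": 0}
    for g, p in zip(gold, predicted):
        bucket = "head" if g in head else "tail"
        totals[bucket] += 1
        hits[bucket] += int(g == p)
    # None marks a bucket with no test items, to avoid dividing by zero.
    return {b: (hits[b] / totals[b] if totals[b] else None) for b in totals}


# Toy data invented for illustration: a most-frequent-sense baseline
# that always predicts the dominant training label.
train = ["bank/finance"] * 8 + ["bank/river"] * 2
gold = ["bank/finance", "bank/finance", "bank/finance", "bank/river"]
pred = ["bank/finance"] * 4

scores = head_tail_accuracy(train, gold, pred)
overall = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

On this toy input the baseline is correct on three of four items overall, yet it gets every head item right and every tail item wrong, which is the pattern a single aggregate score hides.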
Various concrete aspects of the long tail have become of interest in recent years. Vagueness and ambiguity, while long recognized as features of natural language, are long tail phenomena for which we do not have meaningful data for training and evaluation. Time, location, and other modalities, as well as overly strict semantic targets, introduce issues of granularity into the NLP problem that are typically ignored in evaluation. Many researchers believe relational semantics to be important, yet the amount of data for the less frequent relations is staggeringly small. Notably, the data scarcity aspect of the long tail has been addressed for several NLP tasks, such as relation detection, entity typing, document filtering, enrichment, discovery of emerging entities, and open information extraction.
Through this workshop, we want to gather NLP and Knowledge Representation researchers to share ideas about the long tail for the semantic processing of text with a special focus on the task of disambiguation and reference. We aim at a critical discussion of the complexity and relevance of long tail phenomena. We hope to find an incentive for the community to consider the long tail as a first-class citizen in NLP tasks. We hope to encourage new approaches, which would ideally be able to deal with knowledge and data sparseness, as well as contextual ambiguity with respect to aspects of time, location, and topic.
We have been engaged in several complementary efforts to make the community aware of the relevance and complexity of this problem. In June 2016, we organized the workshop Looking at the Long Tail at Vrije Universiteit Amsterdam, which brought together experts from various fields: NLP, Information Retrieval, Machine Learning, Knowledge Representation and Reasoning. In addition, we have quantified the semantic overfitting towards the head in disambiguation and reference datasets (Ilievski et al., 2016). Based on these observations, we proposed an approach to move away from overfitting to the head towards interpretation of long tail meanings (Postma et al., 2016). We are currently organizing a SemEval-2018 task based on this proposal; for more information, please check the task website.
Long tail phenomena are novel and very challenging AI- and NLP-wide problems that should be the focus of a global audience interested in semantic NLP. We consider the IJCNLP workshop a valuable venue for this purpose.
We want to gain insights with respect to how to address the semantic long tail in NLP systems, eventually to extract detailed knowledge on event and entity instances from unstructured text. The following topics can be used as a guide for submissions.
We believe that these topics are crucial to improve the state-of-the-art in NLP with respect to long tail phenomena, which in turn should have a major impact on overall language understanding. We are interested in systems that reveal interesting insights for addressing long tail aspects, even if their overall performance is lower than the state-of-the-art.
The SLT-1 workshop is co-located with the IJCNLP conference, which in 2017 will be held from November 27th until December 1st in Taipei, Taiwan.
Workshop Website and First Call for Papers Ready: May 1, 2017
Second Call for Papers Sent Out: July 5, 2017
Third Call for Papers Sent Out: August 5, 2017
Paper Submission Deadline: September 5, 2017
Notification of Acceptance: September 30, 2017
Camera-Ready Deadline: October 10, 2017
Workshop day: December 1, 2017
Martha Palmer (University of Colorado Boulder)
Chris Welty (Google)
Eduard Hovy (Carnegie Mellon University)
Ivan Titov (University of Edinburgh)
Philipp Cimiano (University of Bielefeld)
Frank van Harmelen (VU Amsterdam)
Eneko Agirre (University of the Basque Country)
Key-Sun Choi (Korea Advanced Institute of Science and Technology)
Agata Cybulska (Oracle)
Anders Søgaard (University of Copenhagen)
Andre Freitas (University of Passau)
Anselmo Peñas (UNED Madrid)
Antske Fokkens (VU Amsterdam)
Barbara Plank (University of Groningen)
Brian Davis (National University of Ireland Galway)
Dirk Hovy (University of Copenhagen)
Giuseppe Rizzo (ISMB, Turin)
Jacopo Urbani (VU Amsterdam/Max Planck Institute for Informatics)
Johan Bos (University of Groningen)
Lea Frermann (University of Edinburgh)
Leon Derczynski (University of Sheffield)
Karthik Narasimhan (Massachusetts Institute of Technology)
Marco Rospocher (Fondazione Bruno Kessler, Trento)
Marieke van Erp (VU Amsterdam)
Pradeep Dasigi (Carnegie Mellon University)
Ridho Reinanda (University of Amsterdam)
Sabine Schulte im Walde (University of Stuttgart)
Sara Tonelli (Fondazione Bruno Kessler, Trento)
Sebastian Padó (University of Stuttgart)
Stephan Oepen (University of Oslo)
Sujay Kumar Jauhar (Carnegie Mellon University)
Tim Baldwin (University of Melbourne)
Tommaso Caselli (VU Amsterdam)
Introduction and the perspective of the organizers
Presentation of accepted submissions