ULM-4: A quantum model of text understanding
Subproject ULM-4 is a technical project that investigates a new model of natural language processing. Current approaches are based on a pipeline architecture, in which the complete problem is divided into a series of smaller, isolated tasks, e.g. tokenization, part-of-speech tagging, lemmatisation, syntactic parsing, entity recognition, and word-sense detection. In the new model, none of these tasks is decisive on its own, and the final interpretation is left to higher-order semantic and contextual models.
This project also builds on the findings of the previous KYOTO project and of ongoing European (OpeNER and NewsReader) and national (BiographyNet) projects carried out at the VU University Amsterdam. The goal is to develop a new model of natural language processing in which text is interpreted in a combined top-down and bottom-up process.
Current approaches to Natural Language Processing are based on a pipeline architecture, in which the complete problem is divided into a series of smaller tasks, e.g. tokenization, part-of-speech tagging, lemmatisation, syntactic parsing, entity recognition, word-sense detection, and the detection of relations and opinions. Specialized modules solve each subtask and pass their output to the next task, so that low-level analyses are used to build up higher-level analyses. One of the problems with this approach is that each module makes errors that affect the performance of the next module. To correct these errors, low-level modules would need access to the richer knowledge available in later modules, which is not possible in a pipeline architecture: the process runs in one direction only and there is no back-tracking. Previous research has shown that the second- or third-best choices of submodules can lead to better results, based on their fit to a higher semantic interpretation model.
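The problem described above can be illustrated with a toy two-stage example (all names, tags and probabilities here are hypothetical, not project code): a tagger's single best local choice is wrong, but its second-best choice would let a later semantic stage succeed.

```python
# Hypothetical n-best part-of-speech hypotheses for the ambiguous token "saw".
pos_hypotheses = [("verb", 0.6), ("noun", 0.4)]

def semantic_fit(tag):
    """Toy higher-level model: in the context 'the old saw' (a proverb),
    only a noun reading yields a coherent interpretation."""
    return 0.9 if tag == "noun" else 0.1

# Classic pipeline: commit to the single best local choice and pass it on.
pipeline_choice = max(pos_hypotheses, key=lambda h: h[1])[0]  # picks "verb"

# Keeping the options open instead allows a top-down rescoring step,
# in which the second-best local choice wins on semantic fit.
rescored = [(tag, p_local * semantic_fit(tag)) for tag, p_local in pos_hypotheses]
joint_choice = max(rescored, key=lambda h: h[1])[0]  # picks "noun"
```

The pipeline commits to the wrong reading and can never recover; the rescoring variant recovers it from the retained second-best hypothesis.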
Two PhD students will investigate the possibilities of a new architecture in which all options of the lower-level modules are left open and final decisions are taken from a top-down perspective, given knowledge-rich background models and contexts. The interpretation of each single module, and of each piece of information within that model, thus depends on the interpretation mechanism that is applied. Words and tokens in text therefore do not get their meaning solely from their direct local context (e.g. surrounding words) but predominantly from higher interpretative functions. This approach has important consequences for the representation of our analyses and for our ways of reasoning and deciding. Instead of passing on only the single best choice at each step, all possible interpretations are left open, leading to an exploding accumulation of choices and options with probabilities. Furthermore, the interpretation of a text, represented as such an accumulation of choices, needs to be made sensitive to its fit with the background knowledge, world views, perspectives and explanatory structures defined in the other projects.
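A minimal sketch of what "leaving all options open" means, with invented senses and probabilities: each token keeps its n-best sense hypotheses, the full hypothesis space is the cross-product of these (hence the explosion), and a toy background model decides among the cumulated choices.

```python
from itertools import product

# Hypothetical per-token word-sense hypotheses with local probabilities.
senses = {
    "bank":   [("river_bank", 0.5), ("finance_bank", 0.5)],
    "teller": [("storyteller", 0.3), ("bank_clerk", 0.7)],
}

# Toy background model: a financial context boosts coherent financial readings.
FINANCIAL = {"finance_bank", "bank_clerk"}

def context_fit(path):
    return 2.0 if all(sense in FINANCIAL for sense, _ in path) else 1.0

# Enumerate the full hypothesis space (exponential in sentence length) ...
paths = list(product(*senses.values()))

# ... and let the top-down model score each cumulated choice.
def score(path):
    local = 1.0
    for _, p in path:
        local *= p
    return local * context_fit(path)

best = max(paths, key=score)  # the coherent financial reading wins
```

With realistic sentence lengths and n-best lists at several levels, this space cannot be enumerated exhaustively, which is why the project needs more efficient representations of the options.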
The first PhD student will model ambiguities at all levels of processing and will experiment with different reasoning and decision mechanisms to resolve these ambiguities given contextual background models. This is considered a more natural way of signal processing for people, who likewise combine bottom-up pattern matching with higher-order interpretative functions. Similarly, people do not notice errors and omissions at lower levels as long as these do not interfere with the higher functions. The second PhD student will formally define the background models and the notion of context as a higher conceptual structure for interpretation and reasoning. These models are based on formal Semantic Web representations (such as RDF) and exploit the data available in the Semantic Web.
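The kind of background model meant here can be sketched, under assumptions of our own, as a small set of Semantic-Web-style (subject, predicate, object) triples that a decision mechanism queries to check whether a candidate interpretation is consistent with world knowledge. The triples and helper below are illustrative, not the project's actual RDF vocabulary.

```python
# Hypothetical background model as RDF-style triples.
background = {
    ("Amsterdam", "type", "City"),
    ("City", "subClassOf", "Location"),
    ("VU_University", "locatedIn", "Amsterdam"),
}

def types_of(entity):
    """Collect an entity's direct types plus one step of subClassOf inheritance."""
    direct = {o for s, p, o in background if s == entity and p == "type"}
    inherited = {o for t in direct
                 for s, p, o in background if s == t and p == "subClassOf"}
    return direct | inherited

# A candidate reading of "Amsterdam" as a Location fits the background model;
# a reading of it as a Person does not, and can be discarded top-down.
fits_location = "Location" in types_of("Amsterdam")  # True
fits_person = "Person" in types_of("Amsterdam")      # False
```

In a real system the lookup would run against Semantic Web data via SPARQL rather than an in-memory set, but the role of the background model in filtering interpretations is the same.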
Ultimately, this new approach to assigning meaning to language tokens should better mimic human intuitions about language. Humans do not experience the combinatorial explosion of choices that computer systems now face; they are hardly aware of the interpretations that do not match their interpretation model. We expect that computer systems that defer decisions to these interpretation functions will be able to override erroneous analyses by the low-level modules that solve subtasks. The interpretation of these smaller units is thus determined by the overall interpretation, analogous to a quantum approach to establishing values. This will result in robust ways of reasoning over representations of textual data that can deal with extremely noisy and sparse information.
The project will build on the formal data structures for provenance that are currently being defined in BiographyNet and NewsReader. This provenance model traces all interpretation steps, from each module's output up to a semantic representation that is independent of the text. More efficient ways of representing all options in complex knowledge and data graphs need to be developed, as well as ways of efficiently applying background and context models. Evaluations will be carried out on big data sets developed in the NewsReader project, in comparison to traditional pipeline architectures.
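The provenance idea can be sketched as follows (a minimal illustration of our own, not the BiographyNet or NewsReader data model): every retained hypothesis carries the chain of module decisions that produced it, so any final semantic representation remains traceable back through each interpretation step.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    value: str
    prob: float
    provenance: list = field(default_factory=list)  # (module, choice, prob) steps

def apply_module(module_name, hyp, choice, prob):
    """Derive a new hypothesis from an existing one, extending its provenance."""
    return Hypothesis(
        value=choice,
        prob=hyp.prob * prob,
        provenance=hyp.provenance + [(module_name, choice, prob)],
    )

# Hypothetical chain of two interpretation steps over one token.
token = Hypothesis("saw", 1.0)
tagged = apply_module("pos-tagger", token, "saw/NN", 0.4)
sensed = apply_module("wsd", tagged, "saw/NN/tool", 0.8)

# The final reading records every module that contributed to it.
modules = [m for m, _, _ in sensed.provenance]  # ['pos-tagger', 'wsd']
```

A top-down decision mechanism can use such records both to rescore hypotheses and to explain, after the fact, which low-level choices a final interpretation overrode.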