This document describes several fundamental concepts related to sense annotations of documents.
Coarse versus Fine Sense Granularity
The generated output provides senses at two granularities: coarse and fine. A coarse sense may contain several fine senses, e.g. the noun and the verb related to the same concept, two senses of a noun that are strongly semantically related, or several named entities that are very similar (e.g. several songs with the same name). An application (e.g. a sense matching algorithm) may use both granularities.
A particular fine sense for a particular lemma is called a “sensekey”. In Idilia’s semantic knowledge base, the Language Graph, sensekeys are grouped together into “synsets”, each shown as one rectangular box when you’re looking at the Language Graph (e.g. bank/N1 is one sensekey and depository_financial_institution/N1 is another sensekey in the same synset). A search application might index the information by sensekey (rather than by synset), since two sensekeys aren’t necessarily equivalent. Note that for named entities, alternative forms (e.g. “Hyundai Elantra” vs. “Elantra”) may be represented as separate sensekeys.
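As a minimal sketch of indexing by sensekey rather than by synset, the following Python builds an inverted index keyed by sensekey and keeps synset expansion as an explicit, optional step at query time. The synset identifier `synset_0001`, the document IDs, and all function names are hypothetical; none of them come from Idilia's output format.

```python
from collections import defaultdict

# Hypothetical mapping from sensekey to synset id (illustrative only).
SYNSET_OF = {
    "bank/N1": "synset_0001",
    "depository_financial_institution/N1": "synset_0001",
}

def build_index(docs):
    """Map each sensekey to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, sensekeys in docs.items():
        for key in sensekeys:
            index[key].add(doc_id)
    return index

def lookup(index, sensekey, expand_synset=False):
    """Return matching doc ids; optionally widen the match to the whole synset."""
    keys = {sensekey}
    if expand_synset and sensekey in SYNSET_OF:
        target = SYNSET_OF[sensekey]
        keys |= {k for k, s in SYNSET_OF.items() if s == target}
    hits = set()
    for k in keys:
        hits |= index.get(k, set())
    return hits

# Assumed sample data: each document is a list of already-disambiguated sensekeys.
docs = {
    "doc1": ["bank/N1"],
    "doc2": ["depository_financial_institution/N1"],
}
index = build_index(docs)
exact = lookup(index, "bank/N1")                         # sensekey match only
widened = lookup(index, "bank/N1", expand_synset=True)   # any sensekey in the synset
```

Keeping the index at sensekey granularity preserves the distinction between sensekeys while still allowing an application to opt into synset-level matching when that is appropriate.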
The fine senses in the Language Graph often have corresponding entries in well-known non-Idilia databases such as Wikipedia or IMDB. For each fine sense found in the result, the textual description of the sense (i.e. its gloss) and its known external references are provided.
The sense annotated output formats are designed to represent the inherent ambiguity in linguistic processing. They provide for both sense ambiguity and “lexical path ambiguity” (i.e. ambiguity around whether or not to form multi-word expressions). Note that Idilia’s system will provide ambiguous (i.e. multiple) sense or lexical path answers when it is uncertain. A path will contain either no semantic information (e.g. for punctuation, closed class words) or one or more coarse or fine senses. Each ambiguous element (e.g. senses, parts of speech) has an associated probability and confidence. In addition, there is an overall confidence value for each path.
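The structure described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class names, field names, sensekeys, and numeric values below are invented for illustration and do not reflect the actual output format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sense:
    sensekey: str        # e.g. "bank/N1"
    probability: float   # probability of this sense among the alternatives
    confidence: float    # confidence attached to this piece of information

@dataclass
class Element:
    surface: str
    # Empty for punctuation and closed-class words; several entries when ambiguous.
    senses: List[Sense] = field(default_factory=list)

@dataclass
class Path:
    elements: List[Element]
    confidence: float    # overall confidence for this lexical path

def most_confident(paths):
    """Pick the lexical path with the highest overall confidence."""
    return max(paths, key=lambda p: p.confidence)

# Two hypothetical lexical paths for the text "coffee table":
# one multi-word expression versus two separate words.
mwe = Path([Element("coffee table", [Sense("coffee_table/N1", 1.0, 0.9)])], confidence=0.8)
split = Path(
    [
        Element("coffee", [Sense("coffee/N1", 1.0, 0.7)]),
        Element("table", [Sense("table/N1", 0.6, 0.5), Sense("table/N2", 0.4, 0.5)]),
    ],
    confidence=0.2,
)
chosen = most_confident([mwe, split])
```

An application does not have to commit to a single path; it could equally keep every path whose confidence clears a threshold.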
There are several confidence values, and typically one is attached to each piece of information. The expectation is for the application to use a piece of information only if its confidence is high enough. When the confidence is high, using the sense information in an index is appropriate. When the confidence is too low, the indexing should probably fall back on word matching (i.e. equivalent to matching all possible senses). Different confidence thresholds can be used depending on the application’s objective (e.g. improving precision vs. improving recall). Such thresholds could even be adjusted dynamically within an application (e.g. more relaxed thresholds if very few results are found, and much tighter thresholds when many possible results are found).
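The fallback and the dynamic adjustment described above might look like the following sketch. The function names, the threshold values (0.5, 0.8, 0.95), and the result-count cutoffs are all assumptions chosen for illustration; suitable values depend on the application.

```python
def index_term(surface, sensekey, confidence, threshold=0.8):
    """Index by sense when it is trusted; otherwise fall back to the plain word."""
    if sensekey is not None and confidence >= threshold:
        return ("sense", sensekey)
    return ("word", surface.lower())

def adaptive_threshold(result_count, few=5, many=1000,
                       relaxed=0.5, default=0.8, tight=0.95):
    """Relax the threshold when few results are found, tighten it when many are."""
    if result_count < few:
        return relaxed
    if result_count > many:
        return tight
    return default

trusted = index_term("bank", "bank/N1", confidence=0.9)     # uses the sense
fallback = index_term("Bank", "bank/N1", confidence=0.4)    # falls back to the word
```

A precision-oriented application would raise `threshold`, while a recall-oriented one would lower it.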
The named entity (NE) recognizer uses several strategies to eliminate senses that match the surface text but are not related to the document. Nevertheless, the system sometimes ends up with several named entities with the same “subtype” (i.e. named entities that have the same narrow set of properties, such as several songs with the same title or two persons with the same name). For WSD purposes, they are all replaced with a single dynamic NE encompassing all the remaining senses. These remaining senses are available as the “collapsed” members of the collapsed dynamic NE. An application such as an index might include them because they were the remaining candidates. The index should also include the dynamic NE. A collapsed sense is typically interpreted as “any of the specific collapsed senses and any unknown sense with the same properties.”
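One way an index could include both the dynamic NE and its collapsed members is sketched below. The identifiers are placeholders invented for this example (the real sensekey format for dynamic NEs is not specified here), and `index_keys` is a hypothetical helper.

```python
def index_keys(dynamic_ne, collapsed_members):
    """Return the dynamic NE key followed by its collapsed members, without duplicates."""
    keys = [dynamic_ne]
    for member in collapsed_members:
        if member not in keys:
            keys.append(member)
    return keys

# Placeholder identifiers: one dynamic NE standing in for two same-titled songs.
keys = index_keys(
    "song_yesterday/DYN",
    ["yesterday_song_a/N1", "yesterday_song_b/N1"],
)
```

With these keys in the index, a query for either specific song still finds the document, and a query for the dynamic NE matches it as "one of these, or an unknown song with the same properties".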
Constituent senses are the senses that are combined to form a long compositional dynamic NE. For example, the NE “Union of SteelWorkers” would be composed of the constituent sensekeys “union/N1, steel_worker/N1”. Applications should consider both the constituent senses and the dynamic NE: both are useful for matching, and the constituents explain the compositional semantics of multi-word expressions. This also sidesteps the endless debate over whether a particular set of words should be grouped together; grouping is obviously right for “coffee table”, but less obviously so for “desk lamp”.
The search synonyms are “equivalent” senses generated dynamically by altering the matched surface text. Some of these changes are simple (e.g. reversing first and last name for a person) and some are much more complicated. Synonyms are generated for dynamic NEs as well as Language Graph NEs. Many Language Graph NEs already have alternate sensekeys added during the population of the Language Graph. However, the additional search synonym forms are less reliable, and are used only in a local document context.
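The simplest case mentioned above, reversing first and last name for a person, can be sketched as follows. This is not Idilia's generation logic: the function name is hypothetical, and the choice of output forms and the restriction to two-token names are assumptions made to keep the example small.

```python
def name_synonyms(surface):
    """Generate reversed-name variants for a simple two-token person name."""
    parts = surface.split()
    if len(parts) != 2:
        # Anything other than "First Last" is out of scope for this sketch.
        return []
    first, last = parts
    return [f"{last}, {first}", f"{last} {first}"]

variants = name_synonyms("John Smith")
```

Because such generated forms are less reliable than sensekeys stored in the Language Graph, an application would apply them only within the document where the original surface text was matched.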
The output includes the grammatical dependencies extracted from the parse of each sentence. For “bag of words” type text (e.g. queries), dependencies are normally less reliable; for documents (or queries with good syntax), they are more reliable.