Understanding the Semdoc Format
Overview
The Semdoc (semantic document) format is the annotated output from the sense analysis of a document or a query. The semdoc XML document consists of two main sections:
- A preamble, “sensesInfo”, which describes the sensekeys present in the document.
- One or more document elements, further subdivided into paragraphs (“para”), sentences (“sent”), and finally fragments (“frag”) and their senses (doc → paragraph → sentence → fragment). The “txt” element of the “frag” element corresponds to the matching text in the input document.
If you are not familiar with core concepts such as sensekeys and constituents, please consult Sense Annotation Concepts.
The reference for the Semdoc format is the schema "semdoc.xsd".
Sense Properties
Each document contained in the "semdoc" output (normally only one) includes an element "sensesInfo" of type "SensesInfo". This element contains multiple "sense" elements of type "SenseInfo", one for each sensekey or constituent present in the annotated document. These elements describe the properties of the senses as defined in the Language Graph.
SenseInfo
This element supports the following attributes:
| Attribute | Description |
|---|---|
| fsk | The fine sensekey of this sense. E.g., dog/N1. |
| csk | The coarse sensekey of this sense. E.g., dog/C1. |
| isne | This attribute is present and equal to "1" when the sensekey is a named entity. E.g., New_York/N1 will have this attribute present. |
and the following elements:
| Element | Description |
|---|---|
| desc | Textual description of the sense. E.g., "a member of the genus Canis that has been domesticated by man." |
| constituent | Decomposition of the fine sense into the senses that are aggregated to form it. May occur 0 or more times. E.g., Union_of_Steel_Workers/N1100 will report the constituents steelworker/N1, union/N1, and steel/N1. |
| extRef | Establishes a correspondence between a fine sense defined in Idilia's Language Graph with the same concept defined in another authoritative source such as Wikipedia. May occur 0 or more times. |
| syn | Alternate surface forms for the fine sense which can be useful in search applications. May occur 0 or more times. |
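As a rough illustration of how the attributes and the "desc" element described above might be read, here is a minimal Python sketch using the standard library. The sample XML is an assumption built only from the table entries: the authoritative serialization (including any XML namespace and the shape of "constituent", "extRef", and "syn") is defined in "semdoc.xsd" and is not reproduced here. The function name `index_senses` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, abbreviated sample: only attributes and elements named in the
# tables above are used. The csk of New_York/N1 is omitted because the
# source text does not give it.
SENSES_INFO = """
<sensesInfo>
  <sense fsk="dog/N1" csk="dog/C1">
    <desc>a member of the genus Canis that has been domesticated by man.</desc>
  </sense>
  <sense fsk="New_York/N1" isne="1"/>
</sensesInfo>
"""

def index_senses(senses_info):
    """Map each fine sensekey to the properties found on its sense element."""
    index = {}
    for sense in senses_info.findall("sense"):
        index[sense.get("fsk")] = {
            "csk": sense.get("csk"),
            # isne is only present (and equal to "1") for named entities.
            "is_named_entity": sense.get("isne") == "1",
            "desc": sense.findtext("desc"),
        }
    return index

index = index_senses(ET.fromstring(SENSES_INFO))
```

Keying the index on "fsk" is a design choice for this sketch; it makes it easy to look up the properties of a fine sensekey encountered later in a fragment.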
Annotated Document
The original document is split into multiple fragments ("frag"), where each fragment is associated with a sense tag. The following example was obtained from processing the text: The band played their hit song “Alive”.
<doc len="10">
  <para len="10" so="0">
    <sent len="10" so="0">
      <frag len="1" so="0" sol="1">
        <txt>The </txt>
        <lc lc="det"/>
      </frag>
      <frag cccfmp="0.930" cccmp="0.993" ccfmp="0.886" len="1" so="1">
        <txt>band </txt>
        <cs pb="1.000" pc="1.000" sk="band/C2" so="1">
          <fs pb="1.000" pc="0.988" sk="band/N5" so="1"/>
        </cs>
        <dep c="0.816" dest="played" destLc="verb" role="argument" src="band" srcLc="noun" type="agent"/>
      </frag>
      <frag cccfmp="0.526" cccmp="0.766" ccfmp="0.547" len="1" so="2">
        <txt>played </txt>
        <cs pb="1.000" pc="0.986" sk="play/C3" so="2">
          <fs pb="0.515" pc="0.797" sk="play/V6" so="2"/>
          <fs pb="0.485" pc="0.797" sk="play/V7" so="2"/>
        </cs>
        <dep c="0.816" dest="band" destLc="noun" role="predicate" src="played" srcLc="verb" type="agent"/>
        <dep c="0.428" dest="Alive" destLc="noun" role="predicate" src="played" srcLc="verb" type="theme"/>
      </frag>
      <frag len="1" so="3">
        <txt>their </txt>
        <lc lc="det"/>
      </frag>
      <frag cccfmp="0.907" cccmp="0.866" ccfmp="0.897" len="2" so="4">
        <txt>hit song </txt>
        <cs len="2" pb="1.000" pc="0.927" sk="hit_song/C4" so="4">
          <fs len="2" pb="1.000" pc="0.927" sk="hit_song/N5" so="4"/>
        </cs>
        <dep c="0.692" dest="Alive" destLc="noun" role="modifier" src="hit song" srcLc="noun"/>
      </frag>
      . . .
      <frag cccfmp="0.847" cccmp="0.675" ccfmp="0.787" len="1" so="7">
        <txt>Alive</txt>
        <cs pb="1.000" pc="0.874" sk="Alive/C1705" so="7">
          <fs pb="1.000" pc="0.874" sk="Alive/N1705" so="7">
            <collapsed sk="Alive/N12"/>
            <collapsed sk="Alive/N14"/>
            <collapsed sk="Alive/N15"/>
            <collapsed sk="Alive/N16"/>
            <collapsed sk="Alive/N21"/>
            <collapsed sk="Alive/N75"/>
          </fs>
        </cs>
        <dep c="0.428" dest="played" destLc="verb" role="argument" src="Alive" srcLc="noun" type="theme"/>
        <dep c="0.692" dest="hit song" destLc="noun" role="head" src="Alive" srcLc="noun"/>
      </frag>
      . . .
    </sent>
  </para>
</doc>
The first fragment is a determiner of length one (one token long). Determiners and other closed-class words do not have sense information. The matched text is included in the element “txt”. By convention, the trailing space is included in this element.
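The conventions just described can be sketched in Python with the standard library. The excerpt is taken from the example, with attributes trimmed. The helper names `rebuild_text` and `has_sense` are hypothetical, and the sketch assumes no overlapping alternative fragments were retained (otherwise the concatenation would repeat text) and that the document uses no XML namespace.

```python
import xml.etree.ElementTree as ET

# Excerpt of the example above, trimmed to what this sketch needs.
SEMDOC = """
<doc>
  <para>
    <sent>
      <frag len="1" so="0"><txt>The </txt><lc lc="det"/></frag>
      <frag len="1" so="1"><txt>band </txt>
        <cs pb="1.000" pc="1.000" sk="band/C2" so="1">
          <fs pb="1.000" pc="0.988" sk="band/N5" so="1"/>
        </cs>
      </frag>
    </sent>
  </para>
</doc>
"""

def rebuild_text(root):
    """Concatenate the txt of every fragment, in document order."""
    return "".join(frag.findtext("txt", default="") for frag in root.iter("frag"))

def has_sense(frag):
    """A fragment carries sense information when it has at least one cs child."""
    return frag.find("cs") is not None

root = ET.fromstring(SEMDOC)
text = rebuild_text(root)
flags = [(frag.findtext("txt"), has_sense(frag)) for frag in root.iter("frag")]
```

Because each “txt” keeps its trailing space, plain concatenation is enough to recover the matched text; no separator needs to be inserted between fragments.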
The second fragment is “band”, also of length one. The analysis yielded the coarse sense band/C2 and the fine sense band/N5; the other sense(s) of band/C2 were discarded. The positive confidence for the senses (“ccfmp” for fine senses and “cccmp” for coarse senses) is very high. The probability (“pb”) is 1.000 because that is the only answer present. The word is the agent in a syntactic dependency with the word “played”. For a description of the other attributes of element “frag”, please refer to the schema. Some of these attributes are positional (“so”, “len”) and others relate to the overall confidence for coarse and fine senses.
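A minimal Python sketch of reading the senses and dependencies of this fragment follows. The XML is the “band” fragment from the example. The helper names `read_senses` and `read_deps` are hypothetical, and the numeric attributes are simply parsed as floats; their exact definitions remain those of the schema.

```python
import xml.etree.ElementTree as ET

# The "band" fragment, copied from the example above.
FRAG = """
<frag cccfmp="0.930" cccmp="0.993" ccfmp="0.886" len="1" so="1">
  <txt>band </txt>
  <cs pb="1.000" pc="1.000" sk="band/C2" so="1">
    <fs pb="1.000" pc="0.988" sk="band/N5" so="1"/>
  </cs>
  <dep c="0.816" dest="played" destLc="verb" role="argument" src="band" srcLc="noun" type="agent"/>
</frag>
"""

def read_senses(frag):
    """Return [(coarse sk, [(fine sk, pb), ...]), ...] for one fragment."""
    return [
        (cs.get("sk"), [(fs.get("sk"), float(fs.get("pb"))) for fs in cs.findall("fs")])
        for cs in frag.findall("cs")
    ]

def read_deps(frag):
    """Return (src, type, dest, c) per dep; type is None when the attribute is absent."""
    return [
        (dep.get("src"), dep.get("type"), dep.get("dest"), float(dep.get("c")))
        for dep in frag.findall("dep")
    ]

frag = ET.fromstring(FRAG)
senses = read_senses(frag)
deps = read_deps(frag)
```

Note that some “dep” elements in the example (such as the one on “hit song”) carry no “type” attribute, which is why the sketch tolerates a missing value there.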
The third fragment is “played”. In this example, a single coarse sense (play/C3) was predicted, but this sense includes two possible fine senses (play/V6 and play/V7) which are almost equiprobable.
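When several fine senses remain, a consumer may want to rank them or detect near-ties. The following Python sketch does this for the coarse sense of this fragment, assuming “pb” is the probability attribute. The helpers `rank_fine_senses` and `is_ambiguous`, and the 0.1 margin, are illustrative choices and not part of the format.

```python
import xml.etree.ElementTree as ET

# Coarse sense of the "played" fragment, copied from the example above.
CS = """
<cs pb="1.000" pc="0.986" sk="play/C3" so="2">
  <fs pb="0.515" pc="0.797" sk="play/V6" so="2"/>
  <fs pb="0.485" pc="0.797" sk="play/V7" so="2"/>
</cs>
"""

def rank_fine_senses(cs):
    """Return (fine sk, pb) pairs sorted from most to least probable."""
    fine = [(fs.get("sk"), float(fs.get("pb"))) for fs in cs.findall("fs")]
    return sorted(fine, key=lambda item: item[1], reverse=True)

def is_ambiguous(ranked, margin=0.1):
    """Hypothetical helper: True when the top two senses are within `margin`."""
    return len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin

ranked = rank_fine_senses(ET.fromstring(CS))
ambiguous = is_ambiguous(ranked)
```

In this case the two fine senses differ by only 0.03, so an application might prefer to fall back on the coarse sense play/C3 rather than commit to either fine sense.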
The fifth fragment (“hit song”) is a compound spanning two tokens. In this case, the alternative interpretation as two distinct fragments of length one, for “hit” and “song”, was discarded during the analysis and all the probability was assigned to this fragment of length two. Had the software determined that the alternative was probable enough to keep, two fragments of length one would have been present: the first would contain a sense distribution for “hit” and “hit_song”, and the second a sense distribution for “song” and “hit_song”. The probabilities would be normalized based on the probability of each path. For example, if “hit_song” is 60% probable as a path, the various senses of “hit” would total 40%, as would those of “song”.
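The normalization arithmetic in the 60%/40% example can be worked through in a few lines of Python. This is only a sketch of the arithmetic, not of the software's actual procedure: the sensekeys hit/N1 and hit/N2 and their within-path probabilities (0.75 and 0.25) are made up for illustration, and `normalize_by_path` is a hypothetical helper.

```python
def normalize_by_path(path_prob, sense_probs):
    """Scale a within-path sense distribution by the probability of its path."""
    return {sk: path_prob * p for sk, p in sense_probs.items()}

# Path probabilities from the example in the text.
compound_path = 0.60                 # "hit song" read as one compound
single_path = 1.0 - compound_path    # "hit" and "song" read separately

# Sense distribution that the fragment for "hit" would carry.
# hit/N1 and hit/N2 are hypothetical sensekeys with made-up probabilities.
hit_fragment = {
    **normalize_by_path(compound_path, {"hit_song/N5": 1.0}),
    **normalize_by_path(single_path, {"hit/N1": 0.75, "hit/N2": 0.25}),
}

# Total probability mass assigned to the standalone senses of "hit".
hit_only_total = sum(p for sk, p in hit_fragment.items() if sk.startswith("hit/"))
```

Under these assumptions the compound keeps 0.60, the standalone senses of “hit” share the remaining 0.40 (0.30 and 0.10), and the fragment's distribution still sums to 1.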
The last fragment contains an example of a collapsed named entity (NE). The word “Alive” is the title of multiple songs in the Language Graph. After an initial analysis, it was determined that Alive/N12, N14, N15, N16, N21, and N75 were all possible given the context. They were replaced with the single sense Alive/N1705 during the process of disambiguation, and that solution was retained as the only one for this fragment.
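To recover which senses were folded into a collapsed one, the “collapsed” children of each “fs” can be collected. The Python sketch below does so for the “Alive” fragment from the example (with the “dep” elements left out). The function name `collapsed_map` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The "Alive" fragment from the example above, without its dep elements.
FRAG = """
<frag cccfmp="0.847" cccmp="0.675" ccfmp="0.787" len="1" so="7">
  <txt>Alive</txt>
  <cs pb="1.000" pc="0.874" sk="Alive/C1705" so="7">
    <fs pb="1.000" pc="0.874" sk="Alive/N1705" so="7">
      <collapsed sk="Alive/N12"/>
      <collapsed sk="Alive/N14"/>
      <collapsed sk="Alive/N15"/>
      <collapsed sk="Alive/N16"/>
      <collapsed sk="Alive/N21"/>
      <collapsed sk="Alive/N75"/>
    </fs>
  </cs>
</frag>
"""

def collapsed_map(frag):
    """Map each fine sensekey to the sensekeys that were collapsed into it."""
    return {
        fs.get("sk"): [c.get("sk") for c in fs.findall("collapsed")]
        for fs in frag.iter("fs")
    }

mapping = collapsed_map(ET.fromstring(FRAG))
```

A fine sense with no “collapsed” children simply maps to an empty list, so the same function can be applied to every fragment of a document.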