Two weeks ago, Idilia released the sense-annotated Wikipedia Dataset, containing one million Wikipedia articles, each with its first 5000 words annotated.
Idilia’s Language Graph contains tens of millions of curated word senses and semantic relations, along with millions of links to external references, including Wikipedia. To build the dataset, we first extracted all 2.5 million Wikipedia articles that are currently mapped to a sense in the Language Graph. We then sorted the articles by size and retained the one million largest.
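The selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the dataset: the `Article` type, the `select_largest_mapped` function and the set of mapped titles are hypothetical names, and size is measured here as character count because the post does not say how article size was defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical container for one Wikipedia article."""
    title: str
    text: str


def select_largest_mapped(articles, mapped_titles, keep):
    """Return the `keep` largest articles whose title is mapped to a sense.

    `mapped_titles` stands in for the Language Graph mappings (assumption).
    """
    # Step 1: keep only articles that are mapped to a sense.
    mapped = [a for a in articles if a.title in mapped_titles]
    # Step 2: sort by size, largest first. Character count is an assumed
    # measure of size; the sort is stable, so ties keep their input order.
    mapped.sort(key=lambda a: len(a.text), reverse=True)
    # Step 3: retain the top `keep` articles.
    return mapped[:keep]
```

For the dataset described here, `keep` would be one million and the input would be the full set of articles, of which 2.5 million pass the mapping filter.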
The first 5000 words of each article were then annotated using Sense Analysis. The annotations include parts of speech (for both open- and closed-class words), word senses (for both common words and named entities) and syntactic dependencies. In addition to word sense disambiguation (WSD), we used our existing high-precision mappings between word senses and Wikipedia articles to increase the accuracy of the disambiguation: a hyperlink to a Wikipedia article can be annotated automatically with the correct sense through these mappings. An example of a sense-annotated document can be found here.
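The way hyperlink mappings can take precedence over WSD is sketched below. It is a simplified illustration, not Idilia's implementation: `first_words`, `annotate_senses`, the `wsd` callable and the sense identifiers are all hypothetical, and whitespace tokenization is an assumption made to keep the example short.

```python
def first_words(text, limit=5000):
    """Return the first `limit` whitespace-separated words of `text`."""
    return text.split()[:limit]


def annotate_senses(tokens, link_targets, article_to_sense, wsd):
    """Assign one sense per token.

    tokens           -- list of word strings
    link_targets     -- dict: token index -> linked Wikipedia article title
    article_to_sense -- dict: article title -> sense id (pre-existing mapping)
    wsd              -- fallback callable (tokens, index) -> sense id
    """
    senses = []
    for i in range(len(tokens)):
        target = link_targets.get(i)
        if target in article_to_sense:
            # The token is a hyperlink to a mapped article: trust the mapping.
            senses.append(article_to_sense[target])
        else:
            # No usable link: fall back to the disambiguator.
            senses.append(wsd(tokens, i))
    return senses
```

In this sketch a linked token never reaches the disambiguator, which is one simple way to let a high-precision mapping override a statistical guess.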
The documents in the dataset are in SemDoc format.