Sense Annotated Datasets

Overview

Idilia provides several datasets to facilitate the development of word sense applications. The datasets can be used to build models or run experiments. Each dataset is a collection of sense-annotated documents.

Download and Installation

Each corpus is provided as a compressed tar archive split into partitions of 1GB each. For example: rcv1.semdoc.tgz.000, rcv1.semdoc.tgz.001, rcv1.semdoc.tgz.002, etc.

The partitions of a corpus are stored in a folder that contains an “index.html” file listing all the partitions to download. Retrieve them using a program such as “wget”:

wget -r <URL>

The downloaded files can be joined and uncompressed using:

cat *.semdoc.tgz.* | tar -xz
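
The command above relies on the shell expanding the wildcard in partition order, and a partition that failed to download only shows up later as a tar error. The following is a minimal sketch of a check to run first, assuming a POSIX shell and three-digit partition suffixes as in the example names above. The function names check_parts and join_and_extract are hypothetical and not part of the distribution.

```shell
# Hypothetical helper: verify that the partitions of one corpus are numbered
# contiguously (000, 001, ...) before joining them.
check_parts() {
    prefix=$1            # e.g. rcv1.semdoc.tgz
    i=0
    for part in "$prefix".[0-9][0-9][0-9]; do
        # An unmatched pattern is left as-is, so test that the file exists.
        [ -e "$part" ] || { echo "no partitions found for $prefix" >&2; return 1; }
        expected=$(printf '%s.%03d' "$prefix" "$i")
        if [ "$part" != "$expected" ]; then
            echo "missing partition: $expected" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "$i partitions found for $prefix"
}

# Join the partitions in lexical order and extract the archive,
# but only if the numbering check passes.
join_and_extract() {
    prefix=$1
    check_parts "$prefix" && cat "$prefix".[0-9][0-9][0-9] | tar -xz
}
```

For example, "join_and_extract rcv1.semdoc.tgz" would be run from the folder holding the downloaded partitions. The check cannot detect a missing final partition, since the total count is not known locally; compare the count it prints against the list in “index.html”.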

Viewing Documents

Each document is an XML file in semdoc format, which can represent ambiguity, dependencies, and more. (For details on this format, see Understanding the Semdoc Format.) Documents can be viewed in a browser by transforming them into HTML using a stylesheet. The output is the same as what is shown for Sense Analysis Sample Results.

  • Download the XSLT stylesheet semdoc_visualization.xsl
  • Transform a document: xsltproc -o doc.html semdoc_visualization.xsl any.semdoc.xml
  • Open doc.html in a browser.
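
To convert a whole folder of documents rather than one file, the transform step can be wrapped in a loop. This is a sketch only, assuming that the extracted documents are named with a “.semdoc.xml” suffix like the example above (the actual file names inside the archives may differ). The function names html_name_for and transform_all are hypothetical.

```shell
# Hypothetical helper: derive the HTML output name for a semdoc file,
# e.g. docs/any.semdoc.xml -> docs/any.html
html_name_for() {
    printf '%s\n' "${1%.semdoc.xml}.html"
}

# Transform every *.semdoc.xml file under a directory with the stylesheet,
# writing each HTML file next to its source document.
transform_all() {
    stylesheet=$1        # e.g. semdoc_visualization.xsl
    dir=$2               # folder holding the extracted documents
    find "$dir" -type f -name '*.semdoc.xml' | while IFS= read -r doc; do
        xsltproc -o "$(html_name_for "$doc")" "$stylesheet" "$doc" ||
            echo "failed: $doc" >&2
    done
}
```

For example, "transform_all semdoc_visualization.xsl ." processes every matching file below the current folder and reports any document that xsltproc could not transform.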

Reuters Classification Corpora

Classification experiments frequently use the Reuters-21578 and RCV1 corpora. Both have been processed and are available for download in semdoc format. They consist of newswire texts from Reuters; only the first 5000 tokens of each text were processed. The sense annotations for the RCV1 corpus are more accurate than those for the 21578 corpus because the original HTML source was available.

These versions do not contain the text elements and cannot be used to reconstruct the original documents. This is to respect their copyright. They do contain the senses and dependencies in the correct word order. If you have legal access to these corpora, you may contact us to obtain versions that include both the text and the senses. (To obtain legal access, please see NIST.)

Source   Tokens   DL Size   Disk Size   Download command
21578    3.1M     175M      1.2G        wget -r http://download.idilia.com/datasets/21578/index.html
RCV1     228M     12G       80G         wget -r http://download.idilia.com/datasets/rcv1/index.html

TREC Enterprise 2008

The TREC Enterprise 2008 corpus is the content of the website of the Australian Commonwealth Scientific and Industrial Research Organisation (CSIRO) as of March 2007. It can be used in search applications. See Text REtrieval Conference (TREC).

Most of the documents are well formatted and yield very good sense analysis results, but some are not. Take care when building models from this corpus.

This version does not contain the text elements and cannot be used to reconstruct the original documents. This is to respect their copyright. It does contain the senses and dependencies in the correct word order. If you have legal access to this corpus, you may contact us to obtain a version that includes both the text and the senses. (To obtain legal access, please see CSIRO.)

Source   Tokens   DL Size   Disk Size   Download command
CSIRO    143M     5.4G      34G         wget -r http://download.idilia.com/datasets/trec/index.html

Wikipedia

This dataset contains one million Wikipedia articles; only the first 5000 tokens of each article were processed. We took advantage of the Idilia-Wikipedia mapping to automatically insert the resulting word sense where possible.

Source      Tokens   DL Size   Disk Size   Download command
Wikipedia   >2.5B    46G       300G        wget -r http://download.idilia.com/datasets/wikipedia/index.html
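
Given the sizes listed, it may be worth checking free disk space before starting a download. The following is a rough sketch, assuming that “Disk Size” is the size of the extracted corpus and that the downloaded partitions stay on the same disk during extraction, so that roughly 46G + 300G = 346G would be needed for this dataset. The function name has_space is hypothetical.

```shell
# Hypothetical helper: succeed if the filesystem holding "$1" has at least
# "$2" gigabytes available. Relies on the POSIX output format of `df -Pk`.
has_space() {
    dir=$1
    needed_gb=$2
    # Column 4 of the second line is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    needed_kb=$((needed_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$needed_kb" ]
}
```

For example, "has_space . 346 || echo 'not enough free space'" checks the current folder before downloading the Wikipedia dataset.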

Twitter

We are considering building a corpus of millions of tweets. Please let us know if you are interested.