Automated Knowledge Extraction
Commercially viable WSD requires a semantic knowledge base: an inventory of word meanings and the semantic relations between them, broad enough to cover the vast majority of word senses found in general-domain texts, including proper noun senses. Proper nouns are a special case because there are millions of them (names of people, companies, products, and places, for example) and new ones are created all the time. Idilia’s semantic analysis technology incorporates a massive general-purpose semantic knowledge base, called the Language Graph, consisting of millions of proper and common word meanings connected by tens of millions of precise semantic relations of different types. To keep pace with new proper nouns, Idilia has researched and developed knowledge acquisition technology that automatically mines new terms and anchors them in the general-purpose knowledge base. More specifically, Idilia’s knowledge extraction technology includes:
- A massive linguistic knowledge base containing over 9.5 million senses (representing over 9 million proper nouns and 300,000 common word senses), connected by 100 million edges. By way of comparison, Wikipedia currently contains fewer than 4 million articles.
- Coarse and fine senses supporting different levels of WSD granularity;
- Technology for automatically extracting new terminology and semantic relations from unstructured and semi-structured sources;
- Algorithms for generating and testing plausible variants of extracted terms;
- Methods for precisely linking new terminology into the general-purpose knowledge base, building sub-ontologies whenever required. This means that the knowledge base can be constantly updated as new terminology is invented.
- A linguistic knowledge management system capable of representing and efficiently manipulating a massive number and wide variety of semantic relationships;
- Tools for visualizing semantic knowledge, as well as for manually validating and editing acquired linguistic knowledge and inputting additional knowledge.
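
To make the list above more concrete, the following is a minimal Python sketch of a sense inventory with typed relations, plus the two steps of generating variants for an extracted term and anchoring a new proper noun sense under an existing one. It is an illustration only, not Idilia's implementation: the class `SenseGraph`, the functions `anchor` and `variants`, the sense identifiers, and the relation name `instance_of` are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SenseGraph:
    """Toy sense inventory with typed, directed relations (hypothetical design)."""
    glosses: dict = field(default_factory=dict)                     # sense id -> gloss
    lemmas: dict = field(default_factory=lambda: defaultdict(set))  # surface form -> sense ids
    edges: dict = field(default_factory=lambda: defaultdict(set))   # sense id -> {(relation, target)}

    def add_sense(self, sense_id, lemma, gloss):
        self.glosses[sense_id] = gloss
        self.lemmas[lemma.lower()].add(sense_id)

    def relate(self, source, relation, target):
        # Refuse dangling edges: both endpoints must already be in the inventory.
        if source not in self.glosses or target not in self.glosses:
            raise KeyError("both senses must exist before they can be related")
        self.edges[source].add((relation, target))

    def senses_of(self, lemma):
        # Case-insensitive lookup of all candidate senses for a surface form.
        return sorted(self.lemmas.get(lemma.lower(), ()))

    def anchor(self, sense_id, lemma, gloss, parent):
        """Add a newly extracted term and link it under an existing sense."""
        self.add_sense(sense_id, lemma, gloss)
        self.relate(sense_id, "instance_of", parent)  # relation name is an assumption


def variants(term):
    """Generate a few plausible surface variants of an extracted term."""
    return {term, term.lower(), term.replace("-", " "), term.replace("-", "")}


graph = SenseGraph()
graph.add_sense("company/n1", "company", "a commercial business")
graph.add_sense("jaguar/n1", "jaguar", "a large spotted cat")
# A newly mined proper noun is anchored under an existing common-noun sense.
graph.anchor("jaguar/n2", "Jaguar", "a British car maker", parent="company/n1")
```

After the last line, `graph.senses_of("Jaguar")` returns both the animal and the company sense, which is the ambiguity a WSD system would then have to resolve. A production knowledge base would need persistent storage, many more relation types, and validation of extracted terms; none of that is attempted here.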