

Evaluation Exercises on Semantic Evaluation - ACL SigLex event



Available tasks:   7 

#3  Cross-Lingual Word Sense Disambiguation 

There is a general feeling in the WSD community that WSD should not be treated as an isolated research task, but should be integrated into real NLP applications such as machine translation or multilingual information retrieval. Using translations from a corpus instead of human-defined sense labels (e.g. WordNet) makes it easier to integrate WSD into multilingual applications, sidesteps the sense-granularity problem (which may itself be task-dependent), is language-independent, and offers a valid alternative for languages that lack adequate sense inventories and sense-tagged corpora.

We propose an unsupervised Word Sense Disambiguation task for English nouns based on parallel corpora. The sense label is composed of translations in the different target languages, and the sense inventory is built up by three annotators from the Europarl parallel corpus using a concordance tool. All translations of a polysemous word (above a predefined frequency threshold) are grouped into clusters, the "senses" of that word.

Languages: English - Dutch, French, German, Italian, Spanish


1. Bilingual Evaluation (English - Language X)

[English] ... equivalent to giving fish to people living on the [bank] of the river ...

Sense Label = {oever/dijk} [Dutch]
Sense Label = {rives/rivage/bord/bords} [French]
Sense Label = {Ufer} [German]
Sense Label = {riva} [Italian]
Sense Label = {orilla} [Spanish]

2. Multi-lingual Evaluation (English - all target languages)

... living on the [bank] of the river ...
Sense Label = {oever/dijk, rives/rivage/bord/bords, Ufer, riva, orilla}


As the task is formulated as an unsupervised WSD task, we will not annotate any training material. Participants can use the freely available Europarl corpus, which is also the corpus from which the sense inventory is built.
For the test data, native speakers will decide on the correct translation cluster(s) for each test sentence and give their top-3 translations from the predefined list of Europarl translations; these judgments are used to assign weights to the translations in the answer clusters for that sentence.
Participants will receive manually annotated development and test data:
  • Development/sample data: 5 polysemous English nouns, each with 20 example instances
  • Test data: 20 polysemous English nouns (selected from the test data as used in the lexical substitution task), each with 50 test instances


The evaluation will use precision and recall. We will perform both a "best result" evaluation (scoring only the first translation returned by a system) and a more relaxed "top ten" evaluation (scoring the first ten translations returned by a system).
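The two evaluation modes can be sketched as follows. The scoring scheme here (full credit whenever a returned translation appears in the gold cluster, precision over attempted instances, recall over all instances) is a simplifying assumption for illustration, not the official scorer, which weights translations by annotator preferences.

```python
def score(system, gold, top_n=1):
    """Precision/recall for translation-cluster WSD (simplified sketch).

    system: dict mapping instance id -> ranked list of translations
            (instances the system did not attempt may be omitted)
    gold:   dict mapping instance id -> set of gold translations
    top_n:  1 for the "best result" mode, 10 for the relaxed mode
    """
    attempted = correct = 0
    for inst_id, answers in system.items():
        attempted += 1
        # credit if any of the first top_n answers is in the gold cluster
        if any(t in gold[inst_id] for t in answers[:top_n]):
            correct += 1
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For the relaxed evaluation the same function is simply called with `top_n=10`.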

Organizers: Els Lefever and Veronique Hoste (University College Ghent, Belgium)
Web Site: http://webs.hogent.be/~elef464/lt3_SemEval.html


  • Test data availability: 22 March - 25 March, 2010
  • Result submission deadline: within 4 days after downloading the *test* data.

#11  Event Detection in Chinese News Sentences 

The goal of the task is to detect and analyze basic event contents in real-world Chinese news text. It consists of finding the key verbs or verb phrases that describe these events in Chinese sentences (after word segmentation and part-of-speech tagging), selecting a suitable situation description formula for them, and anchoring the different situation arguments to suitable syntactic chunks in the sentence. The three main sub-tasks are as follows:
  1. Target verb WSD: recognize whether the sentence contains key verbs or verb phrases describing the two focused event contents, and select a suitable situation description formula for each recognized key verb (or verb phrase) from a situation-network lexicon. The input of this sub-task is a Chinese sentence annotated with correct word segmentation and POS tags; its output is the sense selection or disambiguation tags of the target verbs in the sentence.
  2. Sentence SRL: anchor the different situation arguments to suitable syntactic chunks in the sentence, and annotate these arguments with suitable syntactic-constituent and functional tags. Its input is a Chinese sentence annotated with correct word segmentation, POS tags and the sense tags of the target verbs; its output is the syntactic chunk recognition and situation-argument anchoring results.
  3. Event detection: detect and analyze the specific event content through the interaction of target verb WSD and sentence SRL. Its input is a Chinese sentence annotated with correct word segmentation and POS tags; its output is a complete description of the event detected in the sentence (if it contains a focused target verb).
The following example illustrates the above procedure. Given the following Chinese sentence after word segmentation and POS tagging:

今天/n(Today) 我/r(I) 在/p(at) 书店/n(bookstore) 买/v(buy) 了/u(-ed) 三/m(three) 本/q 新/a(new) 书/n(book) 。/w (Today, I bought three new books at the bookstore.)

After the first processing stage (target verb WSD), we find a possession-transfer verb ‘买/v(buy)’ in the sentence and select the following situation description formula for it:

买/v(buy): DO(x, P(x,y)) CAUSE have(x,y) AND NOT have(z,y) [P=buy]
Then we anchor the four situation arguments to suitable syntactic chunks in the sentence and obtain the following sentence SRL result:

今天/n(Today) [S-np 我/r(I) ]x [D-pp 在/p(at) 书店/n(bookstore) ]z [P-vp 买/v(buy) 了/u(-ed) ]Tgt [O-np 三/m(three) 本/q 新/a(new) 书/n(book) ]y 。/w

Finally, we can get the following situation description for the sentence:

DO(x, P(x,y)) CAUSE have(x,y) AND NOT have(z,y) [x=我/r(I), y=三/m(three) 本/q 新/a(new) 书/n(book), z=书店/n(bookstore), P=买/v(buy)]
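The end-to-end output above can be represented as a simple record that binds each formula variable to its anchored chunk. The field names and encoding below are illustrative assumptions, not the task's official data format.

```python
# One detected event, as produced by the pipeline sketched above
# (field names are assumptions, not the official task format).
event = {
    "target": "买/v(buy)",
    "formula": "DO(x, P(x,y)) CAUSE have(x,y) AND NOT have(z,y)",
    "arguments": {                      # variable -> (chunk tag, chunk text)
        "x": ("S-np", "我/r(I)"),
        "y": ("O-np", "三/m(three) 本/q 新/a(new) 书/n(book)"),
        "z": ("D-pp", "在/p(at) 书店/n(bookstore)"),
        "P": ("P-vp", "买/v(buy) 了/u(-ed)"),
    },
}

def instantiate(event):
    """Render the situation description with its variable bindings."""
    bindings = ", ".join(
        f"{var}={chunk}" for var, (_, chunk) in event["arguments"].items()
    )
    return f"{event['formula']} [{bindings}]"
```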

Organizers: Qiang Zhou (Tsinghua University, Beijing, China)
Web Site: http://www.ncmmsc.org/SemEval-2010-Task/


#14  Word Sense Induction 


This task is a continuation of the WSI task (i.e. Task 2) of SemEval-2007 (nlp.cs.swarthmore.edu/semeval/tasks/task02/summary.shtml), with some significant changes to the evaluation setting.

Word Sense Induction (WSI) is defined as the process of identifying the different senses (or uses) of a target word in a given text in an automatic and fully unsupervised manner. The goal of this task is to allow comparison of unsupervised sense induction and disambiguation systems. A secondary outcome of this task will be a comparison with current supervised and knowledge-based methods for sense disambiguation. The evaluation scheme consists of the following assessment methodologies:

  • Unsupervised Evaluation. The induced senses are evaluated as clusters of examples and compared to sets of examples tagged with gold standard (GS) senses. The evaluation metric used, V-measure (Rosenberg & Hirschberg, 2007), measures both the homogeneity and the completeness of a clustering solution. Perfect homogeneity is achieved if every cluster contains only data points that are members of a single GS class; perfect completeness is achieved if all the data points that are members of a given class are also elements of the same cluster. Homogeneity and completeness behave much like precision and recall, where increasing one often decreases the other (Rosenberg & Hirschberg, 2007).

  • Supervised Evaluation. The second evaluation setting assesses WSI systems in a WSD task. A mapping is created between the induced sense clusters (from the unsupervised evaluation described above) and the actual GS senses. The mapping matrix is then used to tag each instance in the test corpus with GS senses, and the usual WSD recall/precision measures are applied. Supervised evaluation was part of the SemEval-2007 WSI task (Agirre & Soroa, 2007).
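The V-measure computation can be sketched directly from the conditional-entropy definitions in Rosenberg & Hirschberg (2007): homogeneity is 1 - H(C|K)/H(C), completeness is 1 - H(K|C)/H(K), and the V-measure is their harmonic mean.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) of a label sequence, in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(target, given):
    """H(target | given) over paired label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((n_gt / n) * math.log(n_gt / given_counts[g])
                for (g, _), n_gt in joint.items())

def v_measure(gold, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_gold, h_clust = entropy(gold), entropy(clusters)
    homogeneity = 1.0 if h_gold == 0 else 1 - cond_entropy(gold, clusters) / h_gold
    completeness = 1.0 if h_clust == 0 else 1 - cond_entropy(clusters, gold) / h_clust
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

Note that cluster labels need not match class labels: a clustering identical to the gold classes up to relabeling still scores 1.0, while a single all-in-one cluster over several classes scores 0.0.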

Andrew Rosenberg and Julia Hirschberg. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) Prague, Czech Republic, (June 2007). ACL.

Eneko Agirre and Aitor Soroa. Semeval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations, pp. 7-12, Prague, Czech Republic, (June 2007). ACL.

Organizers: Suresh Manandhar (University of York), Ioannis Klapaftis (University of York), and Dmitriy Dligach, (University of Colorado)
Web Site: http://www.cs.york.ac.uk/semeval2010_WSI/


#15  Infrequent Sense Identification for Mandarin Text to Speech Systems 

There are seven cases of grapheme-to-phoneme (GTP) conversion in a text-to-speech (TTS) system (Yarowsky, 1997). Among them, the most difficult is disambiguating homographs: words with the same POS (part of speech) but different pronunciations. In this case, different pronunciations of the same word always correspond to different word senses, so once the word senses are disambiguated, the GTP problem is resolved.

This task differs slightly from traditional WSD (word sense disambiguation): here, two or more senses may correspond to one pronunciation, i.e. the sense granularity is coarser than in WSD. For example, the preposition “为” has three senses: sense1 and sense2 share the pronunciation {wei4}, while sense3 corresponds to {wei2}. For each target word, not only the pronunciations but also the sense labels are provided for training; at test time, only the pronunciations are evaluated. The challenge of this task is the highly skewed distribution in real text: the most frequent pronunciation usually accounts for over 80% of instances.

In this task, we will provide a large volume of training data (each homograph has at least 300 instances) in accordance with the true distribution in real text. In the test data, we will provide at least 100 instances for each target word. In order to focus on the performance of identifying the infrequent sense, we will deliberately split the test set half and half between infrequent-pronunciation and frequent-pronunciation instances. Evaluation follows the usual precision and recall methodology.
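The half-and-half test construction can be sketched as follows. The `(sentence, pronunciation)` pair format is a hypothetical interface, since the task's actual data layout is not specified here.

```python
import random

def balanced_test_set(instances, n_test=100, seed=0):
    """Build a test set split 50/50 between the most frequent pronunciation
    and all infrequent ones. `instances` is a list of
    (sentence, pronunciation) pairs for one homograph (assumed format).
    """
    rng = random.Random(seed)
    by_pron = {}
    for sent, pron in instances:
        by_pron.setdefault(pron, []).append((sent, pron))
    # the majority pronunciation vs. everything else
    majority = max(by_pron, key=lambda p: len(by_pron[p]))
    frequent = by_pron[majority]
    infrequent = [inst for p, insts in by_pron.items() if p != majority
                  for inst in insts]
    half = n_test // 2
    test = rng.sample(frequent, half) + rng.sample(infrequent, half)
    rng.shuffle(test)
    return test
```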

All instances come from the People's Daily newspaper (the most popular newspaper in Mandarin). Annotation is performed manually in a double-blind fashion, and a third annotator checks the annotations.

Yarowsky, David (1997). “Homograph disambiguation in text-to-speech synthesis.” In van Santen, Jan T. H.; Sproat, Richard; Olive, Joseph P.; and Hirschberg, Julia. Progress in Speech Synthesis. Springer-Verlag, New York, 157-172.

Organizers: Peng Jin, Yunfang Wu and Shiwen Yu (Peking University Beijing, China)
Web Site:



  • Test data release: March 25, 2010
  • Result submission deadline: March 29, 2010
  • Organizers send the test results: April 2, 2010

#16  Japanese WSD 

This task can be considered an extension of the SENSEVAL-2 Japanese lexical sample (monolingual dictionary-based) task. Word senses are defined according to the Iwanami Kokugo Jiten, a Japanese dictionary published by Iwanami Shoten. Please refer to that task for background. We believe our task has the following two new characteristics:

1) All previous Japanese sense-tagged corpora were drawn from newspaper articles, while English sense-tagged corpora have been constructed on balanced corpora such as the Brown Corpus and the BNC. The first balanced corpus of contemporary written Japanese (the BCCWJ corpus) is now being constructed as part of a national project in Japan [Maekawa, 2008], and we are building a sense-tagged corpus on top of it. The task will therefore use the first balanced Japanese sense-tagged corpus.

2) In previous WSD tasks, systems were required to select a sense for a word in context (an instance) from a given set of dictionary senses. However, the dictionary's sense set is not always complete: new word senses sometimes appear after the dictionary has been compiled, so some instances may carry a sense that cannot be found in the dictionary. The task will take into account not only instances whose sense appears in the given set but also instances whose sense does not; for the latter, systems should output that the instance has a sense that is not in the set.
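One simple way a system might produce the required "sense not in the set" output is a confidence threshold over its per-sense scores. This is purely an illustrative sketch; the scoring interface, the threshold, and the output label are all assumptions, not part of the task definition.

```python
def predict_sense(sense_scores, threshold=0.5, new_sense_label="NEW_SENSE"):
    """Return the highest-scoring dictionary sense, or report that the
    instance carries a sense missing from the inventory when no candidate
    is confident enough.

    `sense_scores` maps dictionary sense IDs to model probabilities
    (a hypothetical interface; the 0.5 threshold is an assumption).
    """
    best = max(sense_scores, key=sense_scores.get)
    if sense_scores[best] < threshold:
        return new_sense_label
    return best
```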

Organizers: Manabu Okumura (Tokyo Institute of Technology), Kiyoaki Shirai (Japan Advanced Institute of Science and Technology)
Web Site: http://lr-www.pi.titech.ac.jp/wsd.html

#17  All-words Word Sense Disambiguation on a Specific Domain (WSD-domain) 

Domain adaptation is a hot issue in Natural Language Processing, including Word Sense Disambiguation. WSD systems trained on general corpora are known to perform worse when moved to specific domains. The WSD-domain task will offer a testbed for domain-specific WSD systems and will allow domain-portability issues to be tested.

Texts from the ECNC and WWF will be used to build domain-specific test corpora (see the example below). The data will be available in several languages: English, Dutch and Italian, and possibly Basque and Chinese (confirmation pending). The sense inventories will be based on the wordnets of the respective languages.

The test data will comprise three documents (a 6,000-word chunk with approximately 2,000 target words) per language. The test data will be annotated by hand using double-blind annotation plus adjudication, and inter-tagger agreement will be measured. No training data will be provided, but participants are free to use existing hand-tagged corpora and lexical resources. Traditional precision and recall measures will be used to evaluate the participating systems, as implemented in past Senseval and SemEval WSD tasks.

WSD-domain is being developed in the framework of the Kyoto project (http://www.kyoto-project.eu/).

Environment domain text example:
"Projections for 2100 suggest that temperature in Europe will have risen by between 2 to 6.3 °C above 1990 levels. The sea level is projected to rise, and a greater frequency and intensity of extreme weather events are expected. Even if emissions of greenhouse gases stop today, these changes would continue for many decades and in the case of sea level for centuries. This is due to the historical build up of the gases in the atmosphere and time lags in the response of climatic and oceanic systems to changes in the atmospheric concentration of the gases."

Organizers: Eneko Agirre and Oier Lopez de Lacalle (Basque Country University)
Web Site: http://xmlgroup.iit.cnr.it/SemEval2010/


  • Test data release: March 26
  • Closing competition: April 2

#18  Disambiguating Sentiment Ambiguous Adjectives 

Some adjectives are neutral in sentiment polarity out of context, but show positive, neutral or negative meaning within a specific context. Such words can be called dynamic sentiment ambiguous adjectives. For instance, “价格高|the price is high” indicates a negative meaning, while “质量高|the quality is high” has a positive connotation. Disambiguating sentiment ambiguous adjectives is an interesting task at the intersection of word sense disambiguation and sentiment analysis. However, in previous work, sentiment-ambiguous words have not been tackled in the field of WSD, and they are mostly discarded outright in research on sentiment analysis.

This task aims to create a benchmark dataset for disambiguating dynamic sentiment ambiguous adjectives. Sentiment-ambiguous words are pervasive in many languages. In this task we concentrate on Chinese, but we believe the disambiguation techniques should be language-independent. In total, 14 dynamic sentiment ambiguous adjectives are selected, all high-frequency words in Mandarin Chinese: 大|big, 小|small, 多|many, 少|few, 高|high, 低|low, 厚|thick, 薄|thin, 深|deep, 浅|shallow, 重|heavy, 轻|light, 巨大|huge, 重大|grave.

The dataset contains two parts. Some sentences containing the target adjectives will be extracted from the Chinese Gigaword (LDC corpus: LDC2005T14); the other sentences will be gathered through search engines such as Google. The sentences will first be automatically segmented and POS-tagged, and then the ambiguous adjectives will be manually annotated with the correct sentiment polarity in the sentence context. Two human annotators will annotate the sentences double-blind, and a third annotator will check the annotation.

This task will be carried out in an unsupervised setting, so no training data will be provided. All the data, about 4,000 sentences, will be provided as the test set. Evaluation will be performed in terms of the usual precision, recall and F1 scores.

Organizers: Yunfang Wu, Peng Jin, Miaomiao Wen and Shiwen Yu (Peking University, Beijing, China)
Web Site:



  • Test data release: March 23, 2010
  • Result submission deadline: postponed to March 27, 2010 (4 days after downloading the test data)
  • Organizers send the test results: April 2, 2010

© 2008 FBK-irst