

Evaluation Exercises on Semantic Evaluation - ACL SigLex event



Available tasks:   18 

#1  Coreference Resolution in Multiple Languages 

Using coreference information has been shown to be beneficial in a number of NLP applications including Information Extraction, Text Summarization, Question Answering and Machine Translation. This task is concerned with intra-document coreference resolution for six different languages: Catalan, Dutch, English, German, Italian and Spanish. The complete task is divided into two subtasks for each of the languages:
  1. Detection of full coreference chains, composed of named entities, pronouns, and full noun phrases.
  2. Pronominal resolution, i.e., finding the antecedents of the pronouns in the text.


Data are provided for both statistical training and evaluation. The coreference chains are extracted from manually annotated corpora: the AnCora corpora for Catalan and Spanish, the OntoNotes corpus for English, the TüBa-D/Z for German, the KNACK corpus for Dutch, and the LiveMemories corpus for Italian. The data are additionally enriched with morphological, syntactic and semantic information (such as gender, number, constituents, dependencies, predicates, etc.). Great effort has been devoted to providing the participants with a common and relatively simple data representation for all the languages.


The main goal is to perform and evaluate coreference resolution for six different languages with the help of other layers of linguistic information and using different evaluation metrics (MUC, B-CUBED, CEAF and BLANC).
  1. The multilingual context will allow us to study the portability of coreference resolution systems across languages. To what extent is it possible to implement a general system that is portable to all six languages? How much language-specific tuning is necessary? Are there significant differences between Germanic and Romance languages? And between languages of the same family?
  2. The additional layers of annotation will allow us to study how helpful morphology, syntax and semantics are for resolving coreference relations. How much preprocessing is needed? How much does the quality of the preprocessing modules (perfect linguistic input vs. noisy automatic input) affect the performance of state-of-the-art coreference resolution systems? Is morphology more helpful than syntax? Or semantics? Or is syntax more helpful than semantics?
  3. The use of four different evaluation metrics will allow us to compare the advantages and drawbacks of the generally used MUC, B-CUBED and CEAF measures, as well as the newly proposed BLANC measure. Do all of them provide the same ranking? Are they correlated? Can systems be optimized under all four metrics at the same time?
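As an illustration of one of the four metrics, the sketch below computes B-CUBED precision, recall and F1 in their usual mention-based form. It assumes that the key and the response partition exactly the same set of mentions (as in the gold-standard scenario); the function name and the toy mention identifiers are hypothetical, and this is not the official task scorer.

```python
from typing import List, Set, Tuple


def b_cubed(key: List[Set[str]], response: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for two partitions of the same mentions."""
    # Map every mention to the chain that contains it.
    key_of = {m: chain for chain in key for m in chain}
    resp_of = {m: chain for chain in response for m in chain}
    if not key_of or set(key_of) != set(resp_of):
        raise ValueError("key and response must cover the same non-empty mention set")

    n = len(key_of)
    # Per-mention scores, averaged over all mentions.
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in key_of) / n
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in key_of) / n
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold_chains = [{"a", "b", "c"}, {"d"}]
    system_chains = [{"a", "b"}, {"c", "d"}]
    print(b_cubed(gold_chains, system_chains))
```

Because the score is computed per mention, splitting or merging a long chain is penalized in proportion to the number of mentions affected, which is one way in which B-CUBED can rank systems differently from the link-based MUC measure.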


Two different scenarios will be considered for evaluation. In the first one, gold-standard annotation will be provided to participants (up to full syntax and possibly also including semantic role labeling). This input annotation will correctly identify all noun phrases that are part of the coreference chains. In the second scenario we will use state-of-the-art automatic linguistic tools to generate the input annotation of the data. In this second scenario, the match between the automatically generated structure and the actual NPs participating in the chains does not need to be perfect. By defining these two experimental settings, we will be able to check the effectiveness of state-of-the-art coreference resolution systems when working with perfect linguistic (syntactic/semantic) information and the degradation in performance when moving to a realistic scenario. In parallel, we will also differentiate between closed and open settings, that is, when participants are allowed to use strictly the information contained in the training data (closed) and when they make use of some external resources/tools (open).

Organizers: Veronique Hoste, Lluis Marquez, M. Antonia Marti, Massimo Poesio, Marta Recasens, Emili Sapena, Mariona Taule, Yannick Versley. (Universitat de Barcelona, Universitat Politècnica de Catalunya, Hogeschool Gent, Università di Trento, Universität Tübingen)
Web Site: http://stel.ub.edu/semeval2010-coref/

[ Ranking]

  • Training data release : February 11th
  • Test data release: March 20th
  • Time constraint: Upload the results no more than 7 days after downloading the test set
  • Closing competition : April 2nd

#2  Cross-Lingual Lexical Substitution 


The goal of this task is to provide a framework for the evaluation of systems for cross-lingual lexical substitution. Given a paragraph and a target word, the goal is to provide several correct translations for that word in a given language, with the constraint that the translations fit the given context in the source language. This is a follow-up of the English lexical substitution task from SemEval-2007 (McCarthy and Navigli, 2007), but this time the task is cross-lingual.

While there are connections between this task and the task of automatic machine translation, there are several major differences. First, cross-lingual lexical substitution targets one word at a time, rather than an entire sentence as machine translation does. Second, in cross-lingual lexical substitution we seek as many good translations as possible for the given target word, as opposed to just one translation, which is the typical output of machine translation. There are also connections between this task and a word sense disambiguation task which uses distinctions in translations for word senses (Resnik and Yarowsky, 1997). However, in this task we do not restrict the translations to those in a specific parallel corpus; the annotators and systems are free to choose the translations from any available resource. Also, we do not assume a fixed grouping of translations to form "senses", and so it is possible that any token instance of a word may have translations in common with other token instances that are not themselves directly related.

Given a paragraph and a target word, the task is to provide several correct translations for that word in a given language. We will use English as the source language and Spanish as the target language.


Organizers: Rada Mihalcea (University of North Texas), Diana McCarthy (University of Sussex), Ravi Sinha (University of North Texas)
Web Site: http://lit.csci.unt.edu/index.php/Semeval_2010

[ Ranking]

  • Test data availability: 1 March - 2 April , 2010
  • Result submission deadline: within 7 days after downloading the *test* data.
  • Closing competition for this task: 2 April

#3  Cross-Lingual Word Sense Disambiguation 

There is a general feeling in the WSD community that WSD should not be considered as an isolated research task, but should be integrated in real NLP applications such as machine translation or multilingual IR. Using translations from a corpus instead of human-defined sense labels (e.g. WordNet) makes it easier to integrate WSD in multilingual applications, solves the granularity problem (which may itself be task-dependent), is language-independent, and can be a valid alternative for languages that lack sufficient sense inventories and sense-tagged corpora.

We propose an Unsupervised Word Sense Disambiguation task for English nouns by means of parallel corpora. The sense label is composed of translations in the different languages and the sense inventory is built up by three annotators on the basis of the Europarl parallel corpus by means of a concordance tool. All translations (above a predefined frequency threshold) of a polysemous word are grouped into clusters/"senses" of that given word.

Languages: English - Dutch, French, German, Italian, Spanish


1. Bilingual Evaluation (English - Language X)

[English] ... equivalent to giving fish to people living on the [bank] of the river ...

Sense Label = {oever/dijk} [Dutch]
Sense Label = {rives/rivage/bord/bords} [French]
Sense Label = {Ufer} [German]
Sense Label = {riva} [Italian]
Sense Label = {orilla} [Spanish]

2. Multi-lingual Evaluation (English - all target languages)

... living on the [bank] of the river ...
Sense Label = {oever/dijk, rives/rivage/bord/bords, Ufer, riva, orilla}


As the task is formulated as an unsupervised WSD task, we will not annotate any training material. Participants can use the Europarl corpus that is freely available and that will be used for building up the sense inventory.
For the test data, native speakers will decide on the correct translation cluster(s) for each test sentence and give their top-3 translations from the predefined list of Europarl translations, in order to assign weights to the translations from the answer clusters for that test sentence.
Participants will receive manually annotated development and test data:
  • Development/sample data: 5 polysemous English nouns, each with 20 example instances
  • Test data: 20 polysemous English nouns (selected from the test data as used in the lexical substitution task), each with 50 test instances


The evaluation will be done using precision and recall. We will perform both a "best result" evaluation (the first translation returned by a system) and a more relaxed evaluation for the "top ten" results (the first ten translations returned by a system).
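The difference between the "best result" and "top ten" evaluations can be sketched as follows. This is a simplified illustration, not the official scorer: it assumes each test item has gold translations with annotator weights, gives a system the share of gold weight covered by its first n guesses, and averages over attempted items (precision) or over all items (recall). The item identifiers, the weights, and the translations other than "orilla" are hypothetical.

```python
from typing import Dict, List, Tuple

Gold = Dict[str, Dict[str, int]]   # item id -> {translation: annotator weight}
System = Dict[str, List[str]]      # item id -> ranked list of translations


def score(gold: Gold, system: System, top_n: int) -> Tuple[float, float]:
    """Return (precision, recall) using only the first top_n guesses per item."""
    credit = 0.0
    attempted = 0
    for item_id, weights in gold.items():
        guesses = system.get(item_id, [])[:top_n]
        if not guesses:
            continue  # unanswered items lower recall but not precision
        attempted += 1
        total = sum(weights.values())
        # Each distinct guess earns the weight of the gold translation it matches.
        credit += sum(weights.get(g, 0) for g in set(guesses)) / total
    precision = credit / attempted if attempted else 0.0
    recall = credit / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = {"bank.1": {"orilla": 3, "ribera": 1}, "bank.2": {"banco": 4}}
    system = {"bank.1": ["ribera", "orilla"]}
    print("best:", score(gold, system, top_n=1))
    print("top ten:", score(gold, system, top_n=10))
```

Under these assumptions a system that ranks a less preferred translation first is penalized in the "best" setting but recovers full credit in the relaxed setting.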

Organizers: Els Lefever and Veronique Hoste (University College Ghent, Belgium)
Web Site: http://webs.hogent.be/~elef464/lt3_SemEval.html

[ Ranking]

  • Test data availability: 22 March - 25 March , 2010
  • Result submission deadline: within 4 days after downloading the *test* data.

#4  VP Ellipsis - Detection and Resolution 


Verb Phrase Ellipsis (VPE) occurs in the English language when an auxiliary or modal verb abbreviates an entire verb phrase recoverable from the linguistic context, as in the following examples:

  • Both Dr. Mason and Dr. Sullivan [oppose federal funding for abortion], as does President Bush, except in cases where a woman's life is threatened.
  • They also said that vendors were [delivering goods] more quickly in October than they had for each of the five previous months.
  • He spends his days [sketching passers-by], or trying to.
Here occurrences of VPE are typeset in a bold face font. The antecedent is marked by square brackets.

The Task

The proposed shared task consists of two subtasks: (1) automatically detecting VPE in free text; and (2) selecting the textual antecedent of each found VPE. Task 1 is reasonably difficult (Nielsen 2004 reports an F-score of 71% on Wall Street Journal data).

Task 2 is challenging. With a "head match" evaluation, Hardt 1997 reports a success rate of 62% for a baseline system based on recency only, and an accuracy of 84% for an improved system taking recency, clausal relations, parallelism, and quotation into account. We will make the task more realistic (but more difficult) by not using head match but rather precision and recall over each token of the antecedent.

We will provide texts where sentence boundaries are detected and each sentence is tokenised and printed on a new line. An occurrence of VPE is marked by a line number plus token positions of the auxiliary or modal verb. Textual antecedents are assumed to be on one line, and are marked by the line number plus begin/end token position.

The Data

As development data we will provide the stand-off annotation of more than 500 occurrences of manually annotated VPE in the Wall Street Journal part (all 25 sections) of the Penn Treebank. We have made an arrangement with the Linguistic Data Consortium that participants without access to the Penn Treebank can use the raw texts for the duration of the shared task.

We will also produce a script that calculates precision and recall of detection and the average F-score and accuracy of antecedent selection based on overlap with a gold standard antecedent.
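Token-level scoring of a selected antecedent against the gold standard can be sketched as below. The sketch assumes that an antecedent is given as a line number plus begin and end token positions, as described above, and additionally assumes that the end position is inclusive; the class and function names are hypothetical and this is not the organisers' script.

```python
from typing import NamedTuple, Set, Tuple


class Span(NamedTuple):
    line: int   # line (sentence) number in the tokenised text
    begin: int  # position of the first token
    end: int    # position of the last token (assumed inclusive)


def tokens(span: Span) -> Set[Tuple[int, int]]:
    """Expand a span into the set of (line, position) pairs it covers."""
    return {(span.line, i) for i in range(span.begin, span.end + 1)}


def antecedent_prf(gold: Span, predicted: Span) -> Tuple[float, float, float]:
    """Precision, recall and F-score over the tokens of the antecedent."""
    g, p = tokens(gold), tokens(predicted)
    overlap = len(g & p)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f


if __name__ == "__main__":
    print(antecedent_prf(Span(12, 5, 9), Span(12, 7, 10)))
```

Unlike a head-match criterion, this gives partial credit for an antecedent whose boundaries are slightly off and no credit for one on the wrong line.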

The test data will be a further collection of newswire (or similar genre) articles. The "gold" standard of the test data will be determined by using the merged results of all task participants. Additionally, these will be manually judged by the organisers.


Daniel Hardt (1997): An Empirical Approach to VP Ellipsis. Computational Linguistics 23(4).

Leif A. Nielsen (2004): Verb phrase ellipsis detection using automatically parsed text. Proceedings of the 20th international Conference on Computational Linguistics (Geneva, Switzerland).

Organizers: Johan Bos (University of Rome "La Sapienza") and Jennifer Spenader (University of Groningen)
Web Site: http://www.sigsem.org/wiki/SemEval_2010:_VP_Ellipsis_Processing

#5  Automatic Keyphrase Extraction from Scientific Articles 

Keyphrases are words that capture the main topic of the document. As keyphrases represent the key ideas of documents, extracting good keyphrases benefits various natural language processing (NLP) applications, such as summarization, information retrieval (IR) and question-answering (QA). In summarization, the keyphrases can be used as semantic metadata. In search engines, keyphrases can supplement full-text indexing and assist users in creating good queries. Therefore, the quality of keyphrases has a direct impact on the quality of downstream NLP applications.

Recently, several systems and techniques have been proposed to extract keyphrases. Hence, we propose a shared task in order to provide the chance to compete and benchmark such technologies.

In the shared task, the participants will be provided with a set of scientific articles and will be asked to produce the keyphrases for each article.

The organizers will provide trial, train and test data. The average length of the articles is between 6 and 8 pages including tables and pictures. We will provide two sets of answers: author-assigned keyphrases and reader-assigned keyphrases. All reader-assigned keyphrases will be extracted from the papers whereas some of author-assigned keyphrases may not occur in the content.

The answer set contains lemmatized keyphrases. We also accept two alternations of a keyphrase: A of B -> B A (e.g. policy of school = school policy) and A's B -> A B (e.g. school's policy = school policy). However, when an alternation changes the meaning of the keyphrase, we do not include it in the answer set.

In this shared task, we follow the traditional evaluation metric. That is, we match the keyphrases in the answer sets (i.e. author-assigned keyphrases and reader-assigned keyphrases) with those the participants provide and calculate precision, recall and F-score. Finally, we will rank the participants by F-score.
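The matching step can be sketched as follows. The sketch assumes that keyphrases are already lemmatized, applies the two accepted alternations mechanically, and then computes precision, recall and F-score by exact match. It cannot check whether an alternation changes the meaning, and the function names and example keyphrases (other than "school policy") are hypothetical.

```python
import re
from typing import Iterable, Tuple


def normalize(phrase: str) -> str:
    """Map 'A of B' to 'B A' and "A's B" to 'A B'; otherwise only tidy whitespace and case."""
    p = " ".join(phrase.lower().split())
    m = re.fullmatch(r"(.+) of (.+)", p)
    if m:
        return f"{m.group(2)} {m.group(1)}"
    m = re.fullmatch(r"(.+)'s (.+)", p)
    if m:
        return f"{m.group(1)} {m.group(2)}"
    return p


def prf(answers: Iterable[str], candidates: Iterable[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F-score after normalization."""
    gold = {normalize(a) for a in answers}
    system = {normalize(c) for c in candidates}
    matched = len(gold & system)
    precision = matched / len(system) if system else 0.0
    recall = matched / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f


if __name__ == "__main__":
    answers = ["school policy", "grid computing", "resource allocation"]
    candidates = ["policy of school", "grid computing", "neural network", "school's policy"]
    print(prf(answers, candidates))
```

Normalizing both sides to the same surface form means that a system is not penalized for choosing one accepted variant over another, and duplicate variants of the same keyphrase are counted once.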

The Google-group for the task is at http://groups.google.com.au/group/semeval2010-keyphrase?lnk=gcimh&pli=1

Organizers: Su Nam Kim (University of Melbourne), Olena Medelyan (University of Waikato), Min-yen Kan (National University of Singapore), Timothy Baldwin (University of Melbourne)
Web Site: http://docs.google.com/Doc?id=ddshp584_46gqkkjng4

[ Ranking]


  • Test and training data release : Feb. 15th (Monday)
  • Closing competition : March 19th (5 weeks for competition) (Friday)
  • Results out : by March 31st
  • Submission of description papers: April 17, 2010
  • Notification of acceptance: May 6, 2010
  • Workshop: July 15-16, 2010 ACL Uppsala

#6   Classification of Semantic Relations between MeSH Entities in Swedish Medical Texts  


Task cancelled

There is growing interest in, and consequently a growing volume of publications on, relation classification in the medical domain. Algorithms for classifying semantic relations have potential uses in many language technology applications, and interest in them has been renewed in recent years. If such semantic relations can be determined, systems and applications such as Information Retrieval and Extraction, Summarization, and Question Answering stand to obtain more accurate results, particularly since searching for mere co-occurrence of terms is unfocused and by no means guarantees that a relation holds between the identified terms of interest. For instance, knowing the relationship that holds between a medication and a disease or symptom should be useful for searching free text and for more easily obtaining answers to questions such as “What is the effect of treatment with substance X on disease Y?”.

Our task, "Classification of Semantic Relations between MeSH Entities in Swedish Medical Texts", deals with the classification of semantic relations between pairs of MeSH entities/annotations. We focus on three entity types: DISEASES/SYMPTOMS (category C in the MeSH hierarchy), CHEMICALS AND DRUGS (category D), and ANALYTICAL, DIAGNOSTIC AND THERAPEUTIC TECHNIQUES AND EQUIPMENT (category E). The evaluation task is similar to SEMEVAL-1/Task#4 by Girju et al., Classification of Semantic Relations between Nominals, so the evaluation methodology will use criteria similar to those already developed for that task. The datasets for the task will consist of sentences annotated with relevant MeSH entities, including the surrounding context for the investigated entities and their relation within a window of one to two preceding and one to two following sentences. We plan to have about nine semantic relations, with approximately 100-200 training sentences and 50-100 testing sentences per relation.

Organizers: Dimitrios Kokkinakis (University of Gothenburg), Dana Dannells (University of Gothenburg), Hercules Dalianis (Stockholm University)
Web Site: http://demo.spraakdata.gu.se/svedk/semeval/

#7  Argument Selection and Coercion 


Task Description

This task involves identifying the compositional operations involved in argument selection. Most annotation schemes to date encoding propositional or predicative content have focused on the identification of the predicate type, the argument extent, and the semantic role (or label) assigned to that argument by the predicate. In contrast, this task attempts to capture the "compositional history" of the argument selection relative to the predicate. In particular, this task attempts to identify the operations of type adjustment induced by a predicate over its arguments when they do not match its selectional properties. The task is defined as follows: for each argument of a predicate, identify whether the entity in that argument position satisfies the type expected by the predicate. If not, then one needs to identify how the entity in that position satisfies the typing expected by the predicate; that is, to identify the source and target types in a type-shifting (or coercion) operation. The possible relations between the predicate and a given argument will, for this task, be restricted to selection and coercion. In selection, the argument NP satisfies the typing requirements of the predicate. For example, in the sentence "The child threw the ball", the object NP "the ball" directly satisfies the type expected by the predicate, Physical Object. If this is not the case, then a coercion has occurred. For example, in the sentence "The White House denied this statement.", the type expected in subject position by the predicate is Human, but the surface NP is typed as Location. The task is to identify both the type mismatch and the type shift; namely Location -> Human.
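The distinction between selection and coercion can be illustrated with a small sketch. It assumes that the expected type and the argument's own type are already known and that types are compared by exact equality, which ignores any type hierarchy; determining the types is the actual task. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ArgumentInstance:
    predicate: str
    argument: str
    expected_type: str   # type the predicate expects in this argument position
    argument_type: str   # type of the surface NP


def composition(inst: ArgumentInstance) -> Tuple[str, Optional[Tuple[str, str]]]:
    """Return the operation and, for coercion, the (source, target) type shift."""
    if inst.argument_type == inst.expected_type:
        return "selection", None
    return "coercion", (inst.argument_type, inst.expected_type)


if __name__ == "__main__":
    throw = ArgumentInstance("threw", "the ball", "Physical Object", "Physical Object")
    deny = ArgumentInstance("denied", "The White House", "Human", "Location")
    print(composition(throw))
    print(composition(deny))
```

For the two example sentences, the first instance is a selection and the second is a coercion with the shift Location -> Human.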

Resources and Corpus Development

The following methodology will be followed in corpus creation: (1) A set of selection contexts will be chosen; (2) A set of sentences will be randomly selected for each chosen context; (3) The target noun phrase will be identified in each sentence, and a composition type determined in each case; (4) In cases of coercion, the source and target types for the semantic head of each relevant noun phrase will be identified. We will perform double annotation and adjudication over the corpus.

Evaluation Methodology

Precision and recall will be used as evaluation metrics. A scoring program will be supplied for participants. Two subtasks will be evaluated separately: (1) identifying the argument type and (2) identifying the compositional operation (i.e. selection vs. coercion).


J. Pustejovsky, A. Rumshisky, J. L. Moszkowicz, and O. Batiukova. 2009. Glml: Annotating argument selection and coercion. IWCS-8.

Organizers: James Pustejovsky, Nicoletta Calzolari, Anna Rumshisky, Jessica Moszkowicz, Elisabetta Jezek, Valeria Quochi, Olga Batiukova
Web Site: http://asc-task.org/


  • 11/10/09 - Trial data for English and Italian posted
  • 3/10/10 - Training data for English and Italian released
  • 3/27/10 - Test data for English and Italian released
  • 4/02/10 - Closing competition

#8  Multi-Way Classification of Semantic Relations Between Pairs of Nominals 

Recently, the NLP community has shown a renewed interest in deeper semantic analyses, among them automatic recognition of semantic relations between pairs of words. This is an important task with many potential applications including but not limited to Information Retrieval, Information Extraction, Text Summarization, Machine Translation, Question Answering, Paraphrasing, Recognizing Textual Entailment, Thesaurus Construction, Semantic Network Construction, Word Sense Disambiguation, and Language Modelling.

Despite the interest, progress was slow due to incompatible classification schemes, which made direct comparisons hard. In addition, most datasets provided no context for the target relation, thus relying on the assumption that semantic relations are largely context-independent, which is often false. A notable exception is SemEval-2007 Task 4 (Girju et al., 2007), which for the first time provided a standard benchmark dataset for seven semantic relations in context. However, this dataset treated each relation separately, asking for positive vs. negative classification decisions. While some subsequent publications tried to use the dataset in a multi-way setup, it was not designed to be used in that manner.

We believe that having a freely available standard benchmark dataset for *multi-way* semantic relation classification *in context* is much needed for the overall advancement of the field. That is why we pose as our primary objective the task of preparing and releasing such a dataset to the research community.

We will use nine mutually exclusive relations from Nastase & Szpakowicz (2003). The dataset for the task will consist of annotated sentences, gathered from the Web and manually marked -- with indicated nominals and relations. We will provide 1000 examples for each relation, which is a sizeable increase over the SemEval-2007 Task 4, where there were about 210 examples for each of the seven relations. There will also be a NONE relation, for which we will have 1000 examples as well.

Using that dataset, we will set up a common evaluation task that will enable researchers to compare their algorithms. The official evaluation score will be average F1 over all relations, but we will also check whether some relations are more difficult to classify than others, and whether some algorithms are best suited for certain types of relations. Trial data and an automatic scorer will be made available well in advance (by June 2009). All data will be released under a Creative Commons license.
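The averaged F1 score can be sketched as a macro-average: one F1 per relation, then the unweighted mean. The sketch below is an illustration only, not the official scorer; whether NONE counts as one of the averaged relations is left to the caller, and the relation labels in the example are placeholders rather than the actual relation inventory.

```python
from typing import List, Sequence


def macro_f1(gold: Sequence[str], predicted: Sequence[str], relations: List[str]) -> float:
    """Unweighted mean of per-relation F1 over the given relation labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not relations:
        raise ValueError("at least one relation label is required")
    scores = []
    for rel in relations:
        tp = sum(1 for g, p in zip(gold, predicted) if g == rel and p == rel)
        fp = sum(1 for g, p in zip(gold, predicted) if g != rel and p == rel)
        fn = sum(1 for g, p in zip(gold, predicted) if g == rel and p != rel)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["REL_A", "REL_A", "REL_B", "NONE"]
    pred = ["REL_A", "REL_B", "REL_B", "NONE"]
    print(macro_f1(gold, pred, ["REL_A", "REL_B", "NONE"]))
    print(macro_f1(gold, pred, ["REL_A", "REL_B"]))
```

Keeping the per-relation scores separate before averaging also makes it straightforward to check which relations are harder to classify than others.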

Organizers: Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, Stan Szpakowicz. Contact: Preslav Nakov.
Web Site: http://docs.google.com/View?docid=dfvxd49s_36c28v9pmw

[ Ranking]

  • Trial data released : August 30, 2009
  • Training data release: February 26 , 2010
  • Test data release: March 18 , 2010
  • Result submission deadline: within seven days after downloading the *test* data, but not later than April 2
  • Organizers send test results: April 10, 2010

#9  Noun Compound Interpretation Using Paraphrasing Verbs 


Noun compounds -- sequences of nouns acting as a single noun, e.g. colon cancer -- are abundant in English. Understanding their syntax and semantics is challenging but important for many NLP applications, including but not limited to Question Answering, Machine Translation, Information Retrieval and Information Extraction. For example, a question-answering system might need to determine whether protein acting as a tumor suppressor is a good paraphrase for tumor suppressor protein, and an information extraction system might need to decide whether neck vein thrombosis and neck thrombosis could possibly co-refer when used in the same document. Similarly, a machine translation system facing the unknown noun compound WTO Geneva headquarters might benefit from being able to paraphrase it as Geneva headquarters of the WTO or as WTO headquarters located in Geneva. Given a query like "migraine treatment", an information retrieval system could use paraphrasing verbs like relieve and prevent for page ranking and query refinement.

We will explore the idea of using paraphrasing verbs and prepositions for noun compound interpretation. For example, nut bread can be paraphrased using verbs like contain and include, prepositions like with, and verbs+prepositions like be made from. Unlike traditional abstract relations such as CAUSE, CONTAINER, and LOCATION, verbs and prepositions are directly usable as paraphrases, and using several of them simultaneously yields an appealing fine-grained semantic representation.

We will release as trial/development data paraphrasing verbs and prepositions for 250 compounds, manually picked by 25-30 human subjects. For example, for nut bread we have the following paraphrases (the number of subjects who proposed each paraphrase is in parentheses):

contain(21); include(10); be made with(9); have(8); be made from(5); use(3); be made using(3); feature(2); be filled with(2); taste like(2); be made of(2); come from(2); consist of(2); hold(1); be composed of(1); be blended with(1); be created out of(1); encapsulate(1); diffuse(1); be created with(1); be flavored with(1), ...

Given a compound and a set of paraphrasing verbs and prepositions, the participants must provide a ranking that is as close as possible to the one proposed by human raters. Trial data and an automatic scorer will be made available well in advance (by June 2009). All data will be released under a Creative Commons license.
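How close a system ranking is to the human one can be measured in several ways; the text above does not fix a measure. As one possible choice, the sketch below uses Spearman's rank correlation, computed as the Pearson correlation of average ranks so that ties are handled. The human counts come from the nut bread example; the system scores and function names are hypothetical.

```python
import math
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values in descending order (1 = highest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(human: Dict[str, float], system: Dict[str, float]) -> float:
    """Spearman correlation between human counts and system scores."""
    verbs = sorted(human)
    x = average_ranks([human[v] for v in verbs])
    y = average_ranks([system.get(v, 0.0) for v in verbs])
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


if __name__ == "__main__":
    human = {"contain": 21, "include": 10, "be made with": 9, "have": 8, "be made from": 5}
    system = {"contain": 0.9, "include": 0.4, "be made with": 0.7, "have": 0.2, "be made from": 0.1}
    print(spearman(human, system))
```

A rank-based measure rewards getting the order right without requiring the system to predict the raw counts, which vary with the number of human subjects.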

Organizers: Ioanacristina Butnariu, Su Nam Kim, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, Tony Veale. Contact: Preslav Nakov
Web Site: http://docs.google.com/View?docid=dfvxd49s_35hkprbcpt

[ Ranking]

  • Trial data released : August 30, 2009
  • Training data release: February 17 , 2010
  • Test data release: March 18 , 2010
  • Result submission deadline: within seven days after downloading the *test* data, but not later than April 2
  • Organizers send test results: April 10, 2010

#10  Linking Events and their Participants in Discourse 


Semantic role labelling (SRL) has traditionally been viewed as a sentence-internal problem. However, it is clear that there is an interplay between local semantic argument structure and the surrounding discourse. In this shared task, we would like to take SRL of nominal and verbal predicates beyond the domain of isolated sentences by linking local semantic argument structures to the wider discourse context. In particular, we aim to find fillers for roles which are left unfilled in the local context (null instantiations, NIs). An example is given below, where the "charges" role ("arg2" in PropBank) of cleared is left empty but can be linked to murder in the previous sentence.

In a lengthy court case the defendant was tried for murder. In the end, he was cleared.


There will be two tasks, which will be evaluated independently (participants can choose to enter either or both):
For the Full Task the target predicates in the (test) data set will be annotated with gold standard word senses (frames). The participants have to:

  • find the semantic arguments of the predicate (role recognition)
  • label them with the correct role (role labelling)
  • find links between null instantiations and the wider context (NI linking)

For the NIs only task, participants will be supplied with a test set which is already annotated with gold standard local semantic argument structure; only the referents for null instantiations have to be found.


We will prepare new training and test data consisting of running text from the fiction domain. The data sets will be freely available. The training set for both tasks will be annotated with gold standard semantic argument structure (see for example the FrameNet full text annotation) and linking information for null instantiations. We aim to annotate the semantic argument structures both in FrameNet and PropBank style; participants can choose which one they prefer.

Organizers: Josef Ruppenhofer (Saarland University), Caroline Sporleder (Saarland University), Roser Morante (University of Antwerp), Collin Baker (ICSI, Berkeley), Martha Palmer (University of Colorado, Boulder)
Web Site: http://www.coli.uni-saarland.de/projects/semeval2010_FG/

[ Ranking]

  • Test data release: March 26th
  • Closing competition : April 2nd

#11  Event Detection in Chinese News Sentences 

The goal of the task is to detect and analyze basic event content in real-world Chinese news texts. It consists of finding the key verbs or verb phrases that describe these events in Chinese sentences after word segmentation and part-of-speech tagging, selecting a suitable situation description formula for each of them, and anchoring the situation arguments to suitable syntactic chunks in the sentence. The three main sub-tasks are as follows:
  1. Target verb WSD: to recognize whether the sentence contains key verbs or verb phrases that describe two focused event contents, and to select a suitable situation description formula for each recognized key verb (or verb phrase) from a situation network lexicon. The input of the sub-task is a Chinese sentence annotated with correct word segmentation and POS tags. Its output is the sense selection or disambiguation tags of the target verbs in the sentence.
  2. Sentence SRL: to anchor the situation arguments to suitable syntactic chunks in the sentence, and to annotate these arguments with suitable syntactic constituent and functional tags. Its input is a Chinese sentence annotated with correct word segmentation, POS tags and the sense tags of the target verbs in the sentence. Its output is the syntactic chunk recognition and situation argument anchoring results.
  3. Event detection: to detect and analyze the special event content through the interaction of target verb WSD and sentence SRL. Its input is a Chinese sentence annotated with correct word segmentation and POS tags. Its output is a complete event description detected in the sentence (if it has a focused target verb).
The following example illustrates the procedure. Consider this Chinese sentence after word segmentation and POS tagging:

今天/n(Today) 我/r(I) 在/p(at) 书店/n(bookstore) 买/v(buy) 了/u(-ed) 三/m(three) 本/q 新/a(new) 书/n(book) 。/w (Today, I bought three new books at the bookstore.)

After the first processing stage: target verb WSD, we find there is a possession-transferring verb ‘买/v(buy)’ in the sentence and select the following situation description formula for it:

买/v(buy): DO(x, P(x,y)) CAUSE have(x,y) AND NOT have(z,y) [P=buy]
Then, we anchor four situation arguments with suitable syntactic chunks in the sentence and obtain the following sentence SRL result:

今天/n(Today) [S-np 我/r(I) ]x [D-pp 在/p(at) 书店/n(bookstore) ]z [P-vp 买/v(buy) 了/u(-ed) ]Tgt [O-np 三/m(three) 本/q 新/a(new) 书/n(book) ]y 。/w

Finally, we can get the following situation description for the sentence:

DO(x, P(x,y)) CAUSE have(x,y) AND NOT have(z,y) [x=我/r(I), y=三/m(three) 本/q 新/a(new) 书/n(book), z=书店/n(bookstore), P=买/v(buy)]
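The step from the bracketed SRL result to the variable bindings can be sketched as below. The sketch assumes the notation of the example, in which each chunk is written as "[TAG tokens ]label", and simply maps each label to its chunk tag and token string; the regular expression and function name are hypothetical, and the real data format may differ. Note that the whole D-pp chunk is returned for z, whereas the description above binds z to the noun only.

```python
import re
from typing import Dict, Tuple

# One chunk: "[", a tag such as S-np, the tokens, "]" and the argument label.
CHUNK = re.compile(r"\[(\S+)\s+(.*?)\s*\](\w+)")


def read_chunks(annotated: str) -> Dict[str, Tuple[str, str]]:
    """Map each argument label (x, y, z, Tgt) to (chunk tag, chunk tokens)."""
    return {label: (tag, text) for tag, text, label in CHUNK.findall(annotated)}


if __name__ == "__main__":
    srl = ("今天/n(Today) [S-np 我/r(I) ]x [D-pp 在/p(at) 书店/n(bookstore) ]z "
           "[P-vp 买/v(buy) 了/u(-ed) ]Tgt [O-np 三/m(three) 本/q 新/a(new) 书/n(book) ]y 。/w")
    for label, (tag, text) in read_chunks(srl).items():
        print(label, tag, text)
```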

Organizers: Qiang Zhou (Tsinghua University, Beijing, China)
Web Site: http://www.ncmmsc.org/SemEval-2010-Task/


#12  Parser Training and Evaluation using Textual Entailment 

We propose a targeted textual entailment task designed to train and evaluate parsers. Recent approaches to cross-framework parser evaluation employ framework-independent representations such as GR and SD schemes. However, there is still arbitrariness in the definition of such a scheme, and the conversion is problematic. Our approach takes this idea one step further. Correct parse decisions are captured by natural language sentences called textual entailments. Participants make a yes/no choice on a given entailment. It will be possible to decide automatically which entailments are implied based on the parser output only, i.e. there will be no need for lexical semantics, anaphora resolution etc.

- Final-hour trading accelerated to 108.1 million shares, a record for the Big Board.
  • 108.1 million shares was a record. – YES
  • Final-hour trading accelerated a record. – NO
The proposed task is desirable for several reasons. First, textual entailments focus on the semantically meaningful parser decisions. Trivial differences are abstracted away, which should result in a more accurate assessment of parser performance on real-world applications. Second, no formal training is required. Annotation will be easier and annotation errors will have a less detrimental effect on evaluation accuracy. Finally, entailments will be non-trivial since they will be collected by considering the differences between the outputs of different state-of-the-art parsers.
The participants will be provided with development (trial) and test sets of entailments and they will be evaluated using the standard tools and methodology of the RTE challenges. We hope the task will be interesting for participants with Parsing, Semantic Role Labeling, or RTE backgrounds.
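
As a rough illustration of how yes/no entailment decisions might be scored, the sketch below computes plain accuracy. It assumes that accuracy over the decisions is the measure of interest; the actual scoring tools are not specified here, and the function name and the system answers are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of entailment decisions that match the gold YES/NO labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no entailments to score")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Gold labels for the two example entailments above.
gold_labels = ["YES", "NO"]
# Hypothetical system output that wrongly accepts the second entailment.
system_labels = ["YES", "YES"]

score = accuracy(gold_labels, system_labels)
print(score)
```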

Organizers: Deniz Yuret (Koc University)
Web Site: http://pete.yuret.com/


  • There will be no training data. The test data will be available on March 26
  • Closing competition : April 2nd

#13  TempEval 2 

Evaluating Events, Time Expressions, and Temporal Relations

Newspaper texts, narratives and other texts describe events occurring in time, explicitly and implicitly specifying the temporal location and order of these events. Text comprehension requires the capability to identify the events described in a text and to locate them in time.

We provide three tasks that are relevant to understanding the temporal structure of a text: (i) identification of events, (ii) identification of time expressions, and (iii) identification of temporal relations. The temporal relations task is further divided into four subtasks, requiring systems to recognize which of a fixed set of temporal relations holds between (a) events and time expressions within the same sentence, (b) events and the document creation time, (c) main events in consecutive sentences, and (d) two events where one syntactically dominates the other.

Data sets will be provided for five languages: English, Italian, Spanish, Chinese and Korean. The data sets do not comprise a parallel corpus and sizes may range from 25K to 150K tokens. The annotation scheme used is based on TimeML. TimeML (http://www.timeml.org) has been developed over the last decade as a general multilingual markup language for temporal information in texts and is currently vetted as an ISO standard.

Participants can choose any combination of the three main tasks and the five languages.

TempEval-2 is a follow-up to TempEval-1, which was an initial evaluation exercise based on three limited temporal relation tasks. See http://www.timeml.org/tempeval-2/ for more information.

Organizers: James Pustejovsky, Marc Verhagen, Nianwen Xue (Brandeis University)
Web Site: http://www.timeml.org/tempeval2/

  • March 12th, first batch of training data
  • March 21st, second batch of training data
  • March 28th, evaluation data
  • April 2nd, close of Tempeval competition

#14  Word Sense Induction 


This task is a continuation of the WSI task (i.e. Task 2) of SemEval 2007 (nlp.cs.swarthmore.edu/semeval/tasks/task02/summary.shtml) with some significant changes to the evaluation setting.

Word Sense Induction (WSI) is defined as the process of identifying the different senses (or uses) of a target word in a given text in an automatic and fully-unsupervised manner. The goal of this task is to allow comparison of unsupervised sense induction and disambiguation systems. A secondary outcome of this task will be to provide a comparison with current supervised and knowledge-based methods for sense disambiguation. The evaluation scheme consists of the following assessment methodologies:

  • Unsupervised Evaluation. The induced senses are evaluated as clusters of examples and compared to sets of examples that have been tagged with gold standard (GS) senses. The evaluation metric used, V-measure (Rosenberg & Hirschberg, 2007), attempts to measure both the homogeneity and the completeness (coverage) of a clustering solution. Perfect homogeneity is achieved if every cluster of a clustering solution contains only data points that are elements of a single GS class. Perfect completeness, on the other hand, is achieved if all the data points that are members of a given class are also elements of the same cluster. Homogeneity and completeness can be treated in a similar fashion to precision and recall, where increasing the former often results in decreasing the latter (Rosenberg & Hirschberg, 2007).

  • Supervised Evaluation. The second evaluation setting, supervised evaluation, assesses WSI systems in a WSD task. A mapping is created between induced sense clusters (from the unsupervised evaluation described above) and the actual GS senses. The mapping matrix is then used to tag each instance in the testing corpus with GS senses. The usual recall/precision measures for WSD are then used. Supervised evaluation was a part of the SemEval-2007 WSI task (Agirre & Soroa, 2007).
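
For the unsupervised setting, the following is a minimal sketch of V-measure written directly from the conditional-entropy definitions in Rosenberg & Hirschberg (2007). It is not the official task scorer; the function names, the default `beta=1.0` and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def _entropy(labels):
    """H(X) in nats for a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(targets, given):
    """H(targets | given) in nats for two parallel label lists."""
    n = len(targets)
    joint = Counter(zip(targets, given))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(gold, clusters, beta=1.0):
    """Return (homogeneity, completeness, V) for gold classes vs. induced clusters."""
    h_gold = _entropy(gold)
    h_clusters = _entropy(clusters)
    # Homogeneity: each cluster should contain members of a single gold class.
    homogeneity = 1.0 if h_gold == 0 else 1.0 - _conditional_entropy(gold, clusters) / h_gold
    # Completeness: each gold class should fall inside a single cluster.
    completeness = 1.0 if h_clusters == 0 else 1.0 - _conditional_entropy(clusters, gold) / h_clusters
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
    return homogeneity, completeness, v


# Toy data: four instances of one target word with two gold senses.
gold_senses = ["s1", "s1", "s2", "s2"]
one_cluster_each = [0, 1, 2, 3]   # every instance in its own cluster
h, c, v = v_measure(gold_senses, one_cluster_each)
print(h, c, v)
```

Putting every instance in its own cluster gives perfect homogeneity but poor completeness, which is the trade-off described above; putting everything in a single cluster does the reverse.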

Andrew Rosenberg and Julia Hirschberg. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) Prague, Czech Republic, (June 2007). ACL.

Eneko Agirre and Aitor Soroa. Semeval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations, pp. 7-12, Prague, Czech Republic, (June 2007). ACL.

Organizers: Suresh Manandhar (University of York), Ioannis Klapaftis (University of York), and Dmitriy Dligach (University of Colorado)
Web Site: http://www.cs.york.ac.uk/semeval2010_WSI/


#15  Infrequent Sense Identification for Mandarin Text to Speech Systems 

There are seven cases of grapheme-to-phoneme (GTP) conversion in a text-to-speech (TTS) system (Yarowsky, 1997). Among them, the most difficult is disambiguating homographs that have the same POS (part of speech) but different pronunciations. In this case, different pronunciations of the same word always correspond to different word senses. Once the word senses are disambiguated, the GTP problem is resolved.

This task differs slightly from traditional WSD (word sense disambiguation): two or more senses may correspond to one pronunciation, so the sense granularity is coarser than in WSD. For example, the preposition “为” has three senses: sense1 and sense2 have the same pronunciation {wei 4}, while sense3 corresponds to {wei 2}. For each target word, both the pronunciations and the sense labels are provided for training, but only the pronunciations are evaluated at test time. The challenge of this task is the highly skewed distribution in real text: the most frequent pronunciation usually accounts for over 80% of the instances.
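
The effect of evaluating pronunciations rather than senses can be shown with the “为” example. In this sketch the mapping follows the text, while the gold and predicted sense lists, and the names used, are hypothetical.

```python
# Sense-to-pronunciation mapping for 为, as described above.
SENSE_TO_PRONUNCIATION = {"sense1": "wei4", "sense2": "wei4", "sense3": "wei2"}


def pronunciation_matches(gold_senses, predicted_senses, mapping=SENSE_TO_PRONUNCIATION):
    """True for each instance whose predicted pronunciation equals the gold one."""
    return [mapping[g] == mapping[p] for g, p in zip(gold_senses, predicted_senses)]


# Hypothetical gold annotations and system predictions for four instances.
gold = ["sense1", "sense2", "sense3", "sense1"]
pred = ["sense2", "sense2", "sense1", "sense1"]

matches = pronunciation_matches(gold, pred)
sense_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
pronunciation_accuracy = sum(matches) / len(matches)
print(sense_accuracy, pronunciation_accuracy)
```

Confusing sense1 with sense2 costs nothing at the pronunciation level, whereas confusing sense3 with sense1 does.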

In this task, we will provide a large volume of training data (at least 300 instances for each homograph) that follows the true distribution in real text. In the test data, we will provide at least 100 instances for each target word. In order to focus on the performance of identifying the infrequent sense, we will intentionally split the test dataset half and half between instances of the infrequent pronunciation and instances of the frequent one. The evaluation method uses precision and recall.

All instances come from the People's Daily newspaper (the most popular newspaper in Mandarin). Double-blind annotation is performed manually, and a third annotator checks the annotation.

Yarowsky, David (1997). “Homograph disambiguation in text-to-speech synthesis.” In van Santen, Jan T. H.; Sproat, Richard; Olive, Joseph P.; and Hirschberg, Julia. Progress in Speech Synthesis. Springer-Verlag, New York, 157-172.

Organizers: Peng Jin, Yunfang Wu and Shiwen Yu (Peking University Beijing, China)
Web Site:



  • Test data release: March 25, 2010
  • Result submission deadline: March 29, 2010
  • Organizers send the test results: April 2, 2010

#16  Japanese WSD 

This task can be considered an extension of SENSEVAL-2 JAPANESE LEXICAL SAMPLE Monolingual dictionary-based task. Word senses are defined according to the Iwanami Kokugo Jiten, a Japanese dictionary published by Iwanami Shoten. Please refer to that task for reference. We think that our task has the following two new characteristics:

1) All previous Japanese sense-tagged corpora were built from newspaper articles, while English sense-tagged corpora have been constructed on balanced corpora, such as the Brown corpus and the BNC. The first balanced corpus of contemporary written Japanese (BCCWJ corpus) is now being constructed as part of a national project in Japan [Maekawa, 2008], and we are now constructing a sense-tagged corpus on it. Therefore, the task will use the first balanced Japanese sense-tagged corpus.

2) In previous WSD tasks, systems have been required to select a sense from a given set of senses in a dictionary for a word in one context (an instance). However, the set of senses in the dictionary is not always complete. New word senses sometimes appear after the dictionary has been compiled. Therefore, some instances might have a sense that cannot be found in a set in the dictionary. The task will take into account not only the instances having a sense in the given set but also the instances having a sense that cannot be found in the set. In the latter case, systems should output that the instances have a sense that is not in the set.
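
One simple way a system could produce the "not in the set" output is a confidence threshold. The sketch below is only one possible strategy and is not prescribed by the task; the label `NOT_IN_DICTIONARY`, the threshold value, the scores and the sense identifiers are all hypothetical.

```python
NOT_IN_DICTIONARY = "X"  # hypothetical label for "sense not in the given set"


def choose_sense(scores, threshold=0.5):
    """Pick the best-scoring dictionary sense, or fall back to NOT_IN_DICTIONARY.

    `scores` maps each candidate sense id to a confidence in [0, 1].
    """
    if not scores:
        return NOT_IN_DICTIONARY
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else NOT_IN_DICTIONARY


# Hypothetical confidences for two instances of the same word.
confident = choose_sense({"sense-a": 0.8, "sense-b": 0.2})
uncertain = choose_sense({"sense-a": 0.3, "sense-b": 0.25})
print(confident, uncertain)
```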

Organizers: Manabu Okumura (Tokyo Institute of Technology), Kiyoaki Shirai (Japan Advanced Institute of Science and Technology)
Web Site: http://lr-www.pi.titech.ac.jp/wsd.html

#17  All-words Word Sense Disambiguation on a Specific Domain (WSD-domain) 

Domain adaptation is a hot issue in Natural Language Processing, including Word Sense Disambiguation. WSD systems trained on general corpora are known to perform worse when moved to specific domains. The WSD-domain task will offer a testbed for domain-specific WSD systems and will make it possible to test domain portability issues.

Texts from ECNC and WWF will be used to build domain-specific test corpora (see example below). The data will be available in a number of languages: English, Dutch and Italian, and possibly Basque and Chinese (confirmation pending). The sense inventories will be based on wordnets of the respective languages.

The test data will comprise three documents (6000 word chunk with approx. 2000 target words) for each language. The test data will be annotated by hand using double-blind annotation plus adjudication. Inter-Tagger Agreement will be measured. There will not be training data available, but participants are free to use existing hand-tagged corpora and lexical resources. Traditional precision and recall measures will be used in order to evaluate the participant systems, as implemented in past WSD Senseval and SemEval tasks.
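
In Senseval-style scoring, precision is usually computed over the instances a system attempts and recall over all gold instances, so the two differ when a system abstains. The sketch below assumes that convention; it is not the official scorer, and the instance identifiers and sense labels are hypothetical.

```python
def precision_recall(gold, answers):
    """gold: instance id -> gold sense; answers: instance id -> sense, attempted only."""
    attempted = [i for i in answers if i in gold]
    correct = sum(1 for i in attempted if answers[i] == gold[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard for four target words.
gold_key = {"d1.t1": "s1", "d1.t2": "s2", "d1.t3": "s1", "d1.t4": "s3"}
# Hypothetical system output: three instances attempted, two of them correct.
system_key = {"d1.t1": "s1", "d1.t2": "s2", "d1.t3": "s2"}

precision, recall = precision_recall(gold_key, system_key)
print(precision, recall)
```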

WSD-domain is being developed in the framework of the Kyoto project (http://www.kyoto-project.eu/).

Environment domain text example:
"Projections for 2100 suggest that temperature in Europe will have risen by between 2 to 6.3 °C above 1990 levels. The sea level is projected to rise, and a greater frequency and intensity of extreme weather events are expected. Even if emissions of greenhouse gases stop today, these changes would continue for many decades and in the case of sea level for centuries. This is due to the historical build up of the gases in the atmosphere and time lags in the response of climatic and oceanic systems to changes in the atmospheric concentration of the gases."

Organizers: Eneko Agirre and Oier Lopez de Lacalle (Basque Country University)
Web Site: http://xmlgroup.iit.cnr.it/SemEval2010/


  • Test data release: March 26
  • Closing competition : April 2

#18  Disambiguating Sentiment Ambiguous Adjectives 

Some adjectives are neutral in sentiment polarity out of context but take on a positive, neutral or negative meaning in a specific context. Such words can be called dynamic sentiment ambiguous adjectives. For instance, “价格高|the price is high” indicates a negative meaning, while “质量高|the quality is high” has a positive connotation. Disambiguating sentiment ambiguous adjectives is an interesting task at the intersection of word sense disambiguation and sentiment analysis. However, in previous work, sentiment ambiguous words have not been tackled in the field of WSD, and they are also crudely discarded by most research on sentiment analysis.

This task aims to create a benchmark dataset for disambiguating dynamic sentiment ambiguous adjectives. Sentiment ambiguous words are pervasive in many languages. In this task we concentrate on Chinese, but we think the disambiguation techniques should be language-independent. In total, 14 dynamic sentiment ambiguous adjectives are selected, all of which are high-frequency words in Mandarin Chinese. They are: 大|big, 小|small, 多|many, 少|few, 高|high, 低|low, 厚|thick, 薄|thin, 深|deep, 浅|shallow, 重|heavy, 轻|light, 巨大|huge, 重大|grave.

The dataset contains two parts. Some sentences containing these target adjectives will be extracted from Chinese Gigaword (LDC corpus: LDC2005T14), and the other sentences will be gathered through a search engine such as Google. First, these sentences will be automatically segmented and POS-tagged. Then the ambiguous adjectives will be manually annotated with the correct sentiment polarity within the sentence context. Two human annotators will annotate the sentences double-blind, and a third annotator will check the annotation.

This task will be carried out in an unsupervised setting, and consequently no training data will be provided. All the data, about 4,000 sentences, will be provided as the test set. Evaluation will be performed in terms of the usual precision, recall and F1 scores.

Organizers: Yunfang Wu, Peng Jin, Miaomiao Wen and Shiwen Yu (Peking University, Beijing, China)
Web Site:



  • Test data release: March 23, 2010
  • Result submission deadline: postponed to March 27, 2010, 4 days after downloading the test data
  • Organizers send the test results: April 2, 2010

© 2008 FBK-irst