Evaluation Exercises on Semantic Evaluation - 2010
ACL SigLex event
TASKS
Available tasks: 18
#1 Coreference Resolution in Multiple Languages
Description
Using coreference information has been shown to be beneficial in a number
of NLP applications including Information Extraction, Text Summarization,
Question Answering and Machine Translation. This task is concerned with
intra-document coreference resolution for six different languages:
Catalan, Dutch, English, German, Italian and Spanish. The complete task
is divided into two subtasks for each of the languages:
1. Detection of full coreference chains, composed of named entities, pronouns, and full noun phrases.
2. Pronominal resolution, i.e., finding the
antecedents of the pronouns in the text.
Data
Data is provided for both statistical training and evaluation. The coreference chains are extracted from manually annotated corpora: the AnCora corpora for Catalan and Spanish, the OntoNotes corpus for English, the TüBa-D/Z corpus for German, the KNACK corpus for Dutch, and the LiveMemories corpus for Italian, additionally enriched with morphological, syntactic and semantic information (such as gender, number, constituents, dependencies, predicates, etc.). Great effort has been devoted to providing the participants with a common and relatively simple data representation for all the languages.
Goals
The main goal is to perform and
evaluate coreference resolution for six different languages with the help
of other layers of linguistic information and using different evaluation
metrics (MUC, B-CUBED, CEAF and BLANC).
1. The multilingual context will make it possible to study the portability of coreference resolution systems across languages. To
what extent is it possible to implement a general system that is portable
to all six languages? How much language-specific tuning is necessary? Are
there significant differences between Germanic and Romance languages? And
between languages of the same family?
2. The additional layers of annotation will make it possible to study how helpful morphology, syntax and semantics are for resolving coreference relations. How much preprocessing is needed? How much does
the quality of the preprocessing modules (perfect linguistic input vs.
noisy automatic input) affect the performance of state-of-the-art
coreference resolution systems? Is morphology more helpful than syntax?
Or semantics? Or is syntax more helpful than semantics?
3. The use of four different evaluation metrics will make it possible to compare the advantages and drawbacks of the generally used MUC, B-CUBED and CEAF measures, as well as the newly proposed BLANC
measure. Do all of them provide the same ranking? Are they correlated?
Can systems be optimized under all four metrics at the same time?
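To make one of these metrics concrete, the following is a minimal Python sketch of mention-level B-CUBED precision and recall. The representation (entities as sets of mention identifiers) and the handling of unresolved mentions are our own simplification for illustration, not the format or behaviour of the official scorer.

# Minimal sketch of B-CUBED precision/recall for coreference chains.
# Entities are given as sets of mention identifiers; this is a
# simplification for illustration, not the official scorer's format.

def b_cubed(key_chains, response_chains):
    """Return (precision, recall) averaged over response/key mentions."""
    def chain_of(mention, chains):
        for chain in chains:
            if mention in chain:
                return chain
        return {mention}  # an unresolved mention counts as a singleton

    key_mentions = {m for chain in key_chains for m in chain}
    resp_mentions = {m for chain in response_chains for m in chain}

    # Precision is averaged over the mentions in the system response.
    precision = sum(
        len(chain_of(m, response_chains) & chain_of(m, key_chains))
        / len(chain_of(m, response_chains))
        for m in resp_mentions
    ) / len(resp_mentions)

    # Recall is averaged over the mentions in the gold key.
    recall = sum(
        len(chain_of(m, key_chains) & chain_of(m, response_chains))
        / len(chain_of(m, key_chains))
        for m in key_mentions
    ) / len(key_mentions)

    return precision, recall

# Toy example: the gold key has one chain {a, b, c}; the system splits it.
print(b_cubed([{"a", "b", "c"}], [{"a", "b"}, {"c"}]))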
Evaluation
Two different scenarios will be
considered for evaluation. In the first one, gold‐standard annotation
will be provided to participants (up to full syntax and possibly
including also semantic role labeling). This input annotation will
correctly identify all noun phrases that are part of the coreference
chains. In the second scenario we will use state-of-the-art automatic
linguistic tools to generate the input annotation of the data. In this
second scenario, the matching between the automatically generated
structure and the real NPs intervening in the chains does not need to be
perfect. By defining these two experimental settings, we will be able to
check the effectiveness of state‐of‐the‐art coreference resolution
systems when working with perfect linguistic (syntactic/semantic)
information and the degradation in performance when moving to a realistic
scenario. In parallel, we will also differentiate between closed and open
settings, that is, when participants are allowed to use strictly the
information contained in the training data (closed) and when they make
use of some external resources/tools (open).
Organizers: Veronique Hoste, Lluis Marquez, M. Antonia Marti,
Massimo Poesio, Marta Recasens, Emili Sapena, Mariona Taule, Yannick
Versley. (Universitat de Barcelona, Universitat Politècnica de Catalunya,
Hogeschool Gent, Università di Trento, Universität Tübingen)
Web Site: http://stel.ub.edu/semeval2010-coref/
Timeline:
· Training data release: February 11th
· Test data release: March 20th
· Time constraint: upload the results no more than 7 days after downloading the test set
· Closing competition: April 2nd
#2 Cross-Lingual Lexical Substitution
Description
The goal of this task is to
provide a framework for the evaluation of systems for cross-lingual lexical
substitution. Given a paragraph and a target word, the goal is to provide
several correct translations for that word in a given language, with the
constraint that the translations fit the given context in the source
language. This is a follow-up to the English
lexical substitution task from SemEval-2007 (McCarthy and Navigli,
2007), but this time the task is cross-lingual.
While there are connections
between this task and the task of automatic machine translation, there
are several major differences. First, cross-lingual lexical substitution
targets one word at a time, rather than an entire sentence as machine
translation does. Second, in cross-lingual lexical substitution we seek
as many good translations as possible for the given target word, as
opposed to just one translation, which is the typical output of machine translation.
There are also connections between this task and a word sense
disambiguation task which uses distinctions in translations for word
senses (Resnik and Yarowsky, 1997); however, in this task we do not
restrict the translations to those in a specific parallel corpus; the
annotators and systems are free to choose the translations from any
available resource. Also, we do not assume a fixed grouping of
translations to form "senses" and so it is possible that any
token instance of a word may have translations in common with other token
instances that are not themselves directly related.
Given a paragraph and a target
word, the task is to provide several correct translations for that word
in a given language. We will use English as the source language and
Spanish as the target language.
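As an illustration of how such answers might be scored, here is a rough Python sketch in the spirit of the "best" measure from the SemEval-2007 lexical substitution task, adapted to translations. The Spanish gold frequencies below are invented for the example, and the exact normalisation used by the official scorer may differ.

# Rough sketch of a "best"-style score: credit is the mean gold frequency
# of the system's answers, normalised by the total number of gold
# annotations for the item. Gold data here is hypothetical.

from collections import Counter

def best_score(system_answers, gold_counts):
    """Score one target-word instance against annotator translations."""
    if not system_answers or not gold_counts:
        return 0.0
    total_gold = sum(gold_counts.values())
    hit = sum(gold_counts.get(ans, 0) for ans in system_answers)
    return hit / (len(system_answers) * total_gold)

# Target "bank" in a river context, translated into Spanish.
gold = Counter({"orilla": 3, "ribera": 2, "margen": 1})
print(best_score(["orilla"], gold))            # single confident answer
print(best_score(["orilla", "banco"], gold))   # hedging lowers the credit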
References
· Diana McCarthy and Roberto Navigli (2007). SemEval-2007 Task 10: English Lexical Substitution Task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, pp. 48-53.
· Philip Resnik and David Yarowsky (2000). "Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation". Natural Language Engineering 5(2), pp. 113-133.
Organizers: Rada Mihalcea (University of North Texas), Diana
McCarthy (University of Sussex), Ravi Sinha (University of North Texas)
Web Site: http://lit.csci.unt.edu/index.php/Semeval_2010
Timeline:
· Test data availability: 1 March - 2 April, 2010
· Result submission deadline: within 7 days after downloading the *test* data
· Closing competition for this task: 2 April
#3 Cross-Lingual Word Sense Disambiguation
Description
There is a general feeling in the WSD community that WSD should not be considered an isolated research task, but should instead be integrated into real NLP applications such as Machine Translation or multilingual IR. Using translations from a corpus instead of human-defined sense labels (e.g. WordNet) makes it easier to integrate WSD into multilingual applications, solves the granularity problem (which may itself be task-dependent), is language-independent, and is a valid alternative for languages that lack sufficient sense inventories and sense-tagged corpora.
We propose an Unsupervised Word
Sense Disambiguation task for English nouns by means of parallel corpora.
The sense label is composed of translations in the different languages
and the sense inventory is built up by three annotators on the basis of
the Europarl parallel
corpus by means of a concordance tool. All translations
(above a predefined frequency threshold) of a polysemous word are grouped
into clusters/"senses" of that given word.
Languages: English - Dutch,
French, German, Italian, Spanish
Subtasks:
1. Bilingual Evaluation
(English - Language X)
Example:
[English] ... equivalent to giving fish to people living on the [bank] of
the river ...
Sense Label = {oever/dijk} [Dutch]
Sense Label = {rives/rivage/bord/bords} [French]
Sense Label = {Ufer} [German]
Sense Label = {riva} [Italian]
Sense Label = {orilla} [Spanish]
2. Multi-lingual Evaluation
(English - all target languages)
Example:
... living on the [bank] of the river ...
Sense Label = {oever/dijk, rives/rivage/bord/bords, Ufer, riva, orilla}
Resources
As the task is formulated as an
unsupervised WSD task, we will not annotate any training material.
Participants can use the Europarl corpus that is freely available and
that will be used for building up the sense inventory.
For the test data, native speakers will decide on the correct translation
cluster(s) for each test sentence and give their top-3 translations from
the predefined list of Europarl translations, in order to assign weights
to the translations from the answer clusters for that test sentence.
Participants will receive manually annotated development and test data:
·
Development/sample
data: 5 polysemous English nouns, each with 20 example instances
·
Test data: 20
polysemous English nouns (selected from the test data as used in the
lexical substitution task), each with 50 test instances
Evaluation
The evaluation will be done
using precision and recall. We will perform both a "best
result" evaluation (the first translation returned by a system) and
a more relaxed evaluation for the "top ten" results (the first
ten translations returned by a system).
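The following Python sketch shows one plausible reading of the weighted "best result" evaluation: each gold translation carries a weight derived from the annotators' top-3 choices, and the system's first translation is credited with that weight, normalised by the highest weight available for the item. The Dutch weights are invented, and the official scorer may normalise differently.

# Hedged sketch of a weighted "best result" score for one test instance.
# Gold weights are hypothetical annotator top-3 counts, not task data.

def best_translation_score(system_ranked, gold_weights):
    """Score the system's first translation against weighted gold answers."""
    if not system_ranked or not gold_weights:
        return 0.0
    first = system_ranked[0]
    return gold_weights.get(first, 0.0) / max(gold_weights.values())

# Hypothetical weights for Dutch translations of "bank" (river sense).
gold = {"oever": 3, "dijk": 1}
print(best_translation_score(["oever", "dijk"], gold))  # 1.0
print(best_translation_score(["dijk"], gold))           # 0.33...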
Organizers: Els Lefever and Veronique Hoste (University
College Ghent, Belgium)
Web Site: http://webs.hogent.be/~elef464/lt3_SemEval.html
Timeline:
· Test data availability: 22 March - 25 March, 2010
· Result submission deadline: within 4 days after downloading the *test* data
#4 VP Ellipsis - Detection and Resolution
Description
Verb Phrase Ellipsis (VPE)
occurs in the English language when an auxiliary or modal verb
abbreviates an entire verb phrase recoverable from the linguistic context,
as in the following examples:
· Both Dr. Mason and Dr. Sullivan [oppose federal funding for abortion], as does President Bush, except in cases where a woman's life is threatened.
· They also said that vendors were [delivering goods] more quickly in October than they had for each of the five previous months.
· He spends his days [sketching passers-by], or trying to.
Here the occurrences of VPE are the auxiliary or modal verbs ("does", "had" and "to" in the examples above, originally shown in bold). The antecedent is marked by square brackets.
The Task
The proposed shared task
consists of two subtasks: (1) automatically detecting VPE in free text;
and (2) selecting the textual antecedent of each found VPE. Task 1 is
reasonably difficult (Nielsen 2004 reports an F-score of 71% on Wall
Street Journal data).
Task 2 is challenging. With a
"head match" evaluation Hardt 1997 reports a success rate of
62% for a baseline system based on recency only, and an accuracy of 84%
for an improved system taking recency, clausal relations, parallelism,
and quotation into account. We will make the task more realistic (but
more difficult) by not using head match but rather precision and recall
over each token of the antecedent.
We will provide texts where
sentence boundaries are detected and each sentence is tokenised and
printed on a new line. An occurrence of VPE is marked by a line number
plus token positions of the auxiliary or modal verb. Textual antecedents
are assumed to be on one line, and are marked by the line number plus
begin/end token position.
The Data
As development data we will
provide the stand-off annotation of more than 500 occurrences of manually
annotated VPE in the Wall Street Journal part (all 25 sections) of the
Penn Treebank. We have made an arrangement with the Linguistic Data
Consortium that participants without access to the Penn Treebank can use
the raw texts for the duration of the shared task.
We will also produce a script
that calculates precision and recall of detection and the average F-score
and accuracy of antecedent selection based on overlap with a gold
standard antecedent.
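As an illustration of the token-overlap scoring, the sketch below computes precision, recall and F-score for a proposed antecedent span against a gold span. The (line, start token, end token) span format is our own illustration and not necessarily the input format of the official script.

# Minimal sketch of token-overlap scoring for antecedent selection.
# Spans are (line number, first token index, last token index); this
# representation is illustrative, not the official script's format.

def span_tokens(span):
    line, start, end = span
    return {(line, i) for i in range(start, end + 1)}

def antecedent_prf(gold_span, system_span):
    gold, system = span_tokens(gold_span), span_tokens(system_span)
    overlap = len(gold & system)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold antecedent spans tokens 3-8 on line 12; the system proposes 5-8.
print(antecedent_prf((12, 3, 8), (12, 5, 8)))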
The test data will be a further
collection of newswire (or similar genre) articles. The "gold"
standard of the test data will be determined by using the merged results
of all task participants. Additionally, these will be manually judged by
the organisers.
References
Daniel Hardt (1997): An Empirical Approach to VP Ellipsis. Computational Linguistics 23(4).
Leif A. Nielsen (2004): Verb phrase ellipsis detection using automatically parsed text. Proceedings of the 20th International Conference on Computational Linguistics (Geneva, Switzerland).
Organizers: Johan Bos (University of Rome "La
Sapienza") and Jennifer Spenader (University of Groningen)
Web Site: http://www.sigsem.org/wiki/SemEval_2010:_VP_Ellipsis_Processing
#5 Automatic Keyphrase Extraction from Scientific Articles
Description
Keyphrases are words or short phrases that capture the main topic of a document. As keyphrases represent the key ideas of documents, extracting good keyphrases benefits various natural language processing (NLP) applications, such as summarization, information retrieval (IR) and question answering (QA). In summarization, keyphrases can be used as semantic metadata. In search engines, keyphrases can supplement full-text indexing and assist users in creating good queries. Therefore, the quality of keyphrases has a direct impact on the quality of downstream NLP applications.
Recently, several systems and techniques have been proposed for extracting keyphrases. We therefore propose a shared task in order to provide an opportunity to compare and benchmark such technologies.
In the shared task, the participants will be provided with a set of scientific articles and will be asked to produce the keyphrases for each article.
The organizers will provide trial, training and test data. The average length of the articles is between 6 and 8 pages, including tables and pictures. We will provide two sets of answers: author-assigned keyphrases and reader-assigned keyphrases. All reader-assigned keyphrases are extracted from the papers, whereas some of the author-assigned keyphrases may not occur in the text.
The answer sets contain lemmatized keyphrases. We also accept two alternations of a keyphrase: A of B -> B A (e.g. policy of school = school policy) and A's B -> B A (e.g. school's policy = school policy). However, if the semantics changes as a result of the alternation, we do not include the alternation in the answer set.
In this shared task, we follow the traditional evaluation methodology: we match the keyphrases in the answer sets (i.e. author-assigned and reader-assigned keyphrases) against those the participants provide and calculate precision, recall and F-score. Finally, we rank the participants by F-score.
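A minimal Python sketch of this exact-match evaluation for a single document is given below; the keyphrases used are invented for the example.

# Sketch of exact-match precision/recall/F-score over lemmatized
# keyphrase sets for one document; the phrases below are invented.

def keyphrase_prf(system, gold):
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {"keyphrase extraction", "scientific article", "school policy"}
print(keyphrase_prf({"keyphrase extraction", "summarization"}, gold))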
The Google-group for the task
is at http://groups.google.com.au/group/semeval2010-keyphrase?lnk=gcimh&pli=1
Organizers: Su Nam Kim (University of Melbourne), Olena
Medelyan (University of Waikato), Min-yen Kan (National University of
Singapore), Timothy Baldwin (University of Melbourne)
Web Site: http://docs.google.com/Doc?id=ddshp584_46gqkkjng4
Timeline:
· Test and training data release: Feb. 15th (Monday)
· Closing competition: March 19th (Friday; 5 weeks for competition)
· Results out: by March 31st
· Submission of description papers: April 17, 2010
· Notification of acceptance: May 6, 2010
· Workshop: July 15-16, 2010, ACL, Uppsala
#6 Classification of Semantic Relations between MeSH Entities in Swedish Medical Texts
Description
Task cancelled
There is a growing interest in, and consequently a growing volume of publications on, relation classification in the medical domain. Algorithms for classifying semantic relations have potential uses in many language technology applications, and interest in them has been renewed in recent years. If such semantic relations can be determined, the potential for obtaining more accurate results in systems and applications such as Information Retrieval and Extraction, Summarization, Question Answering, etc. increases, particularly since searching for mere co-occurrence of terms is unfocused and does not guarantee that there is a relation between the identified terms of interest. For instance, knowing the relationship that holds between a medication and a disease or symptom should be useful for searching free text and for obtaining answers more easily to questions such as "What is the effect of treatment with substance X on disease Y?".
Our task, "Classification of Semantic Relations between MeSH Entities in Swedish Medical Texts", deals with the classification of semantic relations between pairs of MeSH entities/annotations. We focus on three entity types: DISEASES/SYMPTOMS (category C in the MeSH hierarchy), CHEMICALS AND DRUGS, and ANALYTICAL, DIAGNOSTIC AND THERAPEUTIC TECHNIQUES AND EQUIPMENT (categories D and E in the MeSH hierarchy). The evaluation task is similar to SemEval-1 Task #4 by Girju et al.: Classification of Semantic Relations between Nominals, and the evaluation methodology will include evaluation criteria similar to those already developed for that task. The datasets for the task will consist of annotated sentences with relevant MeSH entities, including the surrounding context for the investigated entities and their relation within a window of one to two preceding and one to two following sentences. We plan to have about nine semantic relations with approx. 100-200 training sentences and 50-100 test sentences per relation.
Organizers: Dimitrios Kokkinakis (University of Gothenburg),
Dana Dannells (University of Gothenburg), Hercules Dalianis (Stockholm
University)
Web Site: http://demo.spraakdata.gu.se/svedk/semeval/
#7 Argument Selection and Coercion
Description
This task involves identifying
the compositional operations involved in argument selection. Most
annotation schemes to date encoding propositional or predicative content
have focused on the identification of the predicate type, the argument
extent, and the semantic role (or label) assigned to that argument by the
predicate. In contrast, this task attempts to capture the
"compositional history" of the argument selection relative to
the predicate. In particular, this task attempts to identify the
operations of type adjustment induced by a predicate over its arguments
when they do not match its selectional properties. The task is defined as
follows: for each argument of a predicate, identify whether the entity in
that argument position satisfies the type expected by the predicate. If
not, then one needs to identify how the entity in that position satisfies
the typing expected by the predicate; that is, to identify the source and
target types in a type-shifting (or coercion) operation. The possible
relations between the predicate and a given argument will, for this task,
be restricted to selection and coercion. In selection, the argument NP
satisfies the typing requirements of the predicate. For example, in the
sentence "The child threw the ball", the object NP "the
ball" directly satisfies the type expected by the predicate,
Physical Object. If this is not the case, then a coercion has occurred.
For example, in the sentence "The White House denied this
statement.", the type expected in subject position by the predicate
is Human, but the surface NP is typed as Location. The task is to
identify both the type mismatch and the type shift; namely Location ->
Human.
Resources and Corpus Development
The following methodology will
be followed in corpus creation: (1) A set of selection contexts will be
chosen; (2) A set of sentences will be randomly selected for each chosen
context; (3) The target noun phrase will be identified in each sentence,
and a composition type determined in each case; (4) In cases of coercion,
the source and target types for the semantic head of each relevant noun
phrase will be identified. We will perform double annotation and
adjudication over the corpus.
Evaluation Methodology
Precision and recall will be
used as evaluation metrics. A scoring program will be supplied for
participants. Two subtasks will be evaluated separately: (1) identifying
the argument type and (2) identifying the compositional operation (i.e.
selection vs. coercion).
References
J. Pustejovsky, A. Rumshisky, J. L. Moszkowicz, and O. Batiukova. 2009. GLML: Annotating argument selection and coercion. IWCS-8.
Organizers: James Pustejovsky, Nicoletta Calzolari, Anna
Rumshisky, Jessica Moszkowicz, Elisabetta Jezek, Valeria Quochi, Olga
Batiukova
Web Site: http://asc-task.org/
Timeline:
· 11/10/09 - Trial data for English and Italian posted
· 3/10/10 - Training data for English and Italian released
· 3/27/10 - Test data for English and Italian released
· 4/02/10 - Closing competition
#8 Multi-Way Classification of Semantic Relations Between Pairs of Nominals
Description
Recently, the NLP community has shown a renewed interest in deeper
semantic analyses, among them automatic recognition of semantic relations
between pairs of words. This is an important task with many potential
applications including but not limited to Information Retrieval,
Information Extraction, Text Summarization, Machine Translation, Question
Answering, Paraphrasing, Recognizing Textual Entailment, Thesaurus
Construction, Semantic Network Construction, Word Sense Disambiguation,
and Language Modelling.
Despite the interest, progress was slow due to incompatible
classification schemes, which made direct comparisons hard. In addition,
most datasets provided no context for the target relation, thus relying
on the assumption that semantic relations are largely
context-independent, which is often false. A notable exception is
SemEval-2007 Task 4 (Girju&al.,2007), which for the first time
provided a standard benchmark dataset for seven semantic relations in
context. However, this dataset treated each relation separately, asking
for positive vs. negative classification decisions. While some subsequent
publications tried to use the dataset in a multi-way setup, it was not
designed to be used in that manner.
We believe that having a freely available standard benchmark dataset for
*multi-way* semantic relation classification *in context* is much needed
for the overall advancement of the field. That is why we pose as our
primary objective the task of preparing and releasing such a dataset to
the research community.
We will use nine mutually exclusive relations from Nastase & Szpakowicz (2003). The dataset for the task will consist of annotated sentences gathered from the Web and manually marked, with the nominals and the relation indicated. We will provide 1000 examples for each relation, which is a sizeable increase over SemEval-2007 Task 4, where there were about 210 examples for each of the seven relations. There will also be a NONE relation, for which we will have 1000 examples as well.
Using that dataset, we will set up a common evaluation task that will
enable researchers to compare their algorithms. The official evaluation
score will be average F1 over all relations, but we will also check
whether some relations are more difficult to classify than others, and
whether some algorithms are best suited for certain types of relations.
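The sketch below illustrates a macro-averaged F1 of the kind described above. How the official scorer treats the NONE class is not specified here, so NONE is simply excluded from the average in this illustration, and the relation labels and predictions are invented.

# Sketch of macro-averaged F1 over a set of relations; NONE is left out
# of the average here purely for illustration.

def macro_f1(gold_labels, pred_labels, relations):
    scores = []
    for rel in relations:
        tp = sum(g == rel and p == rel for g, p in zip(gold_labels, pred_labels))
        fp = sum(g != rel and p == rel for g, p in zip(gold_labels, pred_labels))
        fn = sum(g == rel and p != rel for g, p in zip(gold_labels, pred_labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

# Invented gold/predicted labels for four sentence instances.
gold = ["Cause-Effect", "Entity-Origin", "NONE", "Cause-Effect"]
pred = ["Cause-Effect", "NONE", "NONE", "Entity-Origin"]
print(macro_f1(gold, pred, ["Cause-Effect", "Entity-Origin"]))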
Trial data and an automatic scorer will be made available well in advance
(by June 2009). All data will be released under a Creative Commons
license.
Organizers: Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva,
Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti,
Lorenza Romano, Stan Szpakowicz. Contact: Preslav Nakov.
Web Site: http://docs.google.com/View?docid=dfvxd49s_36c28v9pmw
Timeline:
· Trial data released: August 30, 2009
· Training data release: February 26, 2010
· Test data release: March 18, 2010
· Result submission deadline: within seven days after downloading the *test* data, but not later than April 2
· Organizers send test results: April 10, 2010
#9 Noun Compound Interpretation Using Paraphrasing Verbs
Description
Noun compounds -- sequences of nouns acting as a single noun, e.g. colon
cancer -- are abundant in English. Understanding their syntax
and semantics is challenging but important for many NLP applications,
including but not limited to Question Answering, Machine Translation,
Information Retrieval and Information Extraction. For example, a
question-answering system might need to determine whether protein
acting as a tumor suppressor is a good paraphrase for tumor
suppressor protein, and an information extraction system might need
to decide whether neck vein thrombosis and neck
thrombosis could possibly co-refer when used in the same
document. Similarly, a machine translation system facing the unknown noun
compound WTO Geneva headquarters might benefit from
being able to paraphrase it as Geneva headquarters of the WTO or
as WTO headquarters located in Geneva. Given a query like
"migraine treatment", an information retrieval system could use
paraphrasing verbs like relieve and prevent for
page ranking and query refinement.
We will explore the idea of using paraphrasing verbs and prepositions for
noun compound interpretation. For example, nut bread can
be paraphrased using verbs like contain and include,
prepositions like with, and verbs+prepositions like be
made from. Unlike traditional abstract relations such as CAUSE,
CONTAINER, and LOCATION, verbs and prepositions are directly usable as
paraphrases, and using several of them simultaneously yields an appealing
fine-grained semantic representation.
We will release as trial/development data paraphrasing verbs and
prepositions for 250 compounds, manually picked by 25-30 human subjects.
For example, for nut bread we have the following
paraphrases (the number of subjects who proposed each paraphrase is in
parentheses):
contain(21); include(10); be
made with(9); have(8); be made from(5); use(3); be made using(3);
feature(2); be filled with(2); taste like(2); be made of(2); come
from(2); consist of(2); hold(1); be composed of(1); be blended with(1);
be created out of(1); encapsulate(1); diffuse(1); be created with(1); be
flavored with(1), ...
Given a compound and a set of paraphrasing verbs and prepositions, the
participants must provide a ranking that is as close as possible to the
one proposed by human raters. Trial data and an automatic scorer will be
made available well in advance (by June 2009). All data will be released
under a Creative Commons license.
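One plausible way to compare a system ranking against the human one is a rank correlation such as Spearman's rho, sketched below with the "nut bread" counts above and invented system scores. The task description does not fix the official measure, and scipy is assumed to be available.

# Sketch: compare a system's paraphrase ranking with human vote counts
# via Spearman's rank correlation; the system scores are invented.

from scipy.stats import spearmanr

# "nut bread": human vote counts vs. hypothetical system scores for
# three paraphrasing verbs: contain, include, be made with.
human_counts = [21, 10, 9]
system_scores = [0.8, 0.5, 0.6]

rho, _ = spearmanr(human_counts, system_scores)
print(f"Spearman correlation: {rho:.2f}")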
Organizers: Ioana Cristina Butnariu, Su Nam Kim, Preslav
Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, Tony Veale. Contact:
Preslav Nakov
Web Site: http://docs.google.com/View?docid=dfvxd49s_35hkprbcpt
Timeline:
· Trial data released: August 30, 2009
· Training data release: February 17, 2010
· Test data release: March 18, 2010
· Result submission deadline: within seven days after downloading the *test* data, but not later than April 2
· Organizers send test results: April 10, 2010
#10 Linking Events and their Participants in Discourse
Description
Semantic role labelling (SRL)
has traditionally been viewed as a sentence-internal problem. However, it
is clear that there is an interplay between local semantic argument
structure and the surrounding discourse. In this shared task, we would
like to take SRL of nominal and verbal predicates beyond the domain of
isolated sentences by linking local semantic argument structures to the
wider discourse context. In particular, we aim to find fillers for roles
which are left unfilled in the local context (null instantiations, NIs).
An example is given below, where the "charges" role
("arg2" in PropBank) of cleared is left empty
but can be linked to murder in the previous
sentence.
In a lengthy court case the
defendant was tried for murder. In the end, he was cleared.
Tasks:
There will be two tasks, which
will be evaluated independently (participants can choose to enter either
or both):
For the Full Task, the target predicates in the (test) data set will be annotated with gold standard word senses (frames). The participants have to:
· find the semantic arguments of the predicate (role recognition)
· label them with the correct role (role labelling)
· find links between null instantiations and the wider context (NI linking)
For the NIs only task, participants will be supplied with a test set which is already annotated with gold standard local semantic argument structure; only the referents for null instantiations have to be found.
Data:
We will prepare new training
and test data consisting of running text from the fiction domain. The
data sets will be freely available. The training set for both tasks will
be annotated with gold standard semantic argument structure (see for
example the FrameNet
full text annotation) and linking information for null
instantiations. We aim to annotate the semantic argument structures both
in FrameNet and PropBank style;
participants can choose which one they prefer.
Organizers: Josef Ruppenhofer (Saarland University), Caroline
Sporleder (Saarland University), Roser Morante (University of Antwerp),
Collin Baker (ICSI, Berkeley), Martha Palmer (University of Colorado,
Boulder)
Web Site: http://www.coli.uni-saarland.de/projects/semeval2010_FG/
Timeline:
· Test data release: March 26th
· Closing competition: April 2nd
#11 Event Detection in Chinese News Sentences
Description
The goal of the task is to detect and analyze basic event contents in real-world Chinese news texts. It consists of finding key verbs or verb phrases that describe these events in Chinese sentences after word segmentation and part-of-speech tagging, selecting a suitable situation description formula for them, and anchoring the different situation arguments to suitable syntactic chunks in the sentence. The three main sub-tasks are as follows:
1. Target verb WSD: to recognize whether there are key verbs or verb phrases describing the two focused event contents in the sentence, and to select a suitable situation description formula for these recognized key verbs (or verb phrases) from a situation network lexicon. The input of the sub-task is a Chinese sentence annotated with correct word segmentation and POS tags. Its output is the sense selection or disambiguation tags of the target verbs in the sentence.
2. Sentence SRL: to anchor different situation
arguments with suitable syntactic chunks in the sentence, and annotate
suitable syntactic constituent and functional tags for these arguments.
Its input is a Chinese sentence annotated with correct word-segmentation,
POS tags and the sense tags of the target verbs in the sentence. Its
output is the syntactic chunk recognition and situation argument
anchoring results.
3. Event detection: to detect and analyze the
special event content through the interaction of target verb WSD and
sentence SRL. Its input is a Chinese sentence annotated with correct
word-segmentation and POS tags. Its output is a complete event
description detected in the sentence (if it has a focused target verb).
The following detailed example illustrates the above procedure. Take a Chinese sentence after word segmentation and POS tagging:
今天/n(Today)
我/r(I) 在/p(at)
书店/n(bookstore) 买/v(buy) 了/u(-ed)
三/m(three) 本/q
新/a(new) 书/n(book) 。/w
(Today, I bought three new books at the bookstore.)
After the first processing stage (target verb WSD), we find that there is a possession-transferring verb ‘买/v(buy)’ in the sentence and select the following situation description formula for it:
买/v(buy):
DO(x, P(x,y)) CAUSE have(x,y) AND NOT have(z,y)
[P=buy]
Then, we anchor four situation arguments with suitable syntactic chunks
in the sentence and obtain the following sentence SRL result:
今天/n(Today)
[S-np 我/r(I) ]x [D-pp 在/p(at) 书店/n(bookstore) ]z [P-vp 买/v(buy) 了/u(-ed) ]Tgt [O-np 三/m(three)
本/q 新/a(new)
书/n(book) ]y 。/w[2]
Finally, we can get the
following situation description for the sentence:
DO(x, P(x,y)) CAUSE have(x,y)
AND NOT have(z,y) [x=我/r(I), y=三/m(three) 本/q 新/a(new) 书/n(book), z=书店/n(bookstore), P=买/v(buy)]
Organizers: Qiang Zhou (Tsinghua University, Beijing, China)
Web Site: http://www.ncmmsc.org/SemEval-2010-Task/
#12 Parser Training and Evaluation using Textual Entailment
Description
We propose a targeted textual entailment task designed to train and
evaluate parsers. Recent approaches to cross-framework parser evaluation
employ framework-independent representations such as GR and SD schemes.
However, there is still arbitrariness in the definition of such a scheme
and the conversion is problematic. Our approach takes this idea one step
further. Correct parse decisions are captured by natural language
sentences called textual entailments. Participants make a yes/no choice
on a given entailment. It will be possible to automatically decide which
entailments are implied based on the parser output only, i.e. there will
be no need for lexical semantics, anaphora resolution etc.
- Final-hour trading accelerated to 108.1 million shares, a record for the Big Board.
· 108.1 million shares was a record. – YES
· Final-hour trading accelerated a record. – NO
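A minimal sketch of the kind of decision the task has in mind is given below: answer YES when every dependency of the hypothesis parse also appears in the parse of the text. The (head, relation, dependent) triples are hand-written for this example and do not come from any particular parser.

# Minimal sketch of an entailment decision made from parser output alone:
# answer YES when every dependency triple of the hypothesis also appears
# in the (normalised) parse of the text. The triples below are hand-written
# for illustration, not real parser output; a real system would first
# normalise constructions such as appositives and copulas so they match.

def entailed(text_deps, hypothesis_deps):
    return set(hypothesis_deps) <= set(text_deps)

text = {
    ("accelerated", "nsubj", "trading"),
    ("accelerated", "prep_to", "shares"),
    ("record", "nsubj", "shares"),   # from the appositive "..., a record ..."
}
hyp_yes = {("record", "nsubj", "shares")}     # "108.1 million shares was a record."
hyp_no = {("accelerated", "dobj", "record")}  # "Final-hour trading accelerated a record."

print(entailed(text, hyp_yes))  # True  -> YES
print(entailed(text, hyp_no))   # False -> NO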
The proposed task is desirable
for several reasons. First, textual entailments focus on the semantically
meaningful parser decisions. Trivial differences are abstracted away,
which should result in a more accurate assessment of parser performance
on real-world applications. Second, no formal training is required.
Annotation will be easier and annotation errors will have a less
detrimental effect on evaluation accuracy. Finally, entailments will be
non-trivial since they will be collected by considering the differences
between the outputs of different state-of-the-art parsers.
The participants will be provided with development (trial) and test sets
of entailments and they will be evaluated using the standard tools and
methodology of the RTE challenges. We hope the task will be interesting
for participants with Parsing, Semantic Role Labeling, or RTE
backgrounds.
Organizers: Deniz Yuret (Koc University)
Web Site: http://pete.yuret.com/
Timeline:
· There will be no training data. The test data will be available on March 26
· Closing competition: April 2nd
#13 TempEval 2
Description
Evaluating Events, Time Expressions, and Temporal Relations
Newspaper texts, narratives and
other texts describe events occurring in time, explicitly and implicitly
specifying the temporal location and order of these events. Text
comprehension requires the capability to identify the events described in
a text and to locate them in time.
We provide three tasks that are
relevant to understanding the temporal structure of a text: (i)
identification of events, (ii) identification of time expressions and
(iii) identification of temporal relations. The temporal relations task
is further structured into four subtasks, requiring systems to recognize which of a fixed set of temporal relations holds between (a) events and time expressions within the same sentence, (b) events and the document creation time, (c) main events in consecutive sentences, and (d) two events where one syntactically dominates the other.
Data sets will be provided for
five languages: English, Italian, Spanish, Chinese and Korean. The data
sets do not comprise a parallel corpus and sizes may range from 25K to
150K tokens. The annotation scheme used is based on TimeML. TimeML (http://www.timeml.org) has been
developed over the last decade as a general multilingual markup language
for temporal information in texts and is currently being vetted as an ISO standard.
Participants can choose any
combination of the three main tasks and the five languages.
TempEval-2 is a follow-up to TempEval-1, which was an initial evaluation exercise based on three
limited temporal relation tasks. See http://www.timeml.org/tempeval-2/ for
more information.
Organizers: James Pustejovsky, Marc Verhagen, Nianwen Xue
(Brandeis University)
Web Site: http://www.timeml.org/tempeval2/
Timeline:
· March 12th, first batch of training data
· March 21st, second batch of training data
· March 28th, evaluation data
· April 2nd, close of TempEval competition
#14 Word Sense Induction
Description
This task is a continuation of
the WSI task (i.e. Task 2) of SemEval 2007 (nlp.cs.swarthmore.edu/semeval/tasks/task02/summary.shtml)
with
some significant changes to the evaluation setting.
Word Sense Induction (WSI) is defined as the process of identifying the
different senses (or uses) of a target word in a given text in an
automatic and fully-unsupervised manner. The goal of this task is to
allow comparison of unsupervised sense induction and disambiguation
systems. A secondary outcome of this task will be to provide a comparison
with current supervised and knowledge-based methods for sense disambiguation.
The evaluation scheme consists of the following assessment methodologies:
· Unsupervised Evaluation. The induced senses are evaluated as clusters of examples and compared to sets of examples which have been tagged with gold standard (GS) senses. The evaluation metric used, V-measure (Rosenberg & Hirschberg, 2007), attempts to measure both the homogeneity and the completeness of a clustering solution: perfect homogeneity is achieved if every cluster of the clustering solution contains only data points that are elements of a single GS class, while perfect completeness is achieved if all the data points that are members of a given class are also elements of the same cluster. Homogeneity and completeness can be treated in a similar fashion to precision and recall, where increasing the former often results in decreasing the latter (Rosenberg & Hirschberg, 2007). A minimal computational sketch of the measure is given after this list.
· Supervised Evaluation. The second evaluation setting assesses WSI systems in a WSD task. A mapping is created between induced sense clusters (from the unsupervised evaluation described above) and the actual GS senses. The mapping matrix is then used to tag each instance in the testing corpus with GS senses. The usual recall/precision measures for WSD are then applied. Supervised evaluation was part of the SemEval-2007 WSI task (Agirre & Soroa, 2007).
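The following is a minimal Python sketch of the V-measure over (gold sense, induced cluster) pairs, following the definitions in Rosenberg & Hirschberg (2007); the toy labels are illustrative only.

# Minimal sketch of the V-measure for a word sense induction solution:
# input is a list of (gold_sense, induced_cluster) pairs, one per instance.

from collections import Counter
from math import log

def entropy(counts, total):
    return -sum(c / total * log(c / total) for c in counts.values() if c)

def v_measure(pairs, beta=1.0):
    n = len(pairs)
    classes = Counter(g for g, _ in pairs)
    clusters = Counter(k for _, k in pairs)
    joint = Counter(pairs)

    h_c = entropy(classes, n)
    h_k = entropy(clusters, n)
    # Conditional entropies H(class | cluster) and H(cluster | class).
    h_c_given_k = -sum(c / n * log((c / n) / (clusters[k] / n))
                       for (g, k), c in joint.items())
    h_k_given_c = -sum(c / n * log((c / n) / (classes[g] / n))
                       for (g, k), c in joint.items())

    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return ((1 + beta) * homogeneity * completeness
            / (beta * homogeneity + completeness))

# Toy example: two gold senses of a target word, three induced clusters.
pairs = [("s1", "c1"), ("s1", "c1"), ("s1", "c2"),
         ("s2", "c3"), ("s2", "c3"), ("s2", "c2")]
print(v_measure(pairs))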
References
Andrew Rosenberg and Julia
Hirschberg. V-Measure: A Conditional Entropy-Based External Cluster
Evaluation Measure. Proceedings of the 2007 Joint Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language
Learning (EMNLP-CoNLL) Prague, Czech Republic, (June 2007). ACL.
Eneko Agirre and Aitor Soroa.
Semeval-2007 task 02: Evaluating word sense induction and discrimination
systems. In Proceedings of the Fourth International Workshop on Semantic
Evaluations, pp. 7-12, Prague, Czech Republic, (June 2007). ACL.
Organizers: Suresh Manandhar (University of York), Ioannis
Klapaftis (University of York), and Dmitriy Dligach (University of Colorado)
Web Site: http://www.cs.york.ac.uk/semeval2010_WSI/
#15 Infrequent Sense Identification for Mandarin Text to Speech Systems
Description
There are seven cases of grapheme-to-phoneme (GTP) conversion in a text-to-speech (TTS) system (Yarowsky, 1997). Among them, the most difficult is disambiguating homographs, which have the same POS (part of speech) but different pronunciations. In this case, the different pronunciations of a word always correspond to different word senses. Once the word senses are disambiguated, the GTP problem is resolved.
This task differs slightly from traditional WSD (word sense disambiguation): here two or more senses may correspond to one pronunciation, i.e. the sense granularity is coarser than in WSD. For example, the preposition “为” has three senses: sense1 and sense2 have the same pronunciation {wei 4}, while sense3 corresponds to {wei 2}. In this task, for the target word, not only the pronunciations but also the sense labels are provided for training; at test time, only the pronunciations are evaluated. The challenge of this task is the highly skewed distribution in real text: the most frequent pronunciation usually accounts for over 80% of instances.
In this task, we will provide a large volume of training data (each homograph word has at least 300 instances) in accordance with the true distribution in real text. In the test data, we will provide at least 100 instances for each target word. In order to focus on the performance of identifying the infrequent sense, we will intentionally split the test dataset half and half between infrequent-pronunciation and frequent-pronunciation instances. The evaluation follows the usual precision vs. recall methodology.
All instances come from the People's Daily newspaper (the most popular newspaper in Mandarin). Double-blind annotation is carried out manually, and a third annotator checks the annotation.
References:
Yarowsky, David (1997). “Homograph disambiguation in text-to-speech
synthesis.” In van Santen, Jan T. H.; Sproat, Richard; Olive, Joseph P.;
and Hirschberg, Julia. Progress in Speech Synthesis. Springer-Verlag, New
York, 157-172.
Organizers: Peng Jin, Yunfang Wu and Shiwen Yu (Peking
University Beijing, China)
Web Site:
Timeline:
· Test data release: March 25, 2010
· Result submission deadline: March 29, 2010
· Organizers send the test results: April 2, 2010
#16 Japanese WSD
Description
This task can be considered an extension of the SENSEVAL-2 Japanese lexical sample monolingual dictionary-based task. Word senses are defined according to the Iwanami Kokugo Jiten, a Japanese dictionary published by Iwanami Shoten. Please refer to that task for details. We think that our task has the following two new characteristics:
1) All previous Japanese sense-tagged corpora were drawn from newspaper articles, whereas English sense-tagged corpora have been constructed on balanced corpora such as the Brown Corpus and the BNC. The first balanced corpus of contemporary written Japanese (the BCCWJ corpus) is now being constructed as part of a national project in Japan [Maekawa, 2008], and we are now constructing a sense-tagged corpus on it. Therefore, the task will use the first balanced Japanese sense-tagged corpus.
2) In previous WSD tasks, systems have
been required to select a sense from a given set of senses in a
dictionary for a word in one context (an instance). However, the set of
senses in the dictionary is not always complete. New word senses
sometimes appear after the dictionary has been compiled. Therefore, some
instances might have a sense that cannot be found in a set in the
dictionary. The task will take into account not only the instances having
a sense in the given set but also the instances having a sense that
cannot be found in the set. In the latter case, systems should output
that the instances have a sense that is not in the set.
Organizers: Manabu Okumura (Tokyo Institute of Technology),
Kiyoaki Shirai (Japan Advanced Institute of Science and Technology)
Web Site: http://lr-www.pi.titech.ac.jp/wsd.html
#17 All-words Word Sense Disambiguation on a Specific Domain (WSD-domain)
Description
Domain adaptation is a hot issue in Natural Language Processing, including Word Sense Disambiguation. WSD systems trained on general corpora are known to perform worse when moved to specific domains. The WSD-domain task will offer a testbed for domain-specific WSD systems and will make it possible to study domain portability issues.
Texts from the ECNC and the WWF will be used to build domain-specific test corpora (see the example below). The data will be available in a number of languages: English, Dutch and Italian, and possibly Basque and Chinese (confirmation pending). The sense inventories will be based on wordnets of the respective languages.
The test data will comprise three documents (a 6,000-word chunk with approx. 2,000 target words) for each language. The test data will be annotated by hand using double-blind
annotation plus adjudication. Inter-Tagger Agreement will be measured.
There will not be training data available, but participants are free to
use existing hand-tagged corpora and lexical resources. Traditional
precision and recall measures will be used in order to evaluate the
participant systems, as implemented in past WSD Senseval and SemEval
tasks.
WSD-domain is being developed
in the framework of the Kyoto project (http://www.kyoto-project.eu/).
Environment domain text example:
"Projections for 2100 suggest that temperature in Europe will have
risen by between 2 to 6.3 °C above 1990 levels. The sea level is
projected to rise, and a greater frequency and intensity of extreme
weather events are expected. Even if emissions of greenhouse gases stop
today, these changes would continue for many decades and in the case of
sea level for centuries. This is due to the historical build up of the
gases in the atmosphere and time lags in the response of climatic and
oceanic systems to changes in the atmospheric concentration of the
gases."
Organizers: Eneko Agirre and Oier Lopez de Lacalle (Basque
Country University)
Web Site: http://xmlgroup.iit.cnr.it/SemEval2010/
Timeline:
· Test data release: March 26
· Closing competition: April 2
#18 Disambiguating Sentiment Ambiguous Adjectives
Description
Some adjectives are neutral in sentiment polarity out of context, but show positive, neutral or negative meaning within a specific context. Such words can be called dynamic sentiment ambiguous adjectives. For instance, “价格高|the price is high” indicates negative meaning, while “质量高|the quality is high” has a positive connotation. Disambiguating sentiment ambiguous adjectives is an interesting task at the intersection of word sense disambiguation and sentiment analysis. However, in previous work, sentiment ambiguous words have not been tackled in the field of WSD, and they have largely been discarded in research on sentiment analysis.
This task aims to create a benchmark dataset for disambiguating dynamic sentiment ambiguous adjectives. Sentiment ambiguous words are pervasive in many languages. In this task we concentrate on Chinese, but we expect the disambiguation techniques to be language-independent. In total, 14 dynamic sentiment ambiguous adjectives are selected, all of which are high-frequency words in Mandarin Chinese. They are: 大|big, 小|small,
多|many, 少|few,
高|high, 低|low,
厚|thick, 薄|thin,
深|deep, 浅|shallow,
重|heavy, 轻|light, 巨大|huge, 重大|grave.
The dataset contains two parts. Some sentences containing the target adjectives will be extracted from Chinese Gigaword (LDC corpus: LDC2005T14), and the others will be gathered through search engines such as Google. First, these sentences will be automatically segmented and POS-tagged. Then the ambiguous adjectives will be manually annotated with the correct sentiment polarity within the sentence context. Two human annotators will annotate the sentences independently (double-blind), and a third annotator will check the annotation.
This task will be carried out in an unsupervised setting, and consequently no training data will be provided. All the data, about 4,000 sentences, will be provided as the test set. Evaluation will be performed in terms of the usual precision, recall and F1 scores.
Organizers: Yunfang Wu, Peng Jin, Miaomiao Wen and Shiwen Yu
(Peking University, Beijing, China)
Web Site:
Timeline:
· Test data release: March 23, 2010
· Result submission deadline: postponed to March 27, 2010 (4 days after downloading the test data)
· Organizers send the test results: April 2, 2010