Evaluation Report

This report is for SemEval-2010 Task #11.
Task Name: Event detection in Chinese news sentences
Evaluation measures:

For the WSD subtask, we give two evaluation measures for the target verbs in the sentences: WSD-Micro-Accuracy and WSD-Macro-Accuracy. The formulas are as follows:

- WSD-Micro-Accuracy = Number of correctly-analyzed target verbs / Number of all target verbs * 100%
- WSD-Macro-Accuracy = Σ (Micro-Accuracy_i * w_i), where w_i = frequency of target verb i in test set / total target verb frequency in test set
The correct results should match the following condition: the selected situation description formula and natural explanation text of the target verb must be the same as the gold-standard codes. We evaluated 27 multiple-sense target verbs in the test set.
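The two accuracy measures above can be sketched in a few lines of Python. This is only an illustration of the formulas; the verb names and per-verb counts below are hypothetical, not taken from the task data.

```python
def micro_accuracy(correct, total):
    """Micro accuracy for one target verb: correct analyses / all instances."""
    return correct / total

def macro_accuracy(per_verb):
    """Frequency-weighted average of per-verb micro accuracies.

    per_verb maps each target verb to (correct, total); the weight w_i is
    that verb's frequency in the test set over the total verb frequency.
    """
    total_freq = sum(total for _, total in per_verb.values())
    return sum(micro_accuracy(correct, total) * (total / total_freq)
               for correct, total in per_verb.values())

# Hypothetical counts for three multiple-sense target verbs.
counts = {"verb_a": (45, 50), "verb_b": (27, 30), "verb_c": (12, 20)}
print(f"Macro accuracy: {macro_accuracy(counts) * 100:.2f}%")
```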
For the SRL subtask, we give three evaluation measures: Chunk-Precision, Chunk-Recall, and Chunk-F-measure. The formulas are as follows:

- Chunk-Precision = Number of correctly-analyzed chunks / Number of all recognized chunks * 100%
- Chunk-Recall = Number of correctly-analyzed chunks / Number of gold-standard chunks * 100%
- Chunk-F-measure = (2 * Chunk-P * Chunk-R) / (Chunk-P + Chunk-R)
The correct results should match all the following conditions:

- The recognized chunks should have the same boundaries as the gold-standard argument chunks of the key verbs or verb phrases.
- The recognized chunks should have the same syntactic constituent and functional tags as the gold-standard ones.
- The recognized chunks should have the same situation argument tags as the gold-standard ones.

We only select the key argument chunks (with semantic tags: x, y, z, L, or O) for evaluation.
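The three chunk measures can be computed from three counts: correctly-analyzed chunks, recognized chunks, and gold-standard chunks. A minimal sketch, with hypothetical counts:

```python
def chunk_scores(correct, recognized, gold):
    """Chunk-level precision, recall, and F-measure (all in percent).

    correct: recognized chunks matching a gold chunk's boundaries and tags;
    recognized: chunks proposed by the system;
    gold: gold-standard argument chunks.
    """
    precision = correct / recognized * 100
    recall = correct / gold * 100
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 140 correct out of 180 recognized, 200 gold chunks.
p, r, f = chunk_scores(140, 180, 200)
print(f"Chunk-P={p:.2f}  Chunk-R={r:.2f}  Chunk-F={f:.2f}")
```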
For the event detection task, we give two evaluation measures: Event-Micro-Accuracy and Event-Macro-Accuracy. The formulas are as follows:

- Event-Micro-Accuracy = Number of correctly-analyzed events for a target verb / Number of all events for that target verb in the test set * 100%
- Event-Macro-Accuracy = Σ (Micro-Accuracy_i * w_i), where w_i = frequency of target verb i in test set / total target verb frequency in test set
The correct results should match all the following conditions:

- The event situation description formula and natural explanation text of the target verb should be the same as the gold-standard ones.
- All the argument chunks of the event descriptions should be the same as the gold-standard ones.
- The number of recognized argument chunks should be the same as the gold-standard one.
We received 7 uploaded results for evaluation. The following table shows the evaluation results, ranked by Event-Macro-Accuracy.
System ID | WSD-Micro-A | WSD-Macro-A | Chunk-P | Chunk-R | Chunk-F | Event-Micro-A | Event-Macro-A | Rank
480a      | 89.59       | 87.54       | 80.91   | 77.91   | 79.38   | 53.76         | 52.12         | 1
480b      | 89.18       | 87.24       | 80.91   | 76.95   | 78.88   | 52.05         | 50.59         | 2
109       | 70.64       | 73.00       | 63.50   | 57.39   | 60.29   | 23.05         | 22.85         | 3
347       | 83.81       | 81.30       | 58.33   | 53.32   | 55.71   | 20.19         | 20.33         | 4
348       | 82.18       | 79.23       | 58.33   | 53.32   | 55.71   | 20.23         | 20.05         | 5
350       | 81.42       | 77.74       | 58.33   | 53.32   | 55.71   | 20.22         | 20.05         | 6
349       | 82.58       | 79.82       | 58.33   | 53.32   | 55.71   | 20.14         | 20.05         | 7