Evaluation Report

 

This report is for SemEval-2010 Task #11.

 

Task Name: Event detection in Chinese news sentences

Evaluation measures:

For the WSD subtask, we provide two evaluation measures for the target verbs in the sentences: WSD-Micro-Accuracy and WSD-Macro-Accuracy. The formulas are as follows:

- WSD-Micro-Accuracy = Number of correctly-analyzed target verbs / Number of all target verbs * 100%

- WSD-Macro-Accuracy = Σ Micro-Accuracy_i * w_i, where w_i = frequency of target verb i in the test set / total target verb frequency in the test set

A result is counted as correct when the selected situation description formula and natural explanation text of the target verb are the same as those in the gold standard.

We evaluated 27 multiple-sense target verbs in the test set.
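The two accuracy measures above can be sketched as follows; the per-verb counts are hypothetical, and the weight w_i follows the definition given above.

```python
# Hypothetical per-verb counts: verb -> (correctly-analyzed occurrences, total occurrences in test set)
counts = {
    "verb_a": (45, 50),
    "verb_b": (18, 30),
    "verb_c": (16, 20),
}

def wsd_micro_accuracy(counts):
    # Pool all target-verb instances and score them together
    correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return correct / total * 100

def wsd_macro_accuracy(counts):
    # Per-verb accuracy weighted by w_i = verb frequency / total target verb frequency
    total = sum(n for _, n in counts.values())
    return sum((c / n) * (n / total) for c, n in counts.values()) * 100
```

Note that with frequency-proportional weights the weighted sum arithmetically coincides with the micro value; an unweighted mean over the 27 verbs, by contrast, would treat rare and frequent verbs equally.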

 

For the SRL subtask, we provide three evaluation measures: Chunk-Precision, Chunk-Recall, and Chunk-F-measure. The formulas are as follows:

- Chunk-Precision = Number of correctly-analyzed chunks / Number of all recognized chunks * 100%

- Chunk-Recall = Number of correctly-analyzed chunks / Number of gold-standard chunks * 100%

- Chunk-F-measure = (2 * Chunk-P * Chunk-R) / (Chunk-P + Chunk-R)

A recognized chunk is counted as correct when it matches all the following conditions:

- It has the same boundaries as a gold-standard argument chunk of the key verb or verb phrase.

- It has the same syntactic constituent and functional tags as the gold-standard one.

- It has the same situation argument tag as the gold-standard one.

We only select the key argument chunks (with semantic tags x, y, z, L, or O) for evaluation.
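Since a chunk is correct only when its boundaries and all its tags match a gold chunk exactly, the three chunk measures can be computed by exact set intersection. A minimal sketch, with hypothetical chunk tuples:

```python
# A chunk here is (start, end, constituent_tag, functional_tag, situation_arg_tag);
# all of these values are hypothetical.
gold = {(0, 2, "NP", "SBJ", "x"), (3, 5, "NP", "OBJ", "y"), (6, 8, "PP", "LOC", "L")}
predicted = {(0, 2, "NP", "SBJ", "x"), (3, 5, "NP", "OBJ", "z"), (6, 8, "PP", "LOC", "L")}

def chunk_prf(predicted, gold):
    # Exact match on boundaries and all three tags counts as correct
    correct = len(predicted & gold)
    p = correct / len(predicted) * 100
    r = correct / len(gold) * 100
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = chunk_prf(predicted, gold)
```

Here the second predicted chunk carries the wrong situation argument tag, so only two of three chunks count as correct.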

 

For the event detection task, we provide two evaluation measures: Event-Micro-Accuracy and Event-Macro-Accuracy. The formulas are as follows:

- Event-Micro-Accuracy = Number of correctly-analyzed events for a target verb / Number of all events for that target verb in the test set * 100%

- Event-Macro-Accuracy = Σ Micro-Accuracy_i * w_i, where w_i = frequency of target verb i in the test set / total target verb frequency in the test set

An event is counted as correct when it matches all the following conditions:

- The event situation description formula and natural explanation text of the target verb are the same as the gold-standard ones.

- All the argument chunks of the event description are the same as the gold-standard ones.

- The number of recognized argument chunks is the same as in the gold standard.
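The three event-matching conditions above can be checked with a sketch like the following; the `Event` fields are assumptions about how a system's output might be represented, not the task's actual data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    # Hypothetical representation of one event description
    formula: str                      # situation description formula
    explanation: str                  # natural explanation text
    chunks: frozenset = frozenset()   # argument chunks of the event description

def event_correct(predicted, gold):
    # All three conditions: same formula and explanation text,
    # same argument chunks, and the same number of chunks
    return (predicted.formula == gold.formula
            and predicted.explanation == gold.explanation
            and predicted.chunks == gold.chunks
            and len(predicted.chunks) == len(gold.chunks))
```

The chunk-count condition is implied by chunk-set equality here, but it is kept as a separate check to mirror the three conditions as stated.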

 

We received 7 uploaded results for evaluation. The evaluation results are listed in the following table, ranked by Event-Macro-Accuracy.

 

System ID | WSD-Micro-A | WSD-Macro-A | Chunk-P | Chunk-R | Chunk-F | Event-Micro-A | Event-Macro-A | Rank
480a      | 89.59       | 87.54       | 80.91   | 77.91   | 79.38   | 53.76         | 52.12         | 1
480b      | 89.18       | 87.24       | 80.91   | 76.95   | 78.88   | 52.05         | 50.59         | 2
109       | 70.64       | 73.00       | 63.50   | 57.39   | 60.29   | 23.05         | 22.85         | 3
347       | 83.81       | 81.30       | 58.33   | 53.32   | 55.71   | 20.19         | 20.33         | 4
348       | 82.18       | 79.23       | 58.33   | 53.32   | 55.71   | 20.23         | 20.05         | 5
350       | 81.42       | 77.74       | 58.33   | 53.32   | 55.71   | 20.22         | 20.05         | 6
349       | 82.58       | 79.82       | 58.33   | 53.32   | 55.71   | 20.14         | 20.05         | 7