Task #2: Cross-lingual Lexical Substitution

The tables below report two metrics, best and oot. Each table gives recall (R) and precision (P) together with their mode variants (Mode R, Mode P), and systems are ranked by recall. The metrics and the mode variations are described in our documentation. The oot tables contain an additional column (dups) showing the number of duplicates used by each participant.

The rank order of systems changes depending on which measure is used. Note also that the system responses have not yet been analyzed for the relative strengths and weaknesses of the different systems. For example, IRST-1 and IRSTbs score considerably higher on precision than on recall because they did not attempt all test items.
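
The gap between precision and recall can be turned into a rough coverage estimate. The sketch below assumes that precision and recall share the same summed credit and differ only in the denominator (items attempted versus all test items), so that R / P approximates the fraction of items a system attempted. The function name `coverage` is illustrative and is not part of the task scorer.

```python
# Rough coverage estimate from the published scores.
# Assumption (not stated on this page): P and R share a numerator;
# P divides by the items attempted, R by all test items.

def coverage(recall: float, precision: float) -> float:
    """Return the estimated fraction of test items a system attempted."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    return recall / precision

# (R, P) pairs copied from the BEST table.
best_scores = {
    "IRST-1": (15.38, 22.16),
    "IRSTbs": (13.21, 22.51),
    "UBA-T": (27.15, 27.15),
}

estimates = {name: round(coverage(r, p), 3) for name, (r, p) in best_scores.items()}

for name, fraction in estimates.items():
    print(f"{name}: about {fraction:.1%} of items attempted")
```

Under that assumption, IRST-1 attempted roughly 69% of the items and IRSTbs roughly 59%, whereas a system whose R equals its P attempted every item.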

As another example, UBA-W has the highest mode scores in oot (83.54) even though it ranks fifth on recall.

BEST

Systems	R	P	Mode R	Mode P
UBA-T	27.15	27.15	57.20	57.20
USPWLV	26.81	26.81	58.85	58.85
ColSlm	25.99	27.59	56.24	59.16
WLVUSP	25.27	25.27	52.81	52.81
SWAT-E	21.46	21.46	43.21	43.21
UvT-v	21.09	21.09	43.76	43.76
CU-SMT	20.56	21.62	44.58	45.01
UBA-W	19.68	19.68	39.09	39.09
UvT-g	19.59	19.59	41.02	41.02
SWAT-S	18.87	18.87	36.63	36.63
ColEur	18.15	19.47	37.72	40.03
IRST-1	15.38	22.16	33.47	45.95
IRSTbs	13.21	22.51	28.26	45.27
TYO	8.39	8.62	14.95	15.31

BEST baselines

Systems	R	P	Mode R	Mode P
DICT	24.34	24.34	50.34	50.34
DICTCORP	15.09	15.09	29.22	29.22

OOT

Systems	R	P	Mode R	Mode P	dups
SWAT-E	174.59	174.59	66.94	66.94	968
SWAT-S	97.98	97.98	79.01	79.01	872
UvT-v	58.91	58.91	62.96	62.96	345
UvT-g	55.29	55.29	73.94	73.94	146
UBA-W	52.75	52.75	83.54	83.54	-
WLVUSP	48.48	48.48	77.91	77.91	64
UBA-T	47.99	47.99	81.07	81.07	-
USPWLV	47.60	47.60	79.84	79.84	30
ColSlm	43.91	46.61	65.98	69.41	509
ColEur	41.72	44.77	67.35	71.47	125
TYO	34.54	35.46	58.02	59.16	-
IRST-1	31.48	33.14	55.42	58.30	-
FCC-LS	23.90	23.90	31.96	31.96	308
IRSTbs	8.33	29.74	19.89	64.44	-

OOT baselines

Systems	R	P	Mode R	Mode P	dups
DICT	44.04	44.04	73.53	73.53	30
DICTCORP	42.65	42.65	71.60	71.60	-
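
As a sketch of how the rank order depends on the measure, the snippet below parses a few rows copied from the OOT table (tab-separated, as on this page) and sorts them by a chosen column. The helper names `parse_rows` and `rank_by` are illustrative only and are not part of the task scorer.

```python
import csv
import io

# A few rows copied from the OOT table above (tab-separated).
OOT_SAMPLE = (
    "Systems\tR\tP\tMode R\tMode P\tdups\n"
    "SWAT-E\t174.59\t174.59\t66.94\t66.94\t968\n"
    "SWAT-S\t97.98\t97.98\t79.01\t79.01\t872\n"
    "UBA-W\t52.75\t52.75\t83.54\t83.54\t-\n"
    "UBA-T\t47.99\t47.99\t81.07\t81.07\t-\n"
    "IRSTbs\t8.33\t29.74\t19.89\t64.44\t-\n"
)

def parse_rows(text):
    """Parse a tab-separated table into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def rank_by(rows, column):
    """Return system names sorted by the given score column, best first."""
    ordered = sorted(rows, key=lambda row: float(row[column]), reverse=True)
    return [row["Systems"] for row in ordered]

rows = parse_rows(OOT_SAMPLE)
by_recall = rank_by(rows, "R")
by_mode_recall = rank_by(rows, "Mode R")

print("Ranked by R:     ", by_recall)
print("Ranked by Mode R:", by_mode_recall)
```

For this sample, sorting by R puts SWAT-E first, while sorting by Mode R puts UBA-W first, so the two orderings differ.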