RuDSI is a new benchmark for word sense induction (WSI) in Russian, created through manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs), with no external word senses imposed on annotators.
We present RuDSI, a new benchmark for word sense induction (WSI) in Russian. The dataset was created using manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs). RuDSI is completely data-driven (based on texts from the Russian National Corpus), with no external word senses imposed on annotators. We analyze the dataset, describe our annotation workflow, show how graph clustering parameters affect the dataset, report the performance of several baseline WSI methods on RuDSI, and discuss possibilities for improving these scores.
Elisey Rykov