This work proposes a novel sentential paraphrase acquisition method that uses multiple machine translation systems to generate positive candidates and a monolingual corpus to extract negative candidates, and uses it to build and release the first evaluation corpus for Japanese paraphrase identification, which comprises 655 sentence pairs.
We propose a novel sentential paraphrase acquisition method. To build a well-balanced corpus for paraphrase identification, we focus in particular on acquiring both non-trivial positive and non-trivial negative instances. We use multiple machine translation systems to generate positive candidates and a monolingual corpus to extract negative candidates. To collect non-trivial instances, the candidates are sampled uniformly with respect to word overlap rate. Finally, annotators judge whether each candidate is positive or negative. Using this method, we built and released the first evaluation corpus for Japanese paraphrase identification, which comprises 655 sentence pairs.
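The sampling step can be illustrated with a minimal sketch. The abstract does not define the word overlap rate or the sampling procedure, so everything here is an assumption for illustration: the overlap rate is taken to be the Jaccard coefficient over word types, sentences are assumed to be already tokenized into word lists, and the function names and the parameters `n_bins`, `per_bin`, and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def word_overlap_rate(tokens_a, tokens_b):
    """Jaccard coefficient over word types (an assumed definition)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def sample_uniform_by_overlap(pairs, n_bins=10, per_bin=2, seed=0):
    """Bucket candidate pairs by overlap rate and draw up to per_bin from each bucket.

    pairs: iterable of (tokens_a, tokens_b), each a list of word strings.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for tokens_a, tokens_b in pairs:
        rate = word_overlap_rate(tokens_a, tokens_b)
        # A rate of exactly 1.0 is clamped into the last bin.
        idx = min(int(rate * n_bins), n_bins - 1)
        bins[idx].append((tokens_a, tokens_b))

    sampled = []
    for idx in range(n_bins):
        bucket = bins.get(idx, [])
        # Take at most per_bin pairs from every overlap range.
        sampled.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return sampled
```

Drawing the same number of pairs from every overlap range keeps high-overlap pairs that may not be paraphrases and low-overlap pairs that may be, which is one plausible way to obtain the non-trivial instances described above before they are passed to annotators.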