Abstracts of the 77th National Convention of the Information Processing Society of Japan (IPSJ)

5N-04

Content Reuse Detection in Text Documents

○王　　沛，肖　　川，石川佳治 (Nagoya University)

Text document collections typically contain reused information: events or facts may be restated, with modifications, by various sources. Identifying text reuse helps to find the original sources of facts and to track information flow, and has thus become an important task in text analysis. In this paper, we study the problem of content reuse detection in text documents. Existing methods are usually sensitive to modifications such as paraphrasing, and in such cases they miss many meaningful results. We propose a new method that tolerates a considerable number of differences in reused content. A prefix-filtering-based algorithm is devised for efficient reuse detection. Experimental evaluation on real datasets demonstrates that our method outperforms alternative solutions in terms of both effectiveness and efficiency.
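For readers unfamiliar with prefix filtering, the following is a minimal sketch of the generic technique as it is commonly used in set-similarity joins. It is not the algorithm proposed in this paper, whose details the abstract does not give. The sketch assumes that each text fragment is represented as a set of word tokens and that reuse is scored by Jaccard similarity; the function names `jaccard` and `prefix_filter_join` and the sample sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix_filter_join(records, threshold):
    """Find pairs (i, j, sim) with i < j and Jaccard similarity >= threshold.

    Assumption: each record is a list of word tokens. This is textbook
    prefix filtering, not the method of the paper.
    """
    sets = [set(r) for r in records]
    # Global token order: rarest tokens first, so prefixes are selective.
    freq = Counter(tok for s in sets for tok in s)
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)  # token -> ids of records with it in their prefix
    results = []
    for i, toks in enumerate(ordered):
        if not toks:
            continue
        # Two sets with Jaccard >= threshold must share a token within
        # their first |x| - ceil(threshold * |x|) + 1 ordered tokens.
        min_overlap = math.ceil(threshold * len(toks) - 1e-9)
        prefix_len = len(toks) - min_overlap + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)
        # Verify each surviving candidate with the exact similarity.
        for j in sorted(candidates):
            sim = jaccard(sets[i], sets[j])
            if sim >= threshold:
                results.append((j, i, sim))
    return results


if __name__ == "__main__":
    docs = [
        "the quake struck the coast on monday".split(),
        "the quake hit the coast on monday".split(),
        "stocks rallied after the earnings report".split(),
    ]
    for i, j, sim in prefix_filter_join(docs, 0.5):
        print(i, j, round(sim, 3))
```

Only records that share a token within their prefixes are ever compared exactly, which is where the efficiency gain comes from. A plain Jaccard measure over word sets remains sensitive to paraphrasing, which is the weakness of existing methods noted above.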
